From aragorn168b at gmail.com Thu Aug 1 09:27:17 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 1 Aug 2013 09:27:17 +0200 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Steve, Yes, exactly. If you didn't have to subset the data.table as in your example, the equivalent operation would be to set the key of DT1 to NULL and then doing `unique` and storing it in DT2 and then setting the key back to "A" on DT1. And it'd be nice to be able to do: `unique(DT1, usekey=FALSE)` or something like that so that we don't have to NULL and set the key of DT1. Arun On Wednesday, July 31, 2013 at 8:02 PM, Steve Lianoglou wrote: > Hi all, > > On Wed, Jul 31, 2013 at 9:09 AM, Arunkumar Srinivasan > wrote: > > Ricardo, > > > > You read my mind.. :) I was thinking of the same as well.. Whether the > > community agrees or not would be interesting as well. It could save trouble > > with "alloc.col" manually. > > > > > It's easy enough to add -- just to be sure, the behavior required from > the OP would be equivalent to calling unique on a data.table that has > no key, right? For example, instead of this: > > R> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0] > R> setkey(DT1,A) > R> DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE]) > R> DT2[,gah:=1] # warning: I should have made a copy, apparently > > You could just do: > > R> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0] > R> DT2 <- unique(DT1[, -which(names(DT1)%in%'B'), with=FALSE]) > R> DT2[,gah:=1] > > Right? > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Thu Aug 1 18:58:07 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Thu, 1 Aug 2013 09:58:07 -0700 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Hi, On Thu, Aug 1, 2013 at 12:27 AM, Arunkumar Srinivasan wrote: > Steve, > > Yes, exactly. If you dint have to subset the data.table as in your example, Not sure what subsetting (or not) has to do with it, but ... > the equivalent operation would be to set the key of DT1 to NULL and then > doing `unique` and storing it in DT2 and then setting the key back to "A" on > DT1. > > And it'd be nice to be able to do: `unique(DT1, usekey=FALSE)` or something > like that so that we don't have to NULL and set the key of DT1. Ask and you shall receive :-) I added a `use.key=TRUE` parameter to unique.data.table and duplicated.data.table which is in SVN revision 888. This runs the relevant functions on the data.table as if it were not keyed at all. R> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0] R> setkey(DT1,A) R> DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE]) R> dt2 <- unique(DT1[,-which(names(DT1) %in% 'B'),with=FALSE], use.key=FALSE) R> all.equal(DT2, dt2, check.attributes=FALSE) [1] TRUE The all.equal test will fail when check.attributes is TRUE because dt2 is still keyed by 'A'. 
R> key(DT1) [1] "A" R> key(DT2) NULL R> key(dt2) [1] "A" Hope that covers it, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From FErickson at psu.edu Thu Aug 1 19:30:16 2013 From: FErickson at psu.edu (Frank Erickson) Date: Thu, 1 Aug 2013 12:30:16 -0500 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Great! Thanks, Steve. Now I can stop using unique.data.frame altogether (as Arun suggested initially). --Frank On Thu, Aug 1, 2013 at 11:58 AM, Steve Lianoglou wrote: > Hi, > > On Thu, Aug 1, 2013 at 12:27 AM, Arunkumar Srinivasan > wrote: > > Steve, > > > > Yes, exactly. If you dint have to subset the data.table as in your > example, > > Not sure what subsetting (or not) has to do with it, but ... > > > the equivalent operation would be to set the key of DT1 to NULL and then > > doing `unique` and storing it in DT2 and then setting the key back to > "A" on > > DT1. > > > > And it'd be nice to be able to do: `unique(DT1, usekey=FALSE)` or > something > > like that so that we don't have to NULL and set the key of DT1. > > Ask and you shall receive :-) > > I added a `use.key=TRUE` parameter to unique.data.table and > duplicated.data.table which is in SVN revision 888. This runs the > relevant functions on the data.table as if it were not keyed at all. > > R> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0] > R> setkey(DT1,A) > R> DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE]) > R> dt2 <- unique(DT1[,-which(names(DT1) %in% 'B'),with=FALSE], > use.key=FALSE) > R> all.equal(DT2, dt2, check.attributes=FALSE) > [1] TRUE > > The all.equal test will fail when check.attributers is TRUE because > dt2 is still keyed by 'A'. > > R> key(DT1) > [1] "A" > > R> key(DT2) > NULL > > R> key(dt2) > [1] "A" > > Hope that covers it, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Aug 1 22:12:54 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 1 Aug 2013 16:12:54 -0400 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Amazing!! Thanks Steve, that's gonna come very much in handy -Rick On Thu, Aug 1, 2013 at 12:58 PM, Steve Lianoglou wrote: > Hi, > > On Thu, Aug 1, 2013 at 12:27 AM, Arunkumar Srinivasan > wrote: > > Steve, > > > > Yes, exactly. If you dint have to subset the data.table as in your > example, > > Not sure what subsetting (or not) has to do with it, but ... > > > the equivalent operation would be to set the key of DT1 to NULL and then > > doing `unique` and storing it in DT2 and then setting the key back to > "A" on > > DT1. > > > > And it'd be nice to be able to do: `unique(DT1, usekey=FALSE)` or > something > > like that so that we don't have to NULL and set the key of DT1. 
> > Ask and you shall receive :-) > > I added a `use.key=TRUE` parameter to unique.data.table and > duplicated.data.table which is in SVN revision 888. This runs the > relevant functions on the data.table as if it were not keyed at all. > > R> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0] > R> setkey(DT1,A) > R> DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE]) > R> dt2 <- unique(DT1[,-which(names(DT1) %in% 'B'),with=FALSE], > use.key=FALSE) > R> all.equal(DT2, dt2, check.attributes=FALSE) > [1] TRUE > > The all.equal test will fail when check.attributers is TRUE because > dt2 is still keyed by 'A'. > > R> key(DT1) > [1] "A" > > R> key(DT2) > NULL > > R> key(dt2) > [1] "A" > > Hope that covers it, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.harding at paniscus.com Fri Aug 2 16:43:37 2013 From: p.harding at paniscus.com (Paul Harding) Date: Fri, 2 Aug 2013 15:43:37 +0100 Subject: [datatable-help] Fwd: Data table hanging on memory allocation failure In-Reply-To: References: Message-ID: Hi, I've got a big data table and I'm having memory allocation issues. This isn't about the memory issue per se, rather it's about how it gets handled. The table has 2M+ rows and is about 15G in size. Whilst manipulating the table memory usage grows quite fast, and I'm having to manually garbage collect after each manipulation. Even so it's possible to reach a point (there are a lot of other developers using this server for all sorts of things) where even though there is 28GB memory free I can't allocate a needed 944MB contiguous chunk. I get the usual error message and it would be convenient if data table exited at that point (then I wouldn't lose my previous work), but it just hangs: 02-06:30:38.8> dt[,pt:=as.integer(p),by=list(sk, ik, pk)]; gc() Error: cannot allocate vector of size 944.8 Mb And the world holds its breath ... and the world starts turning blue ... I've left it like this for hours, nothing further happens. Windows Server 2008 R2 Enterprise SP1 // Intel Xeon CPU E7-4830 @ 2.13GHz 4 processors // 128GB memory installed, 28.7GB available, R session 65GB R 3.0.0 data.table 1.8.9 rev 874 RStudio 0.97 Incidentally, after finishing a table manipulation and garbage collecting the R session memory usage drops to 33GB. This is consistent behaviour, there were 5 similar calls prior to this one that executed successfully, with the same behavior (garbage collected after each). Almost as if there were a copy being made. But that's for info, not shooting off at a tangent (I'll try and do some investigation and maybe ask for help around the temporary memory growth issue later). I would be really happy if data table exited on this error or if I had that option, even if it's doing something very clever (waiting for memory availability?) because it doesn't seem to succeed. Regards Paul -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lianoglou.steve at gene.com Fri Aug 2 18:44:02 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Fri, 2 Aug 2013 09:44:02 -0700 Subject: [datatable-help] Fwd: Data table hanging on memory allocation failure In-Reply-To: References: Message-ID: Hi Paul, Is this error always reproducible after the same call? You mentioned you've done 5 (or so?) large data manipulation calls on the data.table before calling the straw that breaks the camel's back -- if you start with the last call first, does it still stall on gc()? If the last call that hangs was changed to do something else (same calling order as you have now), does it also hang? Just taking random guesses here ... Is there any way for you to be able to test if you get the same behavior on a *nix machine? I'm guessing it's probably a tall order to find extra hardware lying around that has the specs to match the machine you're reporting the error on, but it might be worth a try. Sorry: no real answers for you yet. -steve We've had some mysterious memory issues in the past which Matthew has done a good job smoking out. On Fri, Aug 2, 2013 at 7:43 AM, Paul Harding wrote: > Hi, I've got a big data table and I'm having memory allocation issues. This > isn't about the memory issue per se, rather it's about how it gets handled. > > The table has 2M+ rows and is about 15G in size. Whilst manipulating the > table memory usage grows quite fast, and I'm having to manually garbage > collect after each manipulation. Even so it's possibly to reach a point > (there are a lot of other developers using this server for all sorts of > things) where even though there is 28GB memory free I can't allocate a > needed 944MB contiguous chunk. > > I get the usual error message and it would be convenient if data table > exited at that point (then I wouldn't lose my previous work), but it just > hangs: > > 02-06:30:38.8> dt[,pt:=as.integer(p),by=list(sk, ik, pk)]; gc() > Error: cannot allocate vector of size 944.8 Mb > > And the world holds its breath ... and the world starts turning blue ...I've > left it like this for hours, nothing further happens. > > Windows Server 2008 R2 Enterprise SP1 // Intel Zeon CPU E7-4830 @ 2.13Hhz 4 > processors // 128GB memory installed, 28.7GB available, R session 65GB > R 3.0.0 data.table 1.8.9 rev 874 > RStudio 0.97 > > Incidentally, after finishing a table manipulation and garbage collecting > the R session memory usage drops to 33GB. This is consistent behaviour, > there were 5 similar calls prior to this one that executed successfully, > with the same behavior ( garbage collected after each). Almost as if there > were a copy being made. But that's for info, not shooting off at a tangent > (I'll try and do some investigation and maybe ask for help around the > temporary memory growth issue later). > > I would be really happy if data table exited on this error or if I had that > option, even if it's doing something very clever (waiting for memory > availability?) because it doesn't seem to succeed. 
> > Regards > Paul > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From mdowle at mdowle.plus.com Fri Aug 2 18:54:27 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 02 Aug 2013 17:54:27 +0100 Subject: [datatable-help] Fwd: Data table hanging on memory allocation failure In-Reply-To: References: Message-ID: <51FBE443.6010708@mdowle.plus.com> Hi, Interesting. To hone in on this my first quick thoughts are : 1. Try in plain R at the prompt rather than RStudio, just to isolate that for now. 2. Assign the result dummy<-dt[,pt:=as.integer(p),by=list(sk, ik, pk)]; gc(). That shouldn't make a difference but when printing at the prompt (even just the head and tail) I'm aware that makes an internal copy of the whole object (to be fixed, and in the meantime a manual print(dt) avoids that copy). If it's a script that's being run then maybe printing comes into it. 3. Is it after the last group has been processed, or during grouping? To establish this try printing the value of .GRP inside j; i.e., dt[,pt:={print(.GRP);as.integer(p)},by=list(sk, ik, pk)]. This will give me a clue where it might be. 4. p is definitely a column of the table dt at that point? If p is actually in calling scope it might be doing the wrong thing (over and over again). 5. Does it work with a much smaller subset of dt say 10 rows? Often this reveals that an incorrect (much larger result) is being computed. Maybe related to allow.cartesian. 6. Set options(datatable.verbose=TRUE), run again from scratch in a new session and send us the output. Might be a lot of it but we might get lucky, or give further clues. 7. Otherwise, something reproducible would be great if possible. In cases like this it doesn't have to reproduce the memory allocation problem, it just has to be pasteable into a fresh R session and complete on small data. Then I can stress test it myself and see if I can see where the leak or corruption is happening. Matthew On 02/08/13 15:43, Paul Harding wrote: > Hi, I've got a big data table and I'm having memory allocation issues. > This isn't about the memory issue per se, rather it's about how it > gets handled. > > The table has 2M+ rows and is about 15G in size. Whilst manipulating > the table memory usage grows quite fast, and I'm having to manually > garbage collect after each manipulation. Even so it's possibly to > reach a point (there are a lot of other developers using this server > for all sorts of things) where even though there is 28GB memory free I > can't allocate a needed 944MB contiguous chunk. > > I get the usual error message and it would be convenient if data table > exited at that point (then I wouldn't lose my previous work), but it > just hangs: > > 02-06:30:38.8> dt[,pt:=as.integer(p),by=list(sk, ik, pk)]; gc() > Error: cannot allocate vector of size 944.8 Mb > > And the world holds its breath ... and the world starts turning blue > ...I've left it like this for hours, nothing further happens. > > Windows Server 2008 R2 Enterprise SP1 // Intel Zeon CPU E7-4830 @ > 2.13Hhz 4 processors // 128GB memory installed, 28.7GB available, R > session 65GB > R 3.0.0 data.table 1.8.9 rev 874 > RStudio 0.97 > > Incidentally, after finishing a table manipulation and garbage > collecting the R session memory usage drops to 33GB. 
This is > consistent behaviour, there were 5 similar calls prior to this one > that executed successfully, with the same behavior ( garbage collected > after each). Almost as if there > were a copy being made. But that's for > info, not shooting off at a tangent (I'll try and do some > investigation and maybe ask for help around the > temporary memory > growth issue later). > > I would be really happy if data table exited on this error or if I had > that option, even if it's doing something very clever (waiting for > memory availability?) because it doesn't seem to succeed. > > Regards > Paul > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.kerpel2 at gmail.com Fri Aug 2 19:26:59 2013 From: john.kerpel2 at gmail.com (John Kerpel) Date: Fri, 2 Aug 2013 12:26:59 -0500 Subject: [datatable-help] Question about by statements and subsetting Message-ID: I'm a noob to data.table and I've got a couple of questions: 1). Why do I get different answers in the following example: > DT = data.table(a=c(4:13),y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3),zz=c(1,1,1,1,1,2,2,2,2,2)) > setkeyv(DT,cols=c("a","x","y","z","zz")) > DT[,if(.N>=4) {list(predict(smooth.spline(x,y),c(4,5,6))$y)} ,by=z] z V1 1: 1 2.1000000 2: 1 2.5000000 3: 1 2.9000000 4: 2 0.9998959 5: 2 2.0453352 6: 2 2.9093247 Versus: > DT[,if(.N>=4) {list(predict(smooth.spline(x,y),a[1:3])$y)} ,by=z] z V1 1: 1 2.100000 2: 1 2.500000 3: 1 2.900000 4: 2 2.999995 5: 2 2.954664 6: 2 2.909333 Is some sort of recycling going on here? 2). How can I do some sort of nested "by" statement? Let's say I want to set by=zz, but run the spline statement within each z subset. Do I use .SD somehow? This is a great package - it's just taking me some time to get the syntax right. I've found this to be faster than clusterMap on 2 cores... I hope I've used the correct terminology! Best, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Fri Aug 2 19:44:51 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Fri, 2 Aug 2013 10:44:51 -0700 Subject: [datatable-help] Question about by statements and subsetting In-Reply-To: References: Message-ID: Hi John, On Fri, Aug 2, 2013 at 10:26 AM, John Kerpel wrote: > I'm a noob to data.table and I've got a couple of questions: > > 1). Why do I get different answers in the following example: > >> DT = >> data.table(a=c(4:13),y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3),zz=c(1,1,1,1,1,2,2,2,2,2)) >> setkeyv(DT,cols=c("a","x","y","z","zz")) >> DT[,if(.N>=4) {list(predict(smooth.spline(x,y),c(4,5,6))$y)} ,by=z] > z V1 > 1: 1 2.1000000 > 2: 1 2.5000000 > 3: 1 2.9000000 > 4: 2 0.9998959 > 5: 2 2.0453352 > 6: 2 2.9093247 > > Versus: > >> DT[,if(.N>=4) {list(predict(smooth.spline(x,y),a[1:3])$y)} ,by=z] > z V1 > 1: 1 2.100000 > 2: 1 2.500000 > 3: 1 2.900000 > 4: 2 2.999995 > 5: 2 2.954664 > 6: 2 2.909333 I'm not sure why you would expect those two calls to give the same result? In the first case, the second parameter to your call to predict is always c(4,5,6), while in the second case, when z is 1, the second param to predict is 4,5,6 (the first three rows in your 2nd are the same as the first, so fine), but when z=2, the second param to predict becomes c(8,9,10), so ... 
doesn't that explain the behavior you are seeing? > Is some sort of recycling going on here? Where? You are asking to predict on 3 points (either 4,5,6 or a[1:3]) so you get 3 values back per z group. > 2). How can I do some sort of nested "by" statement? > > Let's say I want to set by=zz, but run the spline statement within each z > subset. Do I use .SD somehow? Not sure what you mean, but does this do it? R> DT[, list(predict(smooth.spline(x, y), a)$y), by=c('zz', 'z')] or something? HTH, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From iruckaE at mail2world.com Mon Aug 5 17:36:30 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 5 Aug 2013 08:36:30 -0700 (PDT) Subject: [datatable-help] data.table on existing data.frame list Message-ID: <1375716990518-4673142.post@n4.nabble.com> Hi all, I am new to data.table and I have some questions. Since fread is still in development stage, I have not used it to read my data files stored on the disk. I have used read.table and the objects are stored as data.frame. How do I use data.table on those existing data.frame objects? This is a reproducible example: dput(df) structure(list(a = c(5, 6, 6), b = c(20.2, 32.9, 0.99), cdo = c(0.2, 32, 90.34)), .Names = c("a", "b", "cdo"), row.names = c(NA, -3L ), class = "data.frame") In my real data set, I have a list of many data frames where I will need to change an existing column. For example, in the "df" data.frame, how would I make a revised "a" = "a" + "b"? The code below represents what I am attempting to do with my real data set: getratings2 <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) getratings[[u]]$y <- getratings[[u]]$y + getratings[[u]]$shift) In this case, I would like to make the column named "y" = "y" + "shift" and perform that operation for each data.frame in the list. Thank you. Irucka Embry -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Mon Aug 5 18:36:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 05 Aug 2013 17:36:17 +0100 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <1375716990518-4673142.post@n4.nabble.com> References: <1375716990518-4673142.post@n4.nabble.com> Message-ID: <51FFD481.5020109@mdowle.plus.com> Hi, In general you need to convert the data.frame to a data.table before you can use data.table syntax and features on it : DT = as.data.table(DF) But that's a copy of the whole objects so it's easier and faster to start off with a data.table in the first place; e.g., using fread as you said. Its development status means that fread's current argument names, types and order might possibly change a bit in a backwards incompatible way. It isn't intended to convey that it's too flaky for end user use. Quite a few people are already using it routinely. If you have a list of many data.frame's, then that's difficult work with. My first thought would be to pass that to rbindlist to create one large data.table, then := by group to add the column. Matthew On 05/08/13 16:36, iembry wrote: > Hi all, I am new to data.table and I have some questions. > > Since fread is still in development stage, I have not used it to read my > data files stored on the disk. I have used read.table and the objects are > stored as data.frame. > > How do I use data.table on those existing data.frame objects? 
> > This is a reproducible example: > > dput(df) > structure(list(a = c(5, 6, 6), b = c(20.2, 32.9, 0.99), cdo = c(0.2, > 32, 90.34)), .Names = c("a", "b", "cdo"), row.names = c(NA, -3L > ), class = "data.frame") > > > In my real data set, I have a list of many data frames where I will need to > change an existing column. > > For example, in the "df" data.frame, how would I make a revised "a" = "a" + > "b"? > > The code below represents what I am attempting to do with my real data set: > > getratings2 <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > getratings[[u]]$y <- getratings[[u]]$y + getratings[[u]]$shift) > > In this case, I would like to make the column named "y" = "y" + "shift" and > perform that operation for each data.frame in the list. > > Thank you. > > Irucka Embry > > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From iruckaE at mail2world.com Mon Aug 5 20:18:30 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 5 Aug 2013 11:18:30 -0700 (PDT) Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <51FFD481.5020109@mdowle.plus.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> Message-ID: <1375726710654-4673172.post@n4.nabble.com> Hi Matthew, thank you for your prompt response. I am experimenting with fread and I am having a problem with comments not being ignored. In read.table, I can ignore the comments and create column names. I have not seen any way to either create column names or ignore comments within the fread function. How would I use fread to both ignore the comments and also to add column names as I can with read.table? ratingdepostlistedread <- fread("03217500.exsa.rdb", sep="auto", sep2="auto", header="auto", na.strings="NA", stringsAsFactors=FALSE, verbose=FALSE) ratingdepostlisted <- read.table("03217500.exsa.rdb", sep = "\t", fill = TRUE, comment.char = "#", header = T, as.is = TRUE, stringsAsFactors = FALSE, na.strings = "NA", col.names = c("y", "shift", "x", "stor")) Thank you. Irucka -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673172.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Mon Aug 5 20:40:47 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 05 Aug 2013 19:40:47 +0100 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <1375726710654-4673172.post@n4.nabble.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> Message-ID: <51FFF1AF.2050500@mdowle.plus.com> Hi, When the file contains the column names (as is best practice) then any text after the last column is read (such as comments) is ignored (with a warning, as intended). Some improvement here could be made but comments in large files isn't something that's come up before. fread isn't a drop-in replacement for read.table yet, but it would be good if it was. If fread is reading your comments into a column, then I guess the first row is a data row and the last column contains the comments. 
Just delete the comments afterwards using DT[,lastcolumn:=NULL]. Or include column names in the file. To hard-name columns, can use setnames() afterwards (either to rename old to new names, or to overwrite whatever names are there, if any). If you want to read a subset of columns, use 'select' argument. The plan is to create a wrapper for fread that can be used as a drop in replacement for read.table. Are your files large and did you create them or are they given? Matthew On 05/08/13 19:18, iembry wrote: > Hi Matthew, thank you for your prompt response. > > I am experimenting with fread and I am having a problem with comments not > being ignored. In read.table, I can ignore the comments and create column > names. I have not seen any way to either create column names or ignore > comments within the fread function. > > How would I use fread to both ignore the comments and also to add column > names as I can with read.table? > > ratingdepostlistedread <- fread("03217500.exsa.rdb", sep="auto", > sep2="auto", header="auto", na.strings="NA", stringsAsFactors=FALSE, > verbose=FALSE)) > > > ratingdepostlisted <- read.table("03217500.exsa.rdb", sep = "\t", fill = > TRUE, comment.char = "#", header = T, as.is = TRUE, stringsAsFactors = > FALSE, na.strings = "NA", col.names = c("y", "shift", "x", "stor") > > > Thank you. > > Irucka > > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673172.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From iruckaE at mail2world.com Mon Aug 5 21:38:41 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 5 Aug 2013 12:38:41 -0700 (PDT) Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <51FFF1AF.2050500@mdowle.plus.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> Message-ID: <1375731521742-4673181.post@n4.nabble.com> Hi Matthew, this link is in a similar format to the files that I'm processing now: http://waterdata.usgs.gov/nwis/dv?cb_00095=on&cb_00065=on&cb_00060=on&format=rdb&period=&begin_date=2012-08-04&end_date=2013-08-04&site_no=02169570&referred_module=sw Both file formats begin with the comments followed by the column names followed by agency code information and then the actual data. The .rdb text files vary in length (some may range from a few hundred lines long to over 20,000 lines). I am given the files that I am processing. Thank you. Irucka -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673181.html Sent from the datatable-help mailing list archive at Nabble.com. From jholtman at gmail.com Mon Aug 5 21:52:24 2013 From: jholtman at gmail.com (jim holtman) Date: Mon, 5 Aug 2013 15:52:24 -0400 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <1375731521742-4673181.post@n4.nabble.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> <1375731521742-4673181.post@n4.nabble.com> Message-ID: Here is what I would do. Read in the file, delete the comments, write it back out and then process it. 
> myFile <- tempfile() # temp file > input <- readLines('/temp/dv.txt') # this is a copy of the data you posted > # remove comments > input <- input[!grepl("^#", input)] > require(data.table) Loading required package: data.table data.table 1.8.8 For help type: help("data.table") > writeLines(input, myFile) > dv <- fread(myFile) > > str(dv) Classes 'data.table' and 'data.frame': 367 obs. of 21 variables: $ agency_cd : chr "5s" "USGS" "USGS" "USGS" ... $ site_no : chr "15s" "02169570" "02169570" "02169570" ... $ datetime : chr "20d" "2012-08-04" "2012-08-05" "2012-08-06" ... $ 04_00095_00001 : chr "14n" "" "" "" ... $ 04_00095_00001_cd: chr "10s" "" "" "" ... $ 04_00095_00002 : chr "14n" "" "" "" ... $ 04_00095_00002_cd: chr "10s" "" "" "" ... $ 04_00095_00003 : chr "14n" "" "" "" ... $ 04_00095_00003_cd: chr "10s" "" "" "" ... $ 05_00065_00001 : chr "14n" "2.10" "1.71" "1.77" ... $ 05_00065_00001_cd: chr "10s" "A" "A" "A" ... $ 05_00065_00002 : chr "14n" "1.71" "1.56" "1.57" ... $ 05_00065_00002_cd: chr "10s" "A" "A" "A" ... $ 05_00065_00003 : chr "14n" "1.89" "1.62" "1.63" ... $ 05_00065_00003_cd: chr "10s" "A" "A" "A" ... $ 15_00060_00001 : chr "14n" "52" "33" "36" ... $ 15_00060_00001_cd: chr "10s" "A" "A" "A" ... $ 15_00060_00002 : chr "14n" "33" "27" "27" ... $ 15_00060_00002_cd: chr "10s" "A" "A" "A" ... $ 15_00060_00003 : chr "14n" "42" "29" "30" ... $ 15_00060_00003_cd: chr "10s" "A" "A" "A" ... - attr(*, ".internal.selfref")= On Mon, Aug 5, 2013 at 3:38 PM, iembry wrote: > Hi Matthew, this link is in a similar format to the files that I'm > processing > now: > > http://waterdata.usgs.gov/nwis/dv?cb_00095=on&cb_00065=on&cb_00060=on&format=rdb&period=&begin_date=2012-08-04&end_date=2013-08-04&site_no=02169570&referred_module=sw > > Both file formats begin with the comments followed by the column names > followed by agency code information and then the actual data. > > The .rdb text files vary in length (some may range from a few hundred lines > long to over 20,000 lines). I am given the files that I am processing. > > Thank you. > > Irucka > > > > > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673181.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From iruckaE at mail2world.com Tue Aug 6 02:35:58 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 5 Aug 2013 17:35:58 -0700 (PDT) Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> <1375731521742-4673181.post@n4.nabble.com> Message-ID: <1375749358345-4673198.post@n4.nabble.com> Hi, thank you for the advice. I'll keep your suggestion in mind for some other projects, but for this project I'm going to have to use read.table and then convert the data.frame to data.table as I have over 1000 files that I'm processing. It has not taken long to process this first small set of around 60 files. 
This is the code that has worked for me to alter an existing column and delete a column that was no longer needed: getratingsmore <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) data.table(getratings[[u]])) getratingsmore <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) getratingsmore[[u]][, y:=y+shift]) getratingsmore <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) getratingsmore[[u]][, shift:=NULL]) It continues with my initial question that I posed earlier. Thank you. Irucka -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673198.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Aug 6 02:37:04 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 06 Aug 2013 01:37:04 +0100 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> <1375731521742-4673181.post@n4.nabble.com> Message-ID: <52004530.4040108@mdowle.plus.com> The comments are really a banner at the start of the file it seems. So this is all built in to fread already. But the banner in the example is 34 rows, so the default of autostart=30 isn't enough. Try: fread("03217500.exsa.rdb", autostart=40) That should do it in one shot, including detecting the column names. I've just increased autostart a bit to be within the data block. See ?fread for a detailed description of autostart and the procedure. Btw, if there is more than one table in a single file, then setting autostart to be within each one is how to read each one in. And provided there is no footer, you can set autostart to be very large, too (with downside of time to seek back from the end to find the column names). Matthew On 05/08/13 20:52, jim holtman wrote: > Here is what I would do. Read in the file, delete the comments, write > it back out and then process it. > > > > myFile <- tempfile() # temp file > > input <- readLines('/temp/dv.txt') # this is a copy of the data you > posted > > # remove comments > > input <- input[!grepl("^#", input)] > > require(data.table) > Loading required package: data.table > data.table 1.8.8 For help type: help("data.table") > > writeLines(input, myFile) > > dv <- fread(myFile) > > > > > str(dv) > Classes 'data.table' and 'data.frame': 367 obs. of 21 variables: > $ agency_cd : chr "5s" "USGS" "USGS" "USGS" ... > $ site_no : chr "15s" "02169570" "02169570" "02169570" ... > $ datetime : chr "20d" "2012-08-04" "2012-08-05" > "2012-08-06" ... > $ 04_00095_00001 : chr "14n" "" "" "" ... > $ 04_00095_00001_cd: chr "10s" "" "" "" ... > $ 04_00095_00002 : chr "14n" "" "" "" ... > $ 04_00095_00002_cd: chr "10s" "" "" "" ... > $ 04_00095_00003 : chr "14n" "" "" "" ... > $ 04_00095_00003_cd: chr "10s" "" "" "" ... > $ 05_00065_00001 : chr "14n" "2.10" "1.71" "1.77" ... > $ 05_00065_00001_cd: chr "10s" "A" "A" "A" ... > $ 05_00065_00002 : chr "14n" "1.71" "1.56" "1.57" ... > $ 05_00065_00002_cd: chr "10s" "A" "A" "A" ... > $ 05_00065_00003 : chr "14n" "1.89" "1.62" "1.63" ... > $ 05_00065_00003_cd: chr "10s" "A" "A" "A" ... > $ 15_00060_00001 : chr "14n" "52" "33" "36" ... > $ 15_00060_00001_cd: chr "10s" "A" "A" "A" ... > $ 15_00060_00002 : chr "14n" "33" "27" "27" ... > $ 15_00060_00002_cd: chr "10s" "A" "A" "A" ... > $ 15_00060_00003 : chr "14n" "42" "29" "30" ... 
> $ 15_00060_00003_cd: chr "10s" "A" "A" "A" ... > - attr(*, ".internal.selfref")= > > > > On Mon, Aug 5, 2013 at 3:38 PM, iembry > wrote: > > Hi Matthew, this link is in a similar format to the files that I'm > processing > now: > http://waterdata.usgs.gov/nwis/dv?cb_00095=on&cb_00065=on&cb_00060=on&format=rdb&period=&begin_date=2012-08-04&end_date=2013-08-04&site_no=02169570&referred_module=sw > > Both file formats begin with the comments followed by the column names > followed by agency code information and then the actual data. > > The .rdb text files vary in length (some may range from a few > hundred lines > long to over 20,000 lines). I am given the files that I am processing. > > Thank you. > > Irucka > > > > > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673181.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -- > Jim Holtman > Data Munger Guru > > What is the problem that you are trying to solve? > Tell me what you want to do, not how you want to do it. > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Aug 6 02:50:16 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 06 Aug 2013 01:50:16 +0100 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <1375749358345-4673198.post@n4.nabble.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> <1375731521742-4673181.post@n4.nabble.com> <1375749358345-4673198.post@n4.nabble.com> Message-ID: <52004848.2070708@mdowle.plus.com> Assuming fread with autostart=40 now works, then quick pseudo might be : big = rbindlist(lapply(fileNameVector, fread, autostart=40)) big[,y:=y+shift] big[,shift:=NULL] Matthew On 06/08/13 01:35, iembry wrote: > Hi, thank you for the advice. > > I'll keep your suggestion in mind for some other projects, but for this > project I'm going to have to use read.table and then convert the data.frame > to data.table as I have over 1000 files that I'm processing. It has not > taken long to process this first small set of around 60 files. > > This is the code that has worked for me to alter an existing column and > delete a column that was no longer needed: > > getratingsmore <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > data.table(getratings[[u]])) > getratingsmore <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > getratingsmore[[u]][, y:=y+shift]) > getratingsmore <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > getratingsmore[[u]][, shift:=NULL]) > > It continues with my initial question that I posed earlier. > > Thank you. > > Irucka > > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673198.html > Sent from the datatable-help mailing list archive at Nabble.com. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From iruckaE at mail2world.com Tue Aug 6 04:12:48 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 5 Aug 2013 19:12:48 -0700 (PDT) Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <52004848.2070708@mdowle.plus.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> <1375731521742-4673181.post@n4.nabble.com> <1375749358345-4673198.post@n4.nabble.com> <52004848.2070708@mdowle.plus.com> Message-ID: <1375755168059-4673201.post@n4.nabble.com> Hi Matthew, thank you for your prompt and great assistance. Yes, moving the autostart = 40 does work. Yes, it did detect the column names. In order to read in the .exsa.rdb files I created a function that follows getDataRatingDepotFiles <- function (file, hasHeader = TRUE, separator = "\t") { RDdatatmp <- as.matrix(read.table(file, sep = "\t", fill = TRUE, comment.char = "#", header = T, as.is = TRUE, stringsAsFactors = FALSE, na.strings = "NA", col.names = c("y", "shift", "x", "stor"))) RDdatatmp <- as.matrix(RDdatatmp[c(-1), c(-4)]) RDdatatmp <- as.data.frame(RDdatatmp, stringsAsFactors = FALSE) RDdatatmp$y <- as.numeric(as.character(RDdatatmp$y)) RDdatatmp$x <- as.numeric(as.character(RDdatatmp$x)) RDdatatmp$shift <- as.numeric(as.character(RDdatatmp$shift)) return(RDdatatmp) } I created an object called sitefiles that has the pattern of the file extension that I want. In the same folder there are files with two other file extensions that I do not want to use in this project. sitefiles <- list.files(path ="/tried", pattern <- ".exsa.rdb$", full.names = TRUE) getratings <- lapply(sitefiles, getDataRatingDepotFiles) Is there any way to replicate the above with fread? Irucka The comments are really a banner at the start of the file it seems. So this is all built in to fread already. But the banner in the example is 34 rows, so the default of autostart=30 isn't enough. Try: fread("03217500.exsa.rsb", autostart=40) That should do it in one shot, including detecting the column names. I've just increased autostart a bit to be within the data block. See ?fread for a detailed description of autostart and the procedure. Btw, if there is more than one table in a single file, then setting autostart to be within each one is how to read each one in. And provided there is no footer, you can set autostart to be very large, too (with downside of time to seek back from the end to find the column names). Matthew -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673201.html Sent from the datatable-help mailing list archive at Nabble.com. From iruckaE at mail2world.com Tue Aug 6 05:10:02 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 5 Aug 2013 20:10:02 -0700 (PDT) Subject: [datatable-help] subset between data.table list and single data.table object Message-ID: <1375758602714-4673202.post@n4.nabble.com> Hi, I started this new topic thread as now I'm concentrating on the subsetting portion of the R code. I have the R objects aimall and getratingsmore. 
aimall is a data.table with the column names "mean" and "p50" and it is 59 rows long, but I have truncated it for this question & getratingsmore is a list of 59 data.table objects with the column names "y" and "x". dput(aimall) structure(list(mean = c(3882.65, 819.82, 23742.37), p50 = c(1830, 382, 10400)), .Names = c("mean", "p50"), row.names = c(NA, -3L), class = c("data.table", "data.frame"), .internal.selfref = ) dput(getratingsmore) list(structure(list(y = c(14.8, 14.81, 14.82), x = c(7900, 7920, 7930)), .Names = c("y", "x"), row.names = c(NA, -2721L), class = c("data.table", "data.frame"), .internal.selfref = ), structure(list(y = c(4, 4.01, 4.02), x = c(21, 21, 22)), .Names = c("y", "x"), row.names = c(NA, -1464L), class = c("data.table", "data.frame"), .internal.selfref = ), structure(list(y = c(73.05, 73.06, 73.07), x = c(70, 76, 82)), .Names = c("y", "x"), row.names = c(NA, -1996L), class = c("data.table", "data.frame"), .internal.selfref = )) I have used the following code to attempt to subset getratingsmore and the dput follows: mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) {ifelse(aim[1]$mean[u] < min(getratingsmore[[u]]$x), subset(getratings[[u]], aim[1]$mean[u] > min(getratingsmore[[u]]$x) & aim[1]$mean[u], aim[u]$mean[u] > min(getratingsmore[[u]]$x)), aim[1]$mean[u])}) dput(mp) list(list(NULL), 819.82, 23742.37) I had created a key for aimall, but then the data.table was sorted and it lost its connection to getratingsmore. Right now, aimall and getratingsmore represent the same station in their current order. Is there any way to set a key to each row of aimall that will match each data.frame of getratingsmore? Is there a way to subset in the way that I have described without a key using data.table? I want to know which stations have a mean and p50 value < the lowest "x" in each station data.frame of getratingsmore so that those stations can then use extrapolation in the next step. I also want to know which stations have a mean and p50 value > the lowest "x" in each station data.frame of getratingsmore so that those stations can then use interpolation in the next step. Thank you. Irucka -- View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Aug 6 10:49:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 06 Aug 2013 09:49:44 +0100 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <1375755168059-4673201.post@n4.nabble.com> References: <1375716990518-4673142.post@n4.nabble.com> <51FFD481.5020109@mdowle.plus.com> <1375726710654-4673172.post@n4.nabble.com> <51FFF1AF.2050500@mdowle.plus.com> <1375731521742-4673181.post@n4.nabble.com> <1375749358345-4673198.post@n4.nabble.com> <52004848.2070708@mdowle.plus.com> <1375755168059-4673201.post@n4.nabble.com> Message-ID: <5200B8A8.9050100@mdowle.plus.com> On 06/08/13 03:12, iembry wrote: > Hi Matthew, thank you for your prompt and great assistance. > > Yes, moving the autostart = 40 does work. Yes, it did detect the column > names. Great. 
> > In order to read in the .exsa.rdb files I created a function that follows > > getDataRatingDepotFiles <- function (file, hasHeader = TRUE, separator = > "\t") > { > RDdatatmp <- as.matrix(read.table(file, sep = "\t", fill = TRUE, > comment.char = "#", header = T, as.is = TRUE, stringsAsFactors = FALSE, > na.strings = "NA", col.names = c("y", "shift", "x", "stor"))) > RDdatatmp <- as.matrix(RDdatatmp[c(-1), c(-4)]) > RDdatatmp <- as.data.frame(RDdatatmp, stringsAsFactors = FALSE) > RDdatatmp$y <- as.numeric(as.character(RDdatatmp$y)) > RDdatatmp$x <- as.numeric(as.character(RDdatatmp$x)) > RDdatatmp$shift <- as.numeric(as.character(RDdatatmp$shift)) > return(RDdatatmp) > } > > I created an object called sitefiles that has the pattern of the file > extension that I want. In the same folder there are files with two other > file extensions that I do not want to use in this project. > > sitefiles <- list.files(path ="/tried", pattern <- ".exsa.rdb$", full.names > = TRUE) > getratings <- lapply(sitefiles, getDataRatingDepotFiles) > > Is there any way to replicate the above with fread? I don't follow. fread reads the file. 'select' arg can be used to select columns, or you can use setnames() afterwards to rename them. fread doesn't create factors anyway. The numeric columns should be detected automatically but you can pass 'colClasses' manually to fread if you need to read integer data as a numeric type, in the latest version. Or are you asking if fread can read multiple files? > > Irucka > > > > > > > > > The comments are really a banner at the start of the file it seems. So this > is all built in to fread already. But the banner in the example is 34 rows, > so the default of autostart=30 isn't enough. Try: > > fread("03217500.exsa.rsb", autostart=40) > > That should do it in one shot, including detecting the column names. I've > just increased autostart a bit to be within the data block. See ?fread for > a detailed description of autostart and the procedure. > > Btw, if there is more than one table in a single file, then setting > autostart to be within each one is how to read each one in. And provided > there is no footer, you can set autostart to be very large, too (with > downside of time to seek back from the end to find the column names). > > Matthew > > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data-frame-list-tp4673142p4673201.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Tue Aug 6 11:17:13 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 06 Aug 2013 10:17:13 +0100 Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <1375758602714-4673202.post@n4.nabble.com> References: <1375758602714-4673202.post@n4.nabble.com> Message-ID: <5200BF19.3060001@mdowle.plus.com> Hi, This one isn't quite clear enough to answer quickly and easily. I'd expect there to be similar questions on Stack Overflow with code formatting, data in the question and benchmarks. Trick is how to filter the 708 data.table questions. Try these 61 returned by "[data.table] join two" : http://stackoverflow.com/search?q=%5Bdata.table%5D+join+two Let us know if that doesn't help, or raise a new question there. 
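[Editor's note: for readers of the archive, a minimal sketch of the comparison iembry describes two messages above, assuming — as she states — that the rows of aimall align positionally with the getratingsmore list. The object names are hers; the helper logic below is illustrative, not taken from the thread.]

    # TRUE where both the mean and the p50 fall below the lowest x in that
    # station's rating table, i.e. the stations that need extrapolation:
    needsExtrap <- sapply(seq_along(getratingsmore), function(u)
        aimall$mean[u] < min(getratingsmore[[u]]$x) &
        aimall$p50[u]  < min(getratingsmore[[u]]$x))
    which(needsExtrap)   # extrapolate these stations; interpolate the rest
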
Also maybe try "[r] -[data.table] data.table" with is:question or is:answer as well. That returns answers using data.table for questions which weren't about data.table. Matthew On 06/08/13 04:10, iembry wrote: > Hi, I started this new topic thread as now I'm concentrating on the > subsetting portion of the R code. > > I have the R objects aimall and getratingsmore. aimall is a data.table with > the column names "mean" and "p50" and it is 59 rows long, but I have > truncated it for this question & getratingsmore is a list of 59 data.table > objects with the column names "y" and "x". > > dput(aimall) > structure(list(mean = c(3882.65, 819.82, 23742.37), p50 = c(1830, 382, > 10400)), .Names = c("mean", "p50"), row.names = c(NA, -3L), class = > c("data.table", "data.frame"), .internal.selfref = ) > > > dput(getratingsmore) > list(structure(list(y = c(14.8, 14.81, 14.82), x = c(7900, 7920, 7930)), > .Names = c("y", "x"), row.names = c(NA, > -2721L), class = c("data.table", "data.frame"), .internal.selfref = > ), structure(list(y = c(4, 4.01, 4.02), x = c(21, 21, > 22)), .Names = c("y", "x"), row.names = c(NA, -1464L), class = > c("data.table", "data.frame"), .internal.selfref = ), > structure(list(y = c(73.05, 73.06, 73.07), x = c(70, 76, 82)), .Names = > c("y", "x"), row.names = c(NA, -1996L), class = c("data.table", > "data.frame"), .internal.selfref = )) > > > I have used the following code to attempt to subset getratingsmore and the > dput follows: > > mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > {ifelse(aim[1]$mean[u] < min(getratingsmore[[u]]$x), subset(getratings[[u]], > aim[1]$mean[u] > min(getratingsmore[[u]]$x) & aim[1]$mean[u], aim[u]$mean[u] >> min(getratingsmore[[u]]$x)), aim[1]$mean[u])}) > dput(mp) > list(list(NULL), 819.82, 23742.37) > > > I had created a key for aimall, but then the data.table was sorted and it > lost its connection to getratingsmore. Right now, aimall and getratingsmore > represent the same station in their current order. > > Is there any way to set a key to each row of aimall that will match each > data.frame of getratingsmore? > > Is there a way to subset in the way that I have described without a key > using data.table? > > I want to know which stations have a mean and p50 value < the lowest "x" in > each station data.frame of getratingsmore so that those stations can then > use extrapolation in the next step. > > I also want to know which stations ahve a mean and p50 value > the lowest > "x" in each station data.frame of getratingsmore so that those stations can > then use interpolation in the next step. > > Thank you. > > Irucka > > > > -- > View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From iruckaE at mail2world.com Tue Aug 6 17:56:39 2013 From: iruckaE at mail2world.com (Irucka Embry) Date: Tue, 6 Aug 2013 08:56:39 -0700 Subject: [datatable-help] data.table on existing data.frame list Message-ID: <55425675B955400E89831C4727509420@mail2world.com> Hi Matthew, how are you? Thank you for the notes on fread. 
I had tried fread to read sitefiles (see the previous e-mail), but this error message was returned: Error in fread(sitefiles) : 'input' must be a single character string containing a file name, full path to a file, a URL starting 'http://' or 'file://', or the input data itself Is there a work around to get fread to read a file path like sitefiles? I was detailing what I was doing with read.table to make sure that fread could also accomplish those same objectives with the files. Thank you. Irucka <-----Original Message-----> >From: Matthew Dowle [mdowle at mdowle.plus.com] >Sent: 8/6/2013 3:49:44 AM >To: iruckaE at mail2world.com >Cc: datatable-help at lists.r-forge.r-project.org >Subject: Re: [datatable-help] data.table on existing data.frame list > >On 06/08/13 03:12, iembry wrote: >> Hi Matthew, thank you for your prompt and great assistance. >> >> Yes, moving the autostart = 40 does work. Yes, it did detect the column >> names. >Great. >> >> In order to read in the .exsa.rdb files I created a function that follows >> >> getDataRatingDepotFiles <- function (file, hasHeader = TRUE, separator = >> "\t") >> { >> RDdatatmp <- as.matrix(read.table(file, sep = "\t", fill = TRUE, >> comment.char = "#", header = T, as.is = TRUE, stringsAsFactors = FALSE, >> na.strings = "NA", col.names = c("y", "shift", "x", "stor"))) >> RDdatatmp <- as.matrix(RDdatatmp[c(-1), c(-4)]) >> RDdatatmp <- as.data.frame(RDdatatmp, stringsAsFactors = FALSE) >> RDdatatmp$y <- as.numeric(as.character(RDdatatmp$y)) >> RDdatatmp$x <- as.numeric(as.character(RDdatatmp$x)) >> RDdatatmp$shift <- as.numeric(as.character(RDdatatmp$shift)) >> return(RDdatatmp) >> } >> >> I created an object called sitefiles that has the pattern of the file >> extension that I want. In the same folder there are files with two other >> file extensions that I do not want to use in this project. >> >> sitefiles <- list.files(path ="/tried", pattern <- ".exsa.rdb$", full.names >> = TRUE) >> getratings <- lapply(sitefiles, getDataRatingDepotFiles) >> >> Is there any way to replicate the above with fread? >I don't follow. fread reads the file. 'select' arg can be used to >select columns, or you can use setnames() afterwards to rename them. >fread doesn't create factors anyway. The numeric columns should be >detected automatically but you can pass 'colClasses' manually to fread >if you need to read integer data as a numeric type, in the latest >version. Or are you asking if fread can read multiple files? > > >> >> Irucka >> >> >> >> >> >> >> >> >> The comments are really a banner at the start of the file it seems. So this >> is all built in to fread already. But the banner in the example is 34 rows, >> so the default of autostart=30 isn't enough. Try: >> >> fread("03217500.exsa.rsb", autostart=40) >> >> That should do it in one shot, including detecting the column names. I've >> just increased autostart a bit to be within the data block. See ?fread for >> a detailed description of autostart and the procedure. >> >> Btw, if there is more than one table in a single file, then setting >> autostart to be within each one is how to read each one in. And provided >> there is no footer, you can set autostart to be very large, too (with >> downside of time to seek back from the end to find the column names). >> >> Matthew >> >> >> >> -- >> View this message in context: http://r.789695.n4.nabble.com/data-table-on-existing-data- >frame-list-tp4673142p4673201.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >. >
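The error quoted above is fread's single-input check: unlike read.table, fread takes one file name (or URL, or the data itself), not a vector of paths. A minimal sketch of the loop workaround, assuming 'sitefiles' is the character vector of paths built earlier and that autostart=40 suits every file in it:

getratings <- lapply(sitefiles, fread, autostart = 40)  # one data.table per path

Each element can then be tweaked per file, and the pieces stacked with rbindlist(), as the reply below shows.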

-------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Aug 6 19:41:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 06 Aug 2013 18:41:44 +0100 Subject: [datatable-help] data.table on existing data.frame list In-Reply-To: <55425675B955400E89831C4727509420@mail2world.com> References: <55425675B955400E89831C4727509420@mail2world.com> Message-ID: <52013558.3020508@mdowle.plus.com> What I previously suggested should work; i.e., big = rbindlist(lapply(fileNameVector, fread, autostart=40)) big[,y:=y+shift] big[,shift:=NULL] just replace 'fileNameVector' with 'sitefiles'. On 06/08/13 16:56, Irucka Embry wrote: > Hi Matthew, how are you? > > Thank you for the notes on fread. I had tried fread to read sitefiles > (see the previous e-mail), but this error message was returned: > > Error in fread(sitefiles) : > 'input' must be a single character string containing a file name, full > path to a file, a URL starting 'http://' or 'file://', or the input > data itself > > Is there a work around to get fread to read a file path like sitefiles? > > I was detailing what I was doing with read.table to make sure that > fread could also accomplish those same objectives with the files. > > Thank you. > > Irucka > > > > <-----Original Message-----> > >From: Matthew Dowle [mdowle at mdowle.plus.com] > >Sent: 8/6/2013 3:49:44 AM > >To: iruckaE at mail2world.com > >Cc: datatable-help at lists.r-forge.r-project.org > >Subject: Re: [datatable-help] data.table on existing data.frame list > > > >On 06/08/13 03:12, iembry wrote: > >> Hi Matthew, thank you for your prompt and great assistance. > >> > >> Yes, moving the autostart = 40 does work. Yes, it did detect the column > >> names. > >Great. > >> > >> In order to read in the .exsa.rdb files I created a function that > follows > >> > >> getDataRatingDepotFiles <- function (file, hasHeader = TRUE, > separator = > >> "\t") > >> { > >> RDdatatmp <- as.matrix(read.table(file, sep = "\t", fill = TRUE, > >> comment.char = "#", header = T, as.is = TRUE, stringsAsFactors = FALSE, > >> na.strings = "NA", col.names = c("y", "shift", "x", "stor"))) > >> RDdatatmp <- as.matrix(RDdatatmp[c(-1), c(-4)]) > >> RDdatatmp <- as.data.frame(RDdatatmp, stringsAsFactors = FALSE) > >> RDdatatmp$y <- as.numeric(as.character(RDdatatmp$y)) > >> RDdatatmp$x <- as.numeric(as.character(RDdatatmp$x)) > >> RDdatatmp$shift <- as.numeric(as.character(RDdatatmp$shift)) > >> return(RDdatatmp) > >> } > >> > >> I created an object called sitefiles that has the pattern of the file > >> extension that I want. In the same folder there are files with two > other > >> file extensions that I do not want to use in this project. > >> > >> sitefiles <- list.files(path ="/tried", pattern <- ".exsa.rdb$", > full.names > >> = TRUE) > >> getratings <- lapply(sitefiles, getDataRatingDepotFiles) > >> > >> Is there any way to replicate the above with fread? > >I don't follow. fread reads the file. 'select' arg can be used to > >select columns, or you can use setnames() afterwards to rename them. > >fread doesn't create factors anyway. The numeric columns should be > >detected automatically but you can pass 'colClasses' manually to fread > >if you need to read integer data as a numeric type, in the latest > >version. Or are you asking if fread can read multiple files? > > > > > >> > >> Irucka > >> > >> > >> > >> > >> > >> > >> > >> > >> The comments are really a banner at the start of the file it seems. 
> So this > >> is all built in to fread already. But the banner in the example is > 34 rows, > >> so the default of autostart=30 isn't enough. Try: > >> > >> fread("03217500.exsa.rsb", autostart=40) > >> > >> That should do it in one shot, including detecting the column > names. I've > >> just increased autostart a bit to be within the data block. See > ?fread for > >> a detailed description of autostart and the procedure. > >> > >> Btw, if there is more than one table in a single file, then setting > >> autostart to be within each one is how to read each one in. And > provided > >> there is no footer, you can set autostart to be very large, too (with > >> downside of time to seek back from the end to find the column names). > >> > >> Matthew > >> > >> > >> > >> -- > >> View this message in context: > http://r.789695.n4.nabble.com/data-table-on-existing-data- > >frame-list-tp4673142p4673201.html > >> Sent from the datatable-help mailing list archive at Nabble.com. > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > > > >. > > > > _______________________________________________________________ > Get the Free email that has everyone talking at http://www.mail2world.com > Unlimited Email Storage -- POP3 -- Calendar -- SMS -- Translator -- > Much More! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Wed Aug 7 00:45:36 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Tue, 6 Aug 2013 15:45:36 -0700 Subject: [datatable-help] (no subject) Message-ID: Hi John, (resending because I was bounced from list due to sending from wrong email address) Please use "reply-all" when replying to emails on this list so that discussion stays "on list" and others can help with and benefit from the discussion. Comments below: On Aug 6, 2013, at 2:40 PM, John Kerpel wrote: > Steve: > > To follow up on my question from a couple of days ago, assuming the > following: > DT = data.table(a=c(4:13),y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3),zz=c(1,1,1,1,1,2,2,2,2,2)) > setkeyv(DT,cols=c("a","x","y","z","zz")) > #DT[,if(.N>=4) {list(predict(smooth.spline(x,y),a)$y)} ,by=c('z', 'zz')] > > a=c(4:13) > y=c(1,1,2,2,2,3,3,3,4,4) > x=1:10 > predict(smooth.spline(x[1:4],y[1:4]),a[1:5])$y > [1] 2.1 2.5 2.9 3.3 3.7 > predict(smooth.spline(x[5:8],y[5:8]),a[6:10])$y > [1] 2.954664 2.909333 2.864003 2.818672 2.773341 > So in this example the predictor a is indexed by zz and (x,y) is indexed by > z. Is there a way to do this in the "by" statement? I've got a workaround > that uses clusterMap, but I'd like to use data.table instead via some > statement like what is commented out above. > Thanks for your help. This seems like the data is setup in a rather strange way -- you'd like to have objects (smooth splines) predict on elements (the `a`s) that are trained on different sets that you want to predict .. there's no "natural" way to use the same data for training and prediction by iterating over subsets at the same time. 
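For contrast with the z/zz split John is after, a minimal sketch of what grouped fit-and-predict does handle naturally, i.e. his commented-out line grouped by z alone, so that each z-group's spline is evaluated at that same group's own a values:

DT[, if (.N >= 4) list(pred = predict(smooth.spline(x, y), a)$y), by = z]

Groups with fewer than 4 points (here z == 3) return NULL and are simply dropped from the result.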
Perhaps you provided a toy example which isn't how your real data is set up, but if not, I'd recommend perhaps having two different tables (one with your zz's and your z's split), eg: train <- data.table(x=whatever, y=whatever, z=z-index) predict.on <- data.table(a=a.values, z=z-index) Anyway, I'll just leave the code that uses data.table with your current data below with no further comment -- it'll do what you want. library(data.table) a <- c(4:13) y <- c(1,1,2,2,2,3,3,3,4,4) x <- 1:10 z <- c(1,1,1,1,2,2,2,2,3,3) zz <- c(1,1,1,1,1,2,2,2,2,2) DT <- data.table(a=a, y=y, x=x, z=z, zz=zz) setkeyv(DT, 'z') Zs <- unique(DT)$z splines <- lapply(Zs, function(zval) { dt <- DT[J(zval)] if (nrow(dt) >= 4) { ss <- smooth.spline(dt$x, dt$y) } else { ss <- NULL } data.table(zz=zval, ss=list(ss), is.spline=!is.null(ss)) }) splines <- rbindlist(splines)[is.spline == TRUE] setkeyv(splines, 'zz') setkeyv(DT, 'zz') splines[DT, list(preds=predict(ss[[1]], a)$y)] zz preds 1: 1 2.100000 2: 1 2.500000 3: 1 2.900000 4: 1 3.300000 5: 1 3.700000 6: 2 2.954664 7: 2 2.909333 8: 2 2.864003 9: 2 2.818672 10: 2 2.773341 HTH, -steve From john.kerpel2 at gmail.com Wed Aug 7 03:50:57 2013 From: john.kerpel2 at gmail.com (John Kerpel) Date: Tue, 6 Aug 2013 20:50:57 -0500 Subject: [datatable-help] (no subject) In-Reply-To: References: Message-ID: Wow, thx! I didn't think it would be straightforward - but to your point I will try to set up my data differently to see if I can simplify the process. On Tue, Aug 6, 2013 at 5:45 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi John, > > (resending because I was bounced from list due to sending from wrong > email address) > > Please use "reply-all" when replying to emails on this list so that > discussion stays "on list" and others can help with and benefit from > the discussion. > > Comments below: > > On Aug 6, 2013, at 2:40 PM, John Kerpel wrote: > > > Steve: > > > > To follow up on my question from a couple of days ago, assuming the > > following: > > > DT = > data.table(a=c(4:13),y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3),zz=c(1,1,1,1,1,2,2,2,2,2)) > > setkeyv(DT,cols=c("a","x","y","z","zz")) > > #DT[,if(.N>=4) {list(predict(smooth.spline(x,y),a)$y)} ,by=c('z', 'zz')] > > > > a=c(4:13) > > y=c(1,1,2,2,2,3,3,3,4,4) > > x=1:10 > > predict(smooth.spline(x[1:4],y[1:4]),a[1:5])$y > > [1] 2.1 2.5 2.9 3.3 3.7 > > predict(smooth.spline(x[5:8],y[5:8]),a[6:10])$y > > [1] 2.954664 2.909333 2.864003 2.818672 2.773341 > > > So in this example the predictor a is indexed by zz and (x,y) is indexed > by > > z. Is there a way to do this in the "by" statement? I've got a > workaround > > that uses clusterMap, but I'd like to use data.table instead via some > > statement like what is commented out above. > > > Thanks for your help. > > This seems like the data is setup in a rather strange way -- you'd > like to have objects (smooth splines) predict on elements (the `a`s) > that are trained on different sets that you want to predict .. there's > no "natural" way to use the same data for training and prediction by > iterating over subsets at the same time. 
> > Perhaps you provided a toy example which isn't how your real data is > set up, but if not, I'd recommend perhaps having two different tables > (one with your zz's and your z's split), eg: > > train <- data.table(x=whatever, y=whatever, z=z-index) > predict.on <- data.table(a=a.values, z=z-index) > > Anyway, I'll just leave the code that uses data.table with your > current data below with no further comment -- it'll do what you want. > > library(data.table) > > a <- c(4:13) > y <- c(1,1,2,2,2,3,3,3,4,4) > x <- 1:10 > z <- c(1,1,1,1,2,2,2,2,3,3) > zz <- c(1,1,1,1,1,2,2,2,2,2) > DT <- data.table(a=a, y=y, x=x, z=z, zz=zz) > setkeyv(DT, 'z') > Zs <- unique(DT)$z > > splines <- lapply(Zs, function(zval) { > dt <- DT[J(zval)] > if (nrow(dt) >= 4) { > ss <- smooth.spline(dt$x, dt$y) > } else { > ss <- NULL > } > data.table(zz=zval, ss=list(ss), is.spline=!is.null(ss)) > }) > splines <- rbindlist(splines)[is.spline == TRUE] > setkeyv(splines, 'zz') > setkeyv(DT, 'zz') > > splines[DT, list(preds=predict(ss[[1]], a)$y)] > zz preds > 1: 1 2.100000 > 2: 1 2.500000 > 3: 1 2.900000 > 4: 1 3.300000 > 5: 1 3.700000 > 6: 2 2.954664 > 7: 2 2.909333 > 8: 2 2.864003 > 9: 2 2.818672 > 10: 2 2.773341 > > HTH, > -steve > -------------- next part -------------- An HTML attachment was scrubbed... URL: From iruckaE at mail2world.com Wed Aug 7 08:24:46 2013 From: iruckaE at mail2world.com (iembry) Date: Tue, 6 Aug 2013 23:24:46 -0700 (PDT) Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <5200BF19.3060001@mdowle.plus.com> References: <1375758602714-4673202.post@n4.nabble.com> <5200BF19.3060001@mdowle.plus.com> Message-ID: <1375856686777-4673265.post@n4.nabble.com> Hi Matthew, partly based on your suggestion I have the following R code: big = rbindlist(lapply(sitefiles,freadDataRatingDepotFiles)) big <- setnames(big,c("y", "shift", "x", "stor")) big <- big[, y:=as.numeric(y)] big <- big[, x:=as.numeric(x)] big <- big[, shift:=as.numeric(shift)] big <- big[, c("stor"):=NULL] big <- na.omit(big) big <- big[,y:=y+shift] big <- big[,shift:=NULL] Thus instead of a list of 59 data.table objects I have one list of over 100,000 rows. How do I know which row range belongs to a certain data.table object (59 of them) for the other calculations? As before I want to subset big (or the list of 59 data.tables) based on their connection to aimall (see below). aimall contains each of the 59 station numbers & the order of aimall matches the order of the 59 data.tables. Does this help clarify what I had previously asked? Thank you. Irucka str(big) Classes ?data.table? and 'data.frame': 112253 obs. of 2 variables: $ y: num 14.8 14.8 14.8 14.8 14.8 ... $ x: num 7900 7920 7930 7950 7970 7980 8000 8010 8030 8050 ... 
- attr(*, ".internal.selfref")= dput(aimall) structure(list(site_no = c("02437100", "02446500", "02467000", "03217500", "03219500", "03227500", "03230700", "03231500", "03439000", "03441000", "03455000", "03479000", "04185000", "04186500", "04189000", "04191500", "04192500", "04193500", "06191500", "06214500", "06218500", "06225500", "06228000", "06235500", "06276500", "06279500", "06287000", "06289000", "06311000", "06313500", "06317000", "06320000", "06320500", "06323000", "06324000", "06324500", "06326500", "06329500", "06342500", "06426500", "06428500", "06436000", "06437000", "06438000", "06818000", "06821500", "06856600", "06860000", "06864000", "06864500", "06865500", "06877600", "06887500", "06889000", "06891000", "06893000", "06934500", "07010000", "07289000"), mean = c(3882.65, 819.82, 23742.37, 224.72, 496.79, 1491.39, 3170.14, 3682.46, 237.02, 127.9, 2955.14, 176.1, 345.72, 296.23, 275.35, 1870.93, 4544.74, 5157.63, 3106.7, 6940.54, 167.04, 1172.53, 771.23, 559.23, 407.46, 2144.53, 3384.37, 148.67, 14.99, 195.91, 267.9, 47.49, 63.49, 96.74, 184.16, 446.52, 565.5, 12419.4, 22372.86, 23.34, 92.56, 100.45, 296.65, 391.31, 43534.12, 16.65, 915.93, 20.16, 197.09, 227.78, 274.43, 1517.04, 5042.7, 5632.7, 7018.45, 52604.19, 81758.03, 186504.25, 755685.3 ), p50 = c(1830, 382, 10400, 50, 140, 500, 1520, 1600, 188, 99, 2260, 115, 130, 75, 62, 460, 1470, 1700, 1390, 3670, 80, 559, 380, 257, 223, 1550, 2730, 82, 3.8, 120, 130, 23, 46, 34, 86, 216, 231, 7900, 20400, 2.9, 36, 7.5, 120, 114, 38200, 6.3, 430, 1, 37, 58, 73, 541, 2320, 2620, 3300, 43200, 61200, 147000, 687000 )), .Names = c("site_no", "mean", "p50"), row.names = c(4463L, 4495L, 4586L, 5353L, 5357L, 5378L, 5393L, 5397L, 6165L, 6169L, 6203L, 6253L, 7304L, 7308L, 7317L, 7326L, 7328L, 7330L, 9633L, 9698L, 9710L, 9725L, 9733L, 9756L, 9832L, 9840L, 9877L, 9889L, 9988L, 9997L, 10010L, 10019L, 10022L, 10029L, 10031L, 10032L, 10041L, 10052L, 10118L, 10284L, 10288L, 10306L, 10317L, 10322L, 11165L, 11185L, 11261L, 11268L, 11281L, 11283L, 11284L, 11325L, 11363L, 11370L, 11401L, 11421L, 11606L, 11626L, 12714L), class = "data.frame") -- View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673265.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Wed Aug 7 10:49:41 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 07 Aug 2013 09:49:41 +0100 Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <1375856686777-4673265.post@n4.nabble.com> References: <1375758602714-4673202.post@n4.nabble.com> <5200BF19.3060001@mdowle.plus.com> <1375856686777-4673265.post@n4.nabble.com> Message-ID: <52020A25.1090407@mdowle.plus.com> Hi, Yes this is much clearer now, thanks. In this case, inside the freadDataRatingDepotFiles function, add a line at the end to add its argument (say 'funArg') as a column before returning it; i.e., ret[,site:=funArg]. Likely then key big by site. Since site is added by := by reference without copying the file's data that has just been read, reading and stacking multiple files should be quite fast using fread and data.table together if there's a bit of tweaking to be done on each file before stacking. 
Matthew On 07/08/13 07:24, iembry wrote: > Hi Matthew, partly based on your suggestion I have the following R code: > > big = rbindlist(lapply(sitefiles,freadDataRatingDepotFiles)) > big <- setnames(big,c("y", "shift", "x", "stor")) > big <- big[, y:=as.numeric(y)] > big <- big[, x:=as.numeric(x)] > big <- big[, shift:=as.numeric(shift)] > big <- big[, c("stor"):=NULL] > big <- na.omit(big) > big <- big[,y:=y+shift] > big <- big[,shift:=NULL] > > Thus instead of a list of 59 data.table objects I have one list of over > 100,000 rows. > > How do I know which row range belongs to a certain data.table object (59 of > them) for the other calculations? > > As before I want to subset big (or the list of 59 data.tables) based on > their connection to aimall (see below). aimall contains each of the 59 > station numbers & the order of aimall matches the order of the 59 > data.tables. > > Does this help clarify what I had previously asked? > > Thank you. > > Irucka > > > > str(big) > Classes ?data.table? and 'data.frame': 112253 obs. of 2 variables: > $ y: num 14.8 14.8 14.8 14.8 14.8 ... > $ x: num 7900 7920 7930 7950 7970 7980 8000 8010 8030 8050 ... > - attr(*, ".internal.selfref")= > > dput(aimall) > structure(list(site_no = c("02437100", "02446500", "02467000", > "03217500", "03219500", "03227500", "03230700", "03231500", "03439000", > "03441000", "03455000", "03479000", "04185000", "04186500", "04189000", > "04191500", "04192500", "04193500", "06191500", "06214500", "06218500", > "06225500", "06228000", "06235500", "06276500", "06279500", "06287000", > "06289000", "06311000", "06313500", "06317000", "06320000", "06320500", > "06323000", "06324000", "06324500", "06326500", "06329500", "06342500", > "06426500", "06428500", "06436000", "06437000", "06438000", "06818000", > "06821500", "06856600", "06860000", "06864000", "06864500", "06865500", > "06877600", "06887500", "06889000", "06891000", "06893000", "06934500", > "07010000", "07289000"), mean = c(3882.65, 819.82, 23742.37, > 224.72, 496.79, 1491.39, 3170.14, 3682.46, 237.02, 127.9, 2955.14, > 176.1, 345.72, 296.23, 275.35, 1870.93, 4544.74, 5157.63, 3106.7, > 6940.54, 167.04, 1172.53, 771.23, 559.23, 407.46, 2144.53, 3384.37, > 148.67, 14.99, 195.91, 267.9, 47.49, 63.49, 96.74, 184.16, 446.52, > 565.5, 12419.4, 22372.86, 23.34, 92.56, 100.45, 296.65, 391.31, > 43534.12, 16.65, 915.93, 20.16, 197.09, 227.78, 274.43, 1517.04, > 5042.7, 5632.7, 7018.45, 52604.19, 81758.03, 186504.25, 755685.3 > ), p50 = c(1830, 382, 10400, 50, 140, 500, 1520, 1600, 188, 99, > 2260, 115, 130, 75, 62, 460, 1470, 1700, 1390, 3670, 80, 559, > 380, 257, 223, 1550, 2730, 82, 3.8, 120, 130, 23, 46, 34, 86, > 216, 231, 7900, 20400, 2.9, 36, 7.5, 120, 114, 38200, 6.3, 430, > 1, 37, 58, 73, 541, 2320, 2620, 3300, 43200, 61200, 147000, 687000 > )), .Names = c("site_no", "mean", "p50"), row.names = c(4463L, > 4495L, 4586L, 5353L, 5357L, 5378L, 5393L, 5397L, 6165L, 6169L, > 6203L, 6253L, 7304L, 7308L, 7317L, 7326L, 7328L, 7330L, 9633L, > 9698L, 9710L, 9725L, 9733L, 9756L, 9832L, 9840L, 9877L, 9889L, > 9988L, 9997L, 10010L, 10019L, 10022L, 10029L, 10031L, 10032L, > 10041L, 10052L, 10118L, 10284L, 10288L, 10306L, 10317L, 10322L, > 11165L, 11185L, 11261L, 11268L, 11281L, 11283L, 11284L, 11325L, > 11363L, 11370L, 11401L, 11421L, 11606L, 11626L, 12714L), class = > "data.frame") > > > > -- > View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673265.html > Sent from the datatable-help mailing list 
archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From iruckaE at mail2world.com Wed Aug 7 18:42:55 2013 From: iruckaE at mail2world.com (iembry) Date: Wed, 7 Aug 2013 09:42:55 -0700 (PDT) Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <52020A25.1090407@mdowle.plus.com> References: <1375758602714-4673202.post@n4.nabble.com> <5200BF19.3060001@mdowle.plus.com> <1375856686777-4673265.post@n4.nabble.com> <52020A25.1090407@mdowle.plus.com> Message-ID: <1375893775454-4673285.post@n4.nabble.com> Hi Matthew, thank you. This is my function and I added the modified line that you suggested below: freadDataRatingDepotFiles <- function (file) { RDdatatmp <- fread(file, autostart=40) RDdatatmp <- RDdatatmp[,site:=funArg] } I used the function on the files and I received the error below: big = rbindlist(lapply(sitefiles,freadDataRatingDepotFiles)) Error in eval(expr, envir, enclos) : object 'funArg' not found Thank you. Irucka -- View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673285.html Sent from the datatable-help mailing list archive at Nabble.com. From mailinglist.honeypot at gmail.com Wed Aug 7 19:03:32 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 7 Aug 2013 10:03:32 -0700 Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <1375893775454-4673285.post@n4.nabble.com> References: <1375758602714-4673202.post@n4.nabble.com> <5200BF19.3060001@mdowle.plus.com> <1375856686777-4673265.post@n4.nabble.com> <52020A25.1090407@mdowle.plus.com> <1375893775454-4673285.post@n4.nabble.com> Message-ID: Hi, On Wed, Aug 7, 2013 at 9:42 AM, iembry wrote: > Hi Matthew, thank you. > > This is my function and I added the modified line that you suggested below: > freadDataRatingDepotFiles <- function (file) > { > RDdatatmp <- fread(file, autostart=40) > RDdatatmp <- RDdatatmp[,site:=funArg] > } > > I used the function on the files and I received the error below: > > big = rbindlist(lapply(sitefiles,freadDataRatingDepotFiles)) > Error in eval(expr, envir, enclos) : object 'funArg' not found The use of `funArg` wasn't a literal suggestion ... the "normal" rules of programming apply here, which is to say that `funArg` (or whatever) needs to be a defined variable before you can use it(!) Likely Matthew used `funArg` as shorthand for "function argument", so you could write your function like so (note that `:=` returns the data.table it modified invisibly, so no need to reassign or call `return()`): freadDataRatingDepotFiles <- function(filename) { tmp <- fread(filename, autostart=40) tmp[, site := filename] } Now, split your code up into more manageable pieces so you can see and verify what is going on: R> dts <- lapply(sitefiles,freadDataRatingDepotFiles) R> all(sapply(dts, is.data.table)) If the last statement doesn't evaluate to TRUE then you have a problem. Assuming it is TRUE, now you simply: R> big <- rbindlist(dts) and continue ... 
HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech From saporta at scarletmail.rutgers.edu Wed Aug 7 23:56:02 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Wed, 7 Aug 2013 17:56:02 -0400 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists Message-ID: Hi all, Note the following discrepancy in structure between as.data.frame & as.data.table when called on a nested list. as.data.frame converts the sublist into individual columns whereas as.data.table stacks them into a single column and creates additional rows. Is this intentional? -Rick as.data.frame(X) # start type end data.editDist data.second # 1 start_node is_similar end_node 1 HelloWorld as.data.table(X) # start type end data # 1: start_node is_similar end_node 1 # 2: start_node is_similar end_node HelloWorld ### Copy+Paste'able Below ### # Example 1: X <- structure(list(start = "start_node", type = "is_similar", end = "end_node", data = structure(list(editDist = 1, second = "HelloWorld"), .Names = c("editDist", "second"))), .Names = c("start", "type", "end", "data")) as.data.frame(X) as.data.table(X) as.data.table(as.data.frame(X)) # Example 2, with more elements: Y <- structure(list(start = c("start_node", "start_node"), type = c("is_similar", "is_similar"), end = c("end_node", "end_node"), data = structure(list(editDist = c(1, 1), second = c("HelloWorld", "HelloWorld")), .Names = c("editDist", "second"))), .Names = c("start", "type", "end", "data")) as.data.frame(Y) as.data.table(Y) -------------- next part -------------- An HTML attachment was scrubbed... URL: From iruckaE at mail2world.com Thu Aug 8 00:44:25 2013 From: iruckaE at mail2world.com (iembry) Date: Wed, 7 Aug 2013 15:44:25 -0700 (PDT) Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: References: <1375758602714-4673202.post@n4.nabble.com> <5200BF19.3060001@mdowle.plus.com> <1375856686777-4673265.post@n4.nabble.com> <52020A25.1090407@mdowle.plus.com> <1375893775454-4673285.post@n4.nabble.com> Message-ID: <1375915465645-4673308.post@n4.nabble.com> Hi Steve and Matthew, thank you both for your suggestions. This is the code that I have now: freadDataRatingDepotFiles <- function (file) { RDdatatmp <- fread(file, autostart=40) RDdatatmp[, site:= file] } big <- lapply(sitefiles,freadDataRatingDepotFiles) big <- rbindlist(big) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) setnames(big[[u]], c("y", "shift", "x", "stor", "site_no"))) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, y:=as.numeric(y)]) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, x:=as.numeric(x)]) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, shift:=as.numeric(shift)]) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, stor:=NULL]) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) na.omit(big[[u]])) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][,y:=y+shift]) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][,shift:=NULL]) big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) setkey(big[[u]], site_no)) I am trying to subset big based on the mean and median values in aimjoin (as described previously in this message thread). 
This is the first row of aimjoin: dput(aimjoin[1]) structure(list(site_no = "02437100", mean = 3882.65, p50 = 1830), .Names = c("site_no", "mean", "p50"), sorted = "site_no", class = c("data.table", "data.frame" ), row.names = c(NA, -1L), .internal.selfref = ) This is one element of big: tempbigdata <- data.frame(c(14.80, 14.81, 14.82), c(7900, 7920, 7930), c("/tried/02437100.exsa.rdb", "/tried/02437100.exsa.rdb", "/tried/02437100.exsa.rdb"), stringsAsFactors = FALSE) names(tempbigdata) <- c("y", "x", "site_no") tempbigdat <- gsub("/tried/", "", tempbigdata) tempbigdat <- gsub(".exsa.rdb", "", tempbigdat) # I tried to remove all characters in the column site_no except for the actual site number, but I ended up with a character vector instead of a data.table This is a revised version of the code that I had written previously to perform the subsetting (prior to using data.table): mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) {ifelse(aimjoin[1]$mean[u] < min(big[[u]]$x), subset(getratings[[u]], aimjoin[1]$mean[u] > min(big[[u]]$x) & aimjoin[1]$mean[u], aimjoin[u]$mean[u] > min(big[[u]]$x)), aimjoin[1]$mean[u])}) I have tried to join aimjoin and big, but I received the error message below: aimjoin[J(big$site_no)] Error in `[.data.table`(aimjoin, J(big$site_no)) : x.'site_no' is a character column being joined to i.'V1' which is type 'NULL'. Character columns must join to factor or character columns. I also tried to merge aimjoin and big, but it was not what I wanted. I would like for the mean and p50 values -- for each site number -- to be joined to the site number in big. I figure that would make it easier to perform the subsetting. I want to subset big based on whether or not the mean or median in aimjoin is less than the minimum value of x in big. Those mean or median values in aimjoin that are smaller than x in big will have to be grouped together for a future step & those mean or median values in aimjoin that are equal to or larger than the x in big will be grouped together for a future step. Can you provide me with advice on how to proceed with the subsetting? Thank you. Irucka -- View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673308.html Sent from the datatable-help mailing list archive at Nabble.com. From FErickson at psu.edu Thu Aug 8 02:14:59 2013 From: FErickson at psu.edu (Frank Erickson) Date: Wed, 7 Aug 2013 19:14:59 -0500 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists In-Reply-To: References: Message-ID: Hi Rick, I guess it's intentional: Matthew saw this SO question (since he edited one of the answers): http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list Some musings: Of course, to reproduce as.data.frame-like behavior, you can un-nest the list, so both functions treat it the same way. Z <- unlist(Y,recursive=FALSE) identical(as.data.table(Z),as.data.table(as.data.frame(Z))) # TRUE # or, equivalently (?) identical(do.call(data.table,Z),data.table(do.call(data.frame,Z))) # TRUE On the other hand, going back the other direction (getting data.table-like behavior when data.frame's is the default) is more awkward, as seen in that SO question (where they mention protecting each sublist with the I() function). Besides, I'm with @flodel, who asked the SO question, in expecting data.table's behavior: one top-level item in the list mapping to one column in the result... 
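A sketch of that I() protection applied to Rick's X from above (X2 is just an illustrative copy; this assumes the length-one elements recycle against the length-two sublist, as data.frame normally does):

X2 <- X
X2$data <- I(X2$data)  # I() marks the sublist "as is", so it survives as one list column
as.data.frame(X2)      # two rows, 'data' kept as a single list column, much like as.data.table(X)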
--Frank On Wed, Aug 7, 2013 at 4:56 PM, Ricardo Saporta < saporta at scarletmail.rutgers.edu> wrote: > Hi all, > > Note the following discrepancy in structure between as.data.frame & > as.data.table when called on a nested list. > as.data.frame converts the sublist into individual columns whereas > as.data.table stacks them into a single column and creates additional rows. > > Is this intentional? > -Rick > > > as.data.frame(X) > # start type end data.editDist data.second > # 1 start_node is_similar end_node 1 HelloWorld > > as.data.table(X) > # start type end data > # 1: start_node is_similar end_node 1 > # 2: start_node is_similar end_node HelloWorld > > > > > ### Copy+Paste'able Below ### > > # Example 1: > X <- structure(list(start = "start_node", type = "is_similar", end = > "end_node", > data = structure(list(editDist = 1, second = "HelloWorld"), .Names = > c("editDist", > "second"))), .Names = c("start", "type", "end", "data")) > > as.data.frame(X) > as.data.table(X) > > as.data.table(as.data.frame(X)) > > > # Example 2, with more elements: > Y <- structure(list(start = c("start_node", "start_node"), type = > c("is_similar", "is_similar"), end = c("end_node", "end_node"), data = > structure(list(editDist = c(1, 1), second = c("HelloWorld", "HelloWorld")), > .Names = c("editDist", "second"))), .Names = c("start", "type", "end", > "data")) > > as.data.frame(Y) > as.data.table(Y) > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Aug 8 05:30:07 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Wed, 7 Aug 2013 23:30:07 -0400 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists In-Reply-To: References: Message-ID: Hey Frank, Thanks for pointing out that SO link, I had missed it. All, I'm curious as to which used cases this functionality would be used in (used for?) thanks, Rick On Wed, Aug 7, 2013 at 8:14 PM, Frank Erickson wrote: > Hi Rick, > > I guess it's intentional: Matthew saw this SO question (since he edited > one of the answers): > http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list > > Some musings: Of course, to reproduce as.data.frame-like behavior, you can > un-nest the list, so both functions treat it the same way. > > Z <- unlist(Y,recursive=FALSE) > > identical(as.data.table(Z),as.data.table(as.data.frame(Z))) # TRUE > # or, equivalently (?) > identical(do.call(data.table,Z),data.table(do.call(data.frame,Z))) # TRUE > > > On the other hand, going back the other direction (getting data.table-like > behavior when data.frame's is the default) is more awkward, as seen in that > SO question (where they mention protecting each sublist with the I() > function). Besides, I'm with @flodel, who asked the SO question, in > expecting data.table's behavior: one top-level item in the list mapping to > one column in the result... > > --Frank > > On Wed, Aug 7, 2013 at 4:56 PM, Ricardo Saporta < > saporta at scarletmail.rutgers.edu> wrote: > >> Hi all, >> >> Note the following discrepancy in structure between as.data.frame & >> as.data.table when called on a nested list. 
>> as.data.frame converts the sublist into individual columns whereas >> as.data.table stacks them into a single column and creates additional rows. >> >> Is this intentional? >> -Rick >> >> >> as.data.frame(X) >> # start type end data.editDist data.second >> # 1 start_node is_similar end_node 1 HelloWorld >> >> as.data.table(X) >> # start type end data >> # 1: start_node is_similar end_node 1 >> # 2: start_node is_similar end_node HelloWorld >> >> >> >> >> ### Copy+Paste'able Below ### >> >> # Example 1: >> X <- structure(list(start = "start_node", type = "is_similar", end = >> "end_node", >> data = structure(list(editDist = 1, second = "HelloWorld"), .Names = >> c("editDist", >> "second"))), .Names = c("start", "type", "end", "data")) >> >> as.data.frame(X) >> as.data.table(X) >> >> as.data.table(as.data.frame(X)) >> >> >> # Example 2, with more elements: >> Y <- structure(list(start = c("start_node", "start_node"), type = >> c("is_similar", "is_similar"), end = c("end_node", "end_node"), data = >> structure(list(editDist = c(1, 1), second = c("HelloWorld", "HelloWorld")), >> .Names = c("editDist", "second"))), .Names = c("start", "type", "end", >> "data")) >> >> as.data.frame(Y) >> as.data.table(Y) >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Aug 8 06:11:05 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 7 Aug 2013 23:11:05 -0500 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists In-Reply-To: References: Message-ID: This seems like a pretty natural interpretation of list->data.table to me, although it would be nice to maybe get a warning I think here: X = list(a = list(1,2), b = list(1,2,3)) as.data.table(X) especially since this simply refuses to do anything: data.table(a = c(1,2), b = c(1,2,3)) On Wed, Aug 7, 2013 at 10:30 PM, Ricardo Saporta < saporta at scarletmail.rutgers.edu> wrote: > Hey Frank, > > Thanks for pointing out that SO link, I had missed it. > > All, > > I'm curious as to which used cases this functionality would be used in > (used for?) > > thanks, > Rick > > > > On Wed, Aug 7, 2013 at 8:14 PM, Frank Erickson wrote: > >> Hi Rick, >> >> I guess it's intentional: Matthew saw this SO question (since he edited >> one of the answers): >> http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list >> >> Some musings: Of course, to reproduce as.data.frame-like behavior, you >> can un-nest the list, so both functions treat it the same way. >> >> Z <- unlist(Y,recursive=FALSE) >> >> identical(as.data.table(Z),as.data.table(as.data.frame(Z))) # TRUE >> # or, equivalently (?) >> identical(do.call(data.table,Z),data.table(do.call(data.frame,Z))) # TRUE >> >> >> On the other hand, going back the other direction (getting >> data.table-like behavior when data.frame's is the default) is more awkward, >> as seen in that SO question (where they mention protecting each sublist >> with the I() function). Besides, I'm with @flodel, who asked the SO >> question, in expecting data.table's behavior: one top-level item in the >> list mapping to one column in the result... 
>> >> --Frank >> >> On Wed, Aug 7, 2013 at 4:56 PM, Ricardo Saporta < >> saporta at scarletmail.rutgers.edu> wrote: >> >>> Hi all, >>> >>> Note the following discrepancy in structure between as.data.frame & >>> as.data.table when called on a nested list. >>> as.data.frame converts the sublist into individual columns whereas >>> as.data.table stacks them into a single column and creates additional rows. >>> >>> Is this intentional? >>> -Rick >>> >>> >>> as.data.frame(X) >>> # start type end data.editDist data.second >>> # 1 start_node is_similar end_node 1 HelloWorld >>> >>> as.data.table(X) >>> # start type end data >>> # 1: start_node is_similar end_node 1 >>> # 2: start_node is_similar end_node HelloWorld >>> >>> >>> >>> >>> ### Copy+Paste'able Below ### >>> >>> # Example 1: >>> X <- structure(list(start = "start_node", type = "is_similar", end = >>> "end_node", >>> data = structure(list(editDist = 1, second = "HelloWorld"), .Names = >>> c("editDist", >>> "second"))), .Names = c("start", "type", "end", "data")) >>> >>> as.data.frame(X) >>> as.data.table(X) >>> >>> as.data.table(as.data.frame(X)) >>> >>> >>> # Example 2, with more elements: >>> Y <- structure(list(start = c("start_node", "start_node"), type = >>> c("is_similar", "is_similar"), end = c("end_node", "end_node"), data = >>> structure(list(editDist = c(1, 1), second = c("HelloWorld", "HelloWorld")), >>> .Names = c("editDist", "second"))), .Names = c("start", "type", "end", >>> "data")) >>> >>> as.data.frame(Y) >>> as.data.table(Y) >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Aug 8 06:16:37 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 08 Aug 2013 05:16:37 +0100 Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <1375915465645-4673308.post@n4.nabble.com> References: <1375758602714-4673202.post@n4.nabble.com> <5200BF19.3060001@mdowle.plus.com> <1375856686777-4673265.post@n4.nabble.com> <52020A25.1090407@mdowle.plus.com> <1375893775454-4673285.post@n4.nabble.com> <1375915465645-4673308.post@n4.nabble.com> Message-ID: <52031BA5.7060904@mdowle.plus.com> Hm. Have you worked through the examples of data.table? Type example(data.table) and try to thoroughly understand each and every example. Just forget your immediate problem for the moment, then come back to it once you've looked at the examples. Further comments inline ... On 07/08/13 23:44, iembry wrote: > Hi Steve and Matthew, thank you both for your suggestions. This is the code > that I have now: > > freadDataRatingDepotFiles <- function (file) > { > RDdatatmp <- fread(file, autostart=40) > RDdatatmp[, site:= file] > } > > big <- lapply(sitefiles,freadDataRatingDepotFiles) > big <- rbindlist(big) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > setnames(big[[u]], c("y", "shift", "x", "stor", "site_no"))) That lapply and big[[u]] doesn't make much sense. big is one big table, with one set of column names. Why loop setnames? 
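A sketch of the point: once rbindlist has returned, big is a single table and one call names all of its columns:

setnames(big, c("y", "shift", "x", "stor", "site_no"))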
> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, > y:=as.numeric(y)]) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, > x:=as.numeric(x)]) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, > shift:=as.numeric(shift)]) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, > stor:=NULL]) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > na.omit(big[[u]])) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > big[[u]][,y:=y+shift]) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > big[[u]][,shift:=NULL]) > big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > setkey(big[[u]], site_no)) Again, all these lapply don't make much sense now big is one big table. > > I am trying to subset big based on the mean and median values in aimjoin (as > described previously in this message thread). But that part of the message thread is no longer here. So I'd have to go and hunt for it. > > This is the first row of aimjoin: > dput(aimjoin[1]) > structure(list(site_no = "02437100", mean = 3882.65, p50 = 1830), .Names = > c("site_no", > "mean", "p50"), sorted = "site_no", class = c("data.table", "data.frame" > ), row.names = c(NA, -1L), .internal.selfref = ) > > This is one element of big: > tempbigdata <- data.frame(c(14.80, 14.81, 14.82), c(7900, 7920, 7930), > c("/tried/02437100.exsa.rdb", "/tried/02437100.exsa.rdb", > "/tried/02437100.exsa.rdb"), stringsAsFactors = FALSE) > names(tempbigdata) <- c("y", "x", "site_no") > tempbigdat <- gsub("/tried/", "", tempbigdata) > tempbigdat <- gsub(".exsa.rdb", "", tempbigdat) Please paste the data itself laid out just like you see it at the prompt. I find it difficult to parse dput output in emails. And longer to paste it into an R session before I see. I often read and reply from a mobile phone, as do others I guess. Questions like this are better presented on stack overflow. > # I tried to remove all > characters in the column site_no except for the actual site number, but I > ended up with a character vector instead of a data.table > > This is a revised version of the code that I had written previously to > perform the subsetting (prior to using data.table): > mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > {ifelse(aimjoin[1]$mean[u] < min(big[[u]]$x), subset(getratings[[u]], > aimjoin[1]$mean[u] > min(big[[u]]$x) & aimjoin[1]$mean[u], > aimjoin[u]$mean[u] > min(big[[u]]$x)), aimjoin[1]$mean[u])}) Again, maybe by big[[u]] you mean big[u] if big is keyed, but I didn't see a setkey above. Seems like you maybe want [,...,by=site]. > > > I have tried to join aimjoin and big, but I received the error message > below: > > aimjoin[J(big$site_no)] > Error in `[.data.table`(aimjoin, J(big$site_no)) : > x.'site_no' is a character column being joined to i.'V1' which is type > 'NULL'. Character columns must join to factor or character columns. I guess that 'site_no' isn't a column of big ... typo of 'site_no'? anyList$notthere is NULL in R and only NULL itself is type NULL, hence the guess. > > > I also tried to merge aimjoin and big, but it was not what I wanted. I would > like for the mean and p50 values -- for each site number -- to be joined to > the site number in big. I figure that would make it easier to perform the > subsetting. Please see examples of good questions on Stack Overflow. 
There you see people put examples of their input and what their desired output is for that input data. I really can't see what you're trying to do. > > I want to subset big based on whether or not the mean or median in aimjoin > is less than the minimum value of x in big. Those mean or median values in > aimjoin that are smaller than x in big will have to be grouped together for > a future step & those mean or median values in aimjoin that are equal to or > larger than the x in big will be grouped together for a future step. > > Can you provide me with advice on how to proceed with the subsetting? Try to construct a really good toy example that demonstrates what you want. Show input and desired output. In this case 2 groups of 5 rows each should be enough to demonstrate. > > Thank you. > > Irucka > > > > -- > View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673308.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Thu Aug 8 06:38:35 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 08 Aug 2013 05:38:35 +0100 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists In-Reply-To: References: Message-ID: <520320CB.1050001@mdowle.plus.com> Agreed, intentional. > L = list(1,2,3) > as.data.table(L) V1 V2 V3 # 3 columns, not one list column 1: 1 2 3 > L = list(1,2,3,list("a",4L,3:10)) # the one nested list here creates one list column > as.data.table(L) V1 V2 V3 V4 1: 1 2 3 a 2: 1 2 3 4 3: 1 2 3 3,4,5,6,7,8, Rick - are you asking for use cases of list columns full stop or use cases of converting nested lists to data.table containing list columns ? On 08/08/13 04:30, Ricardo Saporta wrote: > Hey Frank, > > Thanks for pointing out that SO link, I had missed it. > > All, > > I'm curious as to which used cases this functionality would be used in > (used for?) > > thanks, > Rick > > > > On Wed, Aug 7, 2013 at 8:14 PM, Frank Erickson > wrote: > > Hi Rick, > > I guess it's intentional: Matthew saw this SO question (since he > edited one of the answers): > http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list > > Some musings: Of course, to reproduce as.data.frame-like behavior, > you can un-nest the list, so both functions treat it the same way. > > Z <- unlist(Y,recursive=FALSE) > > identical(as.data.table(Z),as.data.table(as.data.frame(Z))) # TRUE > # or, equivalently (?) > identical(do.call(data.table,Z),data.table(do.call(data.frame,Z))) > # TRUE > > > On the other hand, going back the other direction (getting > data.table-like behavior when data.frame's is the default) is more > awkward, as seen in that SO question (where they mention > protecting each sublist with the I() function). Besides, I'm with > @flodel, who asked the SO question, in expecting data.table's > behavior: one top-level item in the list mapping to one column in > the result... > > --Frank > > On Wed, Aug 7, 2013 at 4:56 PM, Ricardo Saporta > > wrote: > > Hi all, > > Note the following discrepancy in structure between > as.data.frame & as.data.table when called on a nested list. 
> as.data.frame converts the sublist into individual columns > whereas as.data.table stacks them into a single column and > creates additional rows. > > Is this intentional? > -Rick > > > as.data.frame(X) > # start type end data.editDist data.second > # 1 start_node is_similar end_node 1 HelloWorld > > as.data.table(X) > # start type end data > # 1: start_node is_similar end_node 1 > # 2: start_node is_similar end_node HelloWorld > > > > > ### Copy+Paste'able Below ### > > # Example 1: > X <- structure(list(start = "start_node", type = > "is_similar", end = "end_node", > data = structure(list(editDist = 1, second = > "HelloWorld"), .Names = c("editDist", > "second"))), .Names = c("start", "type", "end", "data")) > > as.data.frame(X) > as.data.table(X) > > as.data.table(as.data.frame(X)) > > > # Example 2, with more elements: > Y <- structure(list(start = c("start_node", "start_node"), > type = c("is_similar", "is_similar"), end = c("end_node", > "end_node"), data = structure(list(editDist = c(1, 1), second > = c("HelloWorld", "HelloWorld")), .Names = c("editDist", > "second"))), .Names = c("start", "type", "end", "data")) > > as.data.frame(Y) > as.data.table(Y) > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Aug 8 06:48:47 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 08 Aug 2013 05:48:47 +0100 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists In-Reply-To: References: Message-ID: <5203232F.4020304@mdowle.plus.com> On 08/08/13 05:11, Eduard Antonyan wrote: > This seems like a pretty natural interpretation of list->data.table to > me, although it would be nice to maybe get a warning I think here: > > X = list(a = list(1,2), b = list(1,2,3)) > > as.data.table(X) Good point, that should be recycle-remainder warning. > > especially since this simply refuses to do anything: > > data.table(a = c(1,2), b = c(1,2,3)) Ah but that's for consistency with data.frame ;) > data.frame(1:2,1:3) Error in data.frame(1:2, 1:3) : arguments imply differing number of rows: 2, 3 > Happy to recycle and change both above to recyle-remainder warning. If there's no recycle remainder then no warning, right? Repeating singletons is a common use case I wouldn't want a warning about, for example. Could you file a request please? > > > On Wed, Aug 7, 2013 at 10:30 PM, Ricardo Saporta > > wrote: > > Hey Frank, > > Thanks for pointing out that SO link, I had missed it. > > All, > > I'm curious as to which used cases this functionality would be > used in (used for?) > > thanks, > Rick > > > > On Wed, Aug 7, 2013 at 8:14 PM, Frank Erickson > wrote: > > Hi Rick, > > I guess it's intentional: Matthew saw this SO question (since > he edited one of the answers): > http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list > > Some musings: Of course, to reproduce as.data.frame-like > behavior, you can un-nest the list, so both functions treat it > the same way. 
> > Z <- unlist(Y,recursive=FALSE) > > identical(as.data.table(Z),as.data.table(as.data.frame(Z))) # > TRUE > # or, equivalently (?) > identical(do.call(data.table,Z),data.table(do.call(data.frame,Z))) > # TRUE > > > On the other hand, going back the other direction (getting > data.table-like behavior when data.frame's is the default) is > more awkward, as seen in that SO question (where they mention > protecting each sublist with the I() function). Besides, I'm > with @flodel, who asked the SO question, in expecting > data.table's behavior: one top-level item in the list mapping > to one column in the result... > > --Frank > > On Wed, Aug 7, 2013 at 4:56 PM, Ricardo Saporta > > wrote: > > Hi all, > > Note the following discrepancy in structure between > as.data.frame & as.data.table when called on a nested list. > as.data.frame converts the sublist into individual columns > whereas as.data.table stacks them into a single column and > creates additional rows. > > Is this intentional? > -Rick > > > as.data.frame(X) > # start type end data.editDist data.second > # 1 start_node is_similar end_node 1 HelloWorld > > as.data.table(X) > # start type end data > # 1: start_node is_similar end_node 1 > # 2: start_node is_similar end_node HelloWorld > > > > > ### Copy+Paste'able Below ### > > # Example 1: > X <- structure(list(start = "start_node", type = > "is_similar", end = "end_node", > data = structure(list(editDist = 1, second = > "HelloWorld"), .Names = c("editDist", > "second"))), .Names = c("start", "type", "end", "data")) > > as.data.frame(X) > as.data.table(X) > > as.data.table(as.data.frame(X)) > > > # Example 2, with more elements: > Y <- structure(list(start = c("start_node", "start_node"), > type = c("is_similar", "is_similar"), end = c("end_node", > "end_node"), data = structure(list(editDist = c(1, 1), > second = c("HelloWorld", "HelloWorld")), .Names = > c("editDist", "second"))), .Names = c("start", "type", > "end", "data")) > > as.data.frame(Y) > as.data.table(Y) > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From iruckaE at mail2world.com Thu Aug 8 18:12:25 2013 From: iruckaE at mail2world.com (Irucka Embry) Date: Thu, 8 Aug 2013 09:12:25 -0700 Subject: [datatable-help] subset between data.table list and single data.table object Message-ID: Hi Matthew, thank you for your advice. I went over the examples in data.table, thank you for the suggestion. I also got rid of the lapply statements too. 
big <- lapply(sitefiles,freadDataRatingDepotFiles) big <- rbindlist(big) setnames(big,c("y", "shift", "x", "stor", "site_no")) big <- big[, y:=as.numeric(y)] big <- big[, x:=as.numeric(x)] big <- big[, shift:=as.numeric(shift)] big <- big[, stor:=NULL] big <- na.omit(big) big <- big[,y:=y+shift] big <- big[,shift:=NULL] big <- setkey(big, site_no) I have used dput as people on the main R help list had suggested that dput be used instead of unformatted tables due to text-based e-mail and help list. Based on your suggestions I have the input, intermediate table, and the output tables. Thank you. Irucka INPUT big y x site_no 1: 14.80 7900 /tried/02437100.exsa.rdb 2: 14.81 7920 /tried/02437100.exsa.rdb 3: 14.82 7930 /tried/02437100.exsa.rdb 4: 14.83 7950 /tried/02437100.exsa.rdb 5: 14.84 7970 /tried/02437100.exsa.rdb --- 112249: 57.86 2400000 /tried/07289000.exsa.rdb 112250: 57.87 2410000 /tried/07289000.exsa.rdb 112251: 57.88 2410000 /tried/07289000.exsa.rdb 112252: 57.89 2420000 /tried/07289000.exsa.rdb 112253: 57.90 2430000 /tried/07289000.exsa.rdb aimjoin site_no mean p50 1: 02437100 3882.65 1830.0 2: 02446500 819.82 382.0 3: 02467000 23742.37 10400.0 4: 03217500 224.72 50.0 5: 03219500 496.79 140.0 --- 54: 06889000 5632.70 2620.0 55: 06891000 7018.45 3300.0 56: 06893000 52604.19 43200.0 57: 06934500 81758.03 61200.0 58: 07010000 186504.25 147000.0 59: 07289000 755685.30 687000.0 site_no mean p50 INTERMEDIATE bigintermediate y x site_no mean p50 1: 14.80 7900 02437100 3882.65 1830.0 2: 14.81 7920 02437100 3882.65 1830.0 3: 14.82 7930 02437100 3882.65 1830.0 4: 14.83 7950 02437100 3882.65 1830.0 5: 14.84 7970 02437100 3882.65 1830.0 --- 112249: 57.86 2400000 07289000 755685.30 687000.0 112250: 57.87 2410000 07289000 755685.30 687000.0 112251: 57.88 2410000 07289000 755685.30 687000.0 112252: 57.89 2420000 07289000 755685.30 687000.0 112253: 57.90 2430000 07289000 755685.30 687000.0 OUTPUT bigintermean [where mean of site_no > min(x)] y x site_no mean --- ... 112249: 57.86 2400000 07289000 755685.30 112250: 57.87 2410000 07289000 755685.30 112251: 57.88 2410000 07289000 755685.30 112252: 57.89 2420000 07289000 755685.30 112253: 57.90 2430000 07289000 755685.30 total of 109,452 rows bigintermedian [where p50 of site_no > min(x)] y x site_no p50 --- ... 
112249: 57.86 2400000 07289000 687000.0 112250: 57.87 2410000 07289000 687000.0 112251: 57.88 2410000 07289000 687000.0 112252: 57.89 2420000 07289000 687000.0 112253: 57.90 2430000 07289000 687000.0 total of 109,452 rows bigextramean [where mean of site_no < min(x)] y x site_no mean 1: 14.80 7900 02437100 3882.65 2: 14.81 7920 02437100 3882.65 3: 14.82 7930 02437100 3882.65 4: 14.83 7950 02437100 3882.65 5: 14.84 7970 02437100 3882.65 total of 2671 rows bigextramedian [where p50 of site_no < min(x)] y x site_no p50 1: 14.80 7900 02437100 1830.0 2: 14.81 7920 02437100 1830.0 3: 14.82 7930 02437100 1830.0 4: 14.83 7950 02437100 1830.0 5: 14.84 7970 02437100 1830.0 total of 2671 rows bigextrameanmax [where mean of site_no > max(x)] y x site_no mean 1: 14.80 7900 02437100 3882.65 2: 14.81 7920 02437100 3882.65 3: 14.82 7930 02437100 3882.65 4: 14.83 7950 02437100 3882.65 5: 14.84 7970 02437100 3882.65 total of 2671 rows bigextramedianmax [where p50 of site_no > max(x)] y x site_no p50 1: 14.80 7900 02437100 1830.0 2: 14.81 7920 02437100 1830.0 3: 14.82 7930 02437100 1830.0 4: 14.83 7950 02437100 1830.0 5: 14.84 7970 02437100 1830.0 total of 2671 rows <-----Original Message-----> >From: Matthew Dowle [mdowle at mdowle.plus.com] >Sent: 8/7/2013 11:16:37 PM >To: iruckaE at mail2world.com >Cc: datatable-help at lists.r-forge.r-project.org >Subject: Re: [datatable-help] subset between data.table list and single data.table object > >Hm. Have you worked through the examples of data.table? Type >example(data.table) and try to thoroughly understand each and every >example. Just forget your immediate problem for the moment, then come >back to it once you've looked at the examples. > >Further comments inline ... > > >On 07/08/13 23:44, iembry wrote: >> Hi Steve and Matthew, thank you both for your suggestions. This is the code >> that I have now: >> >> freadDataRatingDepotFiles <- function (file) >> { >> RDdatatmp <- fread(file, autostart=40) >> RDdatatmp[, site:= file] >> } >> >> big <- lapply(sitefiles,freadDataRatingDepotFiles) >> big <- rbindlist(big) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) >> setnames(big[[u]], c("y", "shift", "x", "stor", "site_no"))) >That lapply and big[[u]] doesn't make much sense. big is one big table, >with one set of column names. Why loop setnames? >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, >> y:=as.numeric(y)]) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, >> x:=as.numeric(x)]) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, >> shift:=as.numeric(shift)]) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) big[[u]][, >> stor:=NULL]) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) >> na.omit(big[[u]])) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) >> big[[u]][,y:=y+shift]) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) >> big[[u]][,shift:=NULL]) >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) >> setkey(big[[u]], site_no)) >Again, all these lapply don't make much sense now big is one big table. >> >> I am trying to subset big based on the mean and median values in aimjoin (as >> described previously in this message thread). > >But that part of the message thread is no longer here. So I'd have to >go and hunt for it. 
>
>>
>> This is the first row of aimjoin:
>> dput(aimjoin[1])
>> structure(list(site_no = "02437100", mean = 3882.65, p50 = 1830), .Names =
>> c("site_no",
>> "mean", "p50"), sorted = "site_no", class = c("data.table", "data.frame"
>> ), row.names = c(NA, -1L), .internal.selfref = )
>>
>> This is one element of big:
>> tempbigdata <- data.frame(c(14.80, 14.81, 14.82), c(7900, 7920, 7930),
>> c("/tried/02437100.exsa.rdb", "/tried/02437100.exsa.rdb",
>> "/tried/02437100.exsa.rdb"), stringsAsFactors = FALSE)
>> names(tempbigdata) <- c("y", "x", "site_no")
>> tempbigdat <- gsub("/tried/", "", tempbigdata)
>> tempbigdat <- gsub(".exsa.rdb", "", tempbigdat)
>
>Please paste the data itself laid out just like you see it at the
>prompt. I find it difficult to parse dput output in emails. And longer
>to paste it into an R session before I see. I often read and reply from
>a mobile phone, as do others I guess. Questions like this are better
>presented on stack overflow.
>
>> # I tried to remove all
>> characters in the column site_no except for the actual site number, but I
>> ended up with a character vector instead of a data.table
>>
>> This is a revised version of the code that I had written previously to
>> perform the subsetting (prior to using data.table):
>> mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
>> {ifelse(aimjoin[1]$mean[u] < min(big[[u]]$x), subset(getratings[[u]],
>> aimjoin[1]$mean[u] > min(big[[u]]$x) & aimjoin[1]$mean[u],
>> aimjoin[u]$mean[u] > min(big[[u]]$x)), aimjoin[1]$mean[u])})
>Again, maybe by big[[u]] you mean big[u] if big is keyed, but I didn't
>see a setkey above. Seems like you maybe want [,...,by=site].
>>
>>
>> I have tried to join aimjoin and big, but I received the error message
>> below:
>>
>> aimjoin[J(big$site_no)]
>> Error in `[.data.table`(aimjoin, J(big$site_no)) :
>> x.'site_no' is a character column being joined to i.'V1' which is type
>> 'NULL'. Character columns must join to factor or character columns.
>I guess that 'site_no' isn't a column of big ... typo of 'site_no'?
>anyList$notthere is NULL in R and only NULL itself is type NULL, hence
>the guess.
>>
>>
>> I also tried to merge aimjoin and big, but it was not what I wanted. I would
>> like for the mean and p50 values -- for each site number -- to be joined to
>> the site number in big. I figure that would make it easier to perform the
>> subsetting.
>Please see examples of good questions on Stack Overflow. There you see
>people put examples of their input and what their desired output is for
>that input data. I really can't see what you're trying to do.
>>
>> I want to subset big based on whether or not the mean or median in aimjoin
>> is less than the minimum value of x in big. Those mean or median values in
>> aimjoin that are smaller than x in big will have to be grouped together for
>> a future step & those mean or median values in aimjoin that are equal to or
>> larger than the x in big will be grouped together for a future step.
>>
>> Can you provide me with advice on how to proceed with the subsetting?
>Try to construct a really good toy example that demonstrates what you
>want. Show input and desired output. In this case 2 groups of 5 rows
>each should be enough to demonstrate.
>
>>
>> Thank you.
>>
>> Irucka
>>
>>
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673308.html
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>.
>

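A minimal sketch of the kind of toy example Matthew asks for above. Everything
here is invented (two sites, five rows each); only the column names follow the
thread:

library(data.table)
big <- data.table(y = c(1.1, 1.2, 1.3, 1.4, 1.5, 2.1, 2.2, 2.3, 2.4, 2.5),
                  x = c(10, 20, 30, 40, 50, 100, 200, 300, 400, 500),
                  site_no = rep(c("01", "02"), each = 5))
aimjoin <- data.table(site_no = c("01", "02"), mean = c(5, 250), p50 = c(25, 40))
setkey(big, site_no)
setkey(aimjoin, site_no)
joined <- big[aimjoin]  # join on site_no; mean and p50 are recycled within each site
joined[, if (mean[1] > min(x)) .SD, by = site_no]  # sites whose mean exceeds their min(x): "02"
joined[, if (mean[1] < min(x)) .SD, by = site_no]  # sites whose mean falls below their min(x): "01"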
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From iruckaE at mail2world.com Thu Aug 8 18:37:21 2013
From: iruckaE at mail2world.com (iembry)
Date: Thu, 8 Aug 2013 09:37:21 -0700 (PDT)
Subject: [datatable-help] subset between data.table list and single data.table object
In-Reply-To:
References: <1375758602714-4673202.post@n4.nabble.com>
Message-ID: <1375979841075-4673376.post@n4.nabble.com>

Hi all,

I have exported the tables in .html format to make viewing the tables easier.
See this link: inputoutput.html

Thank you.
Irucka

--
View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673376.html
Sent from the datatable-help mailing list archive at Nabble.com.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From iruckaE at mail2world.com Thu Aug 8 18:46:48 2013
From: iruckaE at mail2world.com (iembry)
Date: Thu, 8 Aug 2013 09:46:48 -0700 (PDT)
Subject: [datatable-help] subset between data.table list and single data.table object
In-Reply-To:
References: <1375758602714-4673202.post@n4.nabble.com>
Message-ID: <1375980408135-4673378.post@n4.nabble.com>

I'm sorry, but the previous .html page was only a .txt page so I'm replacing it here: inputoutput.html

I have also included the preformatted text below.
Irucka

INPUT
big
             y       x                  site_no
     1: 14.80    7900 /tried/02437100.exsa.rdb
     2: 14.81    7920 /tried/02437100.exsa.rdb
     3: 14.82    7930 /tried/02437100.exsa.rdb
     4: 14.83    7950 /tried/02437100.exsa.rdb
     5: 14.84    7970 /tried/02437100.exsa.rdb
    ---
112249: 57.86 2400000 /tried/07289000.exsa.rdb
112250: 57.87 2410000 /tried/07289000.exsa.rdb
112251: 57.88 2410000 /tried/07289000.exsa.rdb
112252: 57.89 2420000 /tried/07289000.exsa.rdb
112253: 57.90 2430000 /tried/07289000.exsa.rdb

aimjoin
     site_no      mean      p50
 1: 02437100   3882.65   1830.0
 2: 02446500    819.82    382.0
 3: 02467000  23742.37  10400.0
 4: 03217500    224.72     50.0
 5: 03219500    496.79    140.0
---
54: 06889000   5632.70   2620.0
55: 06891000   7018.45   3300.0
56: 06893000  52604.19  43200.0
57: 06934500  81758.03  61200.0
58: 07010000 186504.25 147000.0
59: 07289000 755685.30 687000.0
     site_no      mean      p50

INTERMEDIATE
bigintermediate
            y       x  site_no      mean      p50
     1: 14.80    7900 02437100   3882.65   1830.0
     2: 14.81    7920 02437100   3882.65   1830.0
     3: 14.82    7930 02437100   3882.65   1830.0
     4: 14.83    7950 02437100   3882.65   1830.0
     5: 14.84    7970 02437100   3882.65   1830.0
    ---
112249: 57.86 2400000 07289000 755685.30 687000.0
112250: 57.87 2410000 07289000 755685.30 687000.0
112251: 57.88 2410000 07289000 755685.30 687000.0
112252: 57.89 2420000 07289000 755685.30 687000.0
112253: 57.90 2430000 07289000 755685.30 687000.0

OUTPUT
bigintermean [where mean of site_no > min(x)]
            y       x  site_no      mean
    ---
...
112249: 57.86 2400000 07289000 755685.30
112250: 57.87 2410000 07289000 755685.30
112251: 57.88 2410000 07289000 755685.30
112252: 57.89 2420000 07289000 755685.30
112253: 57.90 2430000 07289000 755685.30

total of 109,452 rows

bigintermedian [where p50 of site_no > min(x)]
            y       x  site_no      p50
    ---
...
112249: 57.86 2400000 07289000 687000.0
112250: 57.87 2410000 07289000 687000.0
112251: 57.88 2410000 07289000 687000.0
112252: 57.89 2420000 07289000 687000.0
112253: 57.90 2430000 07289000 687000.0

total of 109,452 rows

bigextramean [where mean of site_no < min(x)]
       y    x  site_no    mean
1: 14.80 7900 02437100 3882.65
2: 14.81 7920 02437100 3882.65
3: 14.82 7930 02437100 3882.65
4: 14.83 7950 02437100 3882.65
5: 14.84 7970 02437100 3882.65

total of 2671 rows

bigextramedian [where p50 of site_no < min(x)]
       y    x  site_no    p50
1: 14.80 7900 02437100 1830.0
2: 14.81 7920 02437100 1830.0
3: 14.82 7930 02437100 1830.0
4: 14.83 7950 02437100 1830.0
5: 14.84 7970 02437100 1830.0

total of 2671 rows

bigextrameanmax [where mean of site_no > max(x)]
       y    x  site_no    mean
1: 14.80 7900 02437100 3882.65
2: 14.81 7920 02437100 3882.65
3: 14.82 7930 02437100 3882.65
4: 14.83 7950 02437100 3882.65
5: 14.84 7970 02437100 3882.65

total of 2671 rows

bigextramedianmax [where p50 of site_no > max(x)]
       y    x  site_no    p50
1: 14.80 7900 02437100 1830.0
2: 14.81 7920 02437100 1830.0
3: 14.82 7930 02437100 1830.0
4: 14.83 7950 02437100 1830.0
5: 14.84 7970 02437100 1830.0

total of 2671 rows

--
View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673378.html
Sent from the datatable-help mailing list archive at Nabble.com.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From lianoglou.steve at gene.com Thu Aug 8 18:52:57 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Thu, 8 Aug 2013 09:52:57 -0700
Subject: [datatable-help] subset between data.table list and single data.table object
In-Reply-To:
References:
Message-ID:

Hi,

On Thu, Aug 8, 2013 at 9:12 AM, Irucka Embry wrote:
> Hi Matthew, thank you for your advice.
>
> I went over the examples in data.table, thank you for the suggestion. I also
> got rid of the lapply statements too.
>
>
> big <- lapply(sitefiles,freadDataRatingDepotFiles)
> big <- rbindlist(big)
> setnames(big,c("y", "shift", "x", "stor", "site_no"))
>
> big <- big[, y:=as.numeric(y)]
> big <- big[, x:=as.numeric(x)]
> big <- big[, shift:=as.numeric(shift)]
> big <- big[, stor:=NULL]
>
> big <- na.omit(big)
> big <- big[,y:=y+shift]
> big <- big[,shift:=NULL]
> big <- setkey(big, site_no)
>
> I had used dput because people on the main R help list suggested using dput
> rather than unformatted tables, given that this is a text-based e-mail list.
> Based on your suggestions I have included the input, intermediate, and
> output tables below.

I would argue that dput is still useful. People prefer it so they can
copy-paste text from an email into an R session and help you work out
concrete advice based on your data. It is still your responsibility to
produce an example that people can work with, though -- Matthew
suggested making a good toy example with tables that are much smaller
than your real data so we can easily distill what you want there -- in
constructing these examples, you may (likely) figure out how to fix the
problem yourself and not require more hand holding. Provide two toy
tables to play with -- not your current problem that has 50

These examples are too involved for me to follow along as they are, sorry.

That having been said, I'll simply point out that you need to fix the
`site_no` column in your `big` table -- I suspect you might want to use
that column to join across several of your data.tables (since it looks
like they all have `site_no`).
But look at `big`: > big > y x site_no > 1: 14.80 7900 /tried/02437100.exsa.rdb > 2: 14.81 7920 /tried/02437100.exsa.rdb > 3: 14.82 7930 /tried/02437100.exsa.rdb > 4: 14.83 7950 /tried/02437100.exsa.rdb > 5: 14.84 7970 /tried/02437100.exsa.rdb > --- > 112249: 57.86 2400000 /tried/07289000.exsa.rdb > 112250: 57.87 2410000 /tried/07289000.exsa.rdb > 112251: 57.88 2410000 /tried/07289000.exsa.rdb > 112252: 57.89 2420000 /tried/07289000.exsa.rdb > 112253: 57.90 2430000 /tried/07289000.exsa.rdb Now look at the rest of your table `site_no` values: > aimjoin > site_no mean p50 > 1: 02437100 3882.65 1830.0 > 2: 02446500 819.82 382.0 > 3: 02467000 23742.37 10400.0 > 4: 03217500 224.72 50.0 > 5: 03219500 496.79 140.0 You (obviously) need to strip the "/tried/" from the beginning and ".exsa.rdb" from the end of your big$sit_no column for it to work with the rest of your data. You can do that like so: R> big[, site_no := sub(".exsa.rdb", "", basename(site), fixed=TRUE)] Once that's done, you can easily get from `big` and `aimjoin` to `bigintermediate` ... as for the rest, it's not clear to me what your queries of "where mean of site_no > min(x)" really mean -- you can still use `subset` on data.tables like you can data.frames, and it seems to me that calling those judiciously gets you where you want to go, so I'm not quite sure what the problem is -- it's probably my understanding of what you want. -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From mdowle at mdowle.plus.com Thu Aug 8 19:04:29 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 08 Aug 2013 18:04:29 +0100 Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: References: Message-ID: <5203CF9D.9040800@mdowle.plus.com> On 08/08/13 17:12, Irucka Embry wrote: > Hi Matthew, thank you for your advice. > > I went over the examples in data.table, thank you for the suggestion. > I also got rid of the lapply statements too. > > big <- lapply(sitefiles,freadDataRatingDepotFiles) > big <- rbindlist(big) > setnames(big,c("y", "shift", "x", "stor", "site_no")) > big <- big[, y:=as.numeric(y)] > big <- big[, x:=as.numeric(x)] > big <- big[, shift:=as.numeric(shift)] > big <- big[, stor:=NULL] > big <- na.omit(big) > big <- big[,y:=y+shift] > big <- big[,shift:=NULL] > big <- setkey(big, site_no) Great. Looks good now. Although, you don't need to assign to big on the left. := already updates by reference. Same for setkey, no need to assign the result, it changes big by reference. Only the na.omit returns a new object, so that result does need to be assigned to big. > > I have used dput as people on the main R help list had suggested that > dput be used instead of unformatted tables due to text-based e-mail > and help list. That's true and appropriate when the question is to do with a specific error message. In that case your aim is for potential answers to be able to reproduce the error message as easily and quickly as possible, so it needs to be pasteable. You're asking a different type of question; i.e., how-do-I-do-this? To answer quickly we can just type some pseudo code to give you a hint, possibly within 30 seconds from a mobile phone. Is there any reason why you don't want to ask this on Stack Overflow? These type of questions are better asked there. But also a table of data is easily readable using fread or readLines, if the answerer wants to test his answer before posting. 
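A minimal sketch of the by-reference point above, on an invented toy table
(nothing here is the thread's data):

library(data.table)
DT <- data.table(a = c(3L, 1L, 2L), b = c(10, NA, 30))
DT[, d := a * 2]   # := adds d to DT in place; no `DT <-` on the left needed
setkey(DT, a)      # also by reference: sorts DT and marks the key
key(DT)            # "a"
DT <- na.omit(DT)  # na.omit() returns a new object, so this result must be assigned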
> Based on your suggestions I have the input, intermediate table, and > the output tables. Ok, further comment below ... > > Thank you. > > Irucka > > > > INPUT > big > y x site_no > 1: 14.80 7900 /tried/02437100.exsa.rdb > 2: 14.81 7920 /tried/02437100.exsa.rdb > 3: 14.82 7930 /tried/02437100.exsa.rdb > 4: 14.83 7950 /tried/02437100.exsa.rdb > 5: 14.84 7970 /tried/02437100.exsa.rdb > --- > 112249: 57.86 2400000 /tried/07289000.exsa.rdb > 112250: 57.87 2410000 /tried/07289000.exsa.rdb > 112251: 57.88 2410000 /tried/07289000.exsa.rdb > 112252: 57.89 2420000 /tried/07289000.exsa.rdb > 112253: 57.90 2430000 /tried/07289000.exsa.rdb > > > aimjoin > site_no mean p50 > 1: 02437100 3882.65 1830.0 > 2: 02446500 819.82 382.0 > 3: 02467000 23742.37 10400.0 > 4: 03217500 224.72 50.0 > 5: 03219500 496.79 140.0 > --- > 54: 06889000 5632.70 2620.0 > 55: 06891000 7018.45 3300.0 > 56: 06893000 52604.19 43200.0 > 57: 06934500 81758.03 61200.0 > 58: 07010000 186504.25 147000.0 > 59: 07289000 755685.30 687000.0 > site_no mean p50 > What you've done is provide a small subset of your real large data. Easier and quicker for you, but harder for us to see. For example, I can't see all the data for site 02437100's mean. The 2 groups of 5 rows that I suggested needs to be the entire dataset. A new one, a dummy one: a toy tiny example with simple data. Please search for guidance on how to ask good questions. Please spend longer on Stack Overflow looking at other's questions for inspiration and guidance. > > > INTERMEDIATE > bigintermediate > y x site_no mean p50 > 1: 14.80 7900 02437100 3882.65 1830.0 > 2: 14.81 7920 02437100 3882.65 1830.0 > 3: 14.82 7930 02437100 3882.65 1830.0 > 4: 14.83 7950 02437100 3882.65 1830.0 > 5: 14.84 7970 02437100 3882.65 1830.0 > --- > 112249: 57.86 2400000 07289000 755685.30 687000.0 > 112250: 57.87 2410000 07289000 755685.30 687000.0 > 112251: 57.88 2410000 07289000 755685.30 687000.0 > 112252: 57.89 2420000 07289000 755685.30 687000.0 > 112253: 57.90 2430000 07289000 755685.30 687000.0 > > > > OUTPUT > bigintermean [where mean of site_no > min(x)] > y x site_no mean > --- > > ... > 112249: 57.86 2400000 07289000 755685.30 > 112250: 57.87 2410000 07289000 755685.30 > 112251: 57.88 2410000 07289000 755685.30 > 112252: 57.89 2420000 07289000 755685.30 > 112253: 57.90 2430000 07289000 755685.30 > > total of 109,452 rows > > > > bigintermedian [where p50 of site_no > min(x)] > y x site_no p50 > --- > > ... 
> 112249: 57.86 2400000 07289000 687000.0 > 112250: 57.87 2410000 07289000 687000.0 > 112251: 57.88 2410000 07289000 687000.0 > 112252: 57.89 2420000 07289000 687000.0 > 112253: 57.90 2430000 07289000 687000.0 > > total of 109,452 rows > > > > > bigextramean [where mean of site_no < min(x)] > y x site_no mean > 1: 14.80 7900 02437100 3882.65 > 2: 14.81 7920 02437100 3882.65 > 3: 14.82 7930 02437100 3882.65 > 4: 14.83 7950 02437100 3882.65 > 5: 14.84 7970 02437100 3882.65 > > total of 2671 rows > > > bigextramedian [where p50 of site_no < min(x)] > y x site_no p50 > 1: 14.80 7900 02437100 1830.0 > 2: 14.81 7920 02437100 1830.0 > 3: 14.82 7930 02437100 1830.0 > 4: 14.83 7950 02437100 1830.0 > 5: 14.84 7970 02437100 1830.0 > > total of 2671 rows > > > > bigextrameanmax [where mean of site_no > max(x)] > y x site_no mean > 1: 14.80 7900 02437100 3882.65 > 2: 14.81 7920 02437100 3882.65 > 3: 14.82 7930 02437100 3882.65 > 4: 14.83 7950 02437100 3882.65 > 5: 14.84 7970 02437100 3882.65 > > total of 2671 rows > > > bigextramedianmax [where p50 of site_no > max(x)] > y x site_no p50 > 1: 14.80 7900 02437100 1830.0 > 2: 14.81 7920 02437100 1830.0 > 3: 14.82 7930 02437100 1830.0 > 4: 14.83 7950 02437100 1830.0 > 5: 14.84 7970 02437100 1830.0 > > total of 2671 rows > > > > > > > <-----Original Message-----> > >From: Matthew Dowle [mdowle at mdowle.plus.com] > >Sent: 8/7/2013 11:16:37 PM > >To: iruckaE at mail2world.com > >Cc: datatable-help at lists.r-forge.r-project.org > >Subject: Re: [datatable-help] subset between data.table list and > single data.table object > > > >Hm. Have you worked through the examples of data.table? Type > >example(data.table) and try to thoroughly understand each and every > >example. Just forget your immediate problem for the moment, then come > >back to it once you've looked at the examples. > > > >Further comments inline ... > > > > > >On 07/08/13 23:44, iembry wrote: > >> Hi Steve and Matthew, thank you both for your suggestions. This is > the code > >> that I have now: > >> > >> freadDataRatingDepotFiles <- function (file) > >> { > >> RDdatatmp <- fread(file, autostart=40) > >> RDdatatmp[, site:= file] > >> } > >> > >> big <- lapply(sitefiles,freadDataRatingDepotFiles) > >> big <- rbindlist(big) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > >> setnames(big[[u]], c("y", "shift", "x", "stor", "site_no"))) > >That lapply and big[[u]] doesn't make much sense. big is one big table, > >with one set of column names. Why loop setnames? > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > big[[u]][, > >> y:=as.numeric(y)]) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > big[[u]][, > >> x:=as.numeric(x)]) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > big[[u]][, > >> shift:=as.numeric(shift)]) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > big[[u]][, > >> stor:=NULL]) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > >> na.omit(big[[u]])) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > >> big[[u]][,y:=y+shift]) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > >> big[[u]][,shift:=NULL]) > >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > >> setkey(big[[u]], site_no)) > >Again, all these lapply don't make much sense now big is one big table. 
> >> > >> I am trying to subset big based on the mean and median values in > aimjoin (as > >> described previously in this message thread). > > > >But that part of the message thread is no longer here. So I'd have to > >go and hunt for it. > > > >> > >> This is the first row of aimjoin: > >> dput(aimjoin[1]) > >> structure(list(site_no = "02437100", mean = 3882.65, p50 = 1830), > .Names = > >> c("site_no", > >> "mean", "p50"), sorted = "site_no", class = c("data.table", > "data.frame" > >> ), row.names = c(NA, -1L), .internal.selfref = ) > >> > >> This is one element of big: > >> tempbigdata <- data.frame(c(14.80, 14.81, 14.82), c(7900, 7920, 7930), > >> c("/tried/02437100.exsa.rdb", "/tried/02437100.exsa.rdb", > >> "/tried/02437100.exsa.rdb"), stringsAsFactors = FALSE) > >> names(tempbigdata) <- c("y", "x", "site_no") > >> tempbigdat <- gsub("/tried/", "", tempbigdata) > >> tempbigdat <- gsub(".exsa.rdb", "", tempbigdat) > > > >Please paste the data itself laid out just like you see it at the > >prompt. I find it difficult to parse dput output in emails. And longer > >to paste it into an R session before I see. I often read and reply from > >a mobile phone, as do others I guess. Questions like this are better > >presented on stack overflow. > > > >> # I tried to remove all > >> characters in the column site_no except for the actual site number, > but I > >> ended up with a character vector instead of a data.table > >> > >> This is a revised version of the code that I had written previously to > >> perform the subsetting (prior to using data.table): > >> mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u) > >> {ifelse(aimjoin[1]$mean[u] < min(big[[u]]$x), subset(getratings[[u]], > >> aimjoin[1]$mean[u] > min(big[[u]]$x) & aimjoin[1]$mean[u], > >> aimjoin[u]$mean[u] > min(big[[u]]$x)), aimjoin[1]$mean[u])}) > >Again, maybe by big[[u]] you mean big[u] if big is keyed, but I didn't > >see a setkey above. Seems like you maybe want [,...,by=site]. > >> > >> > >> I have tried to join aimjoin and big, but I received the error message > >> below: > >> > >> aimjoin[J(big$site_no)] > >> Error in `[.data.table`(aimjoin, J(big$site_no)) : > >> x.'site_no' is a character column being joined to i.'V1' which is type > >> 'NULL'. Character columns must join to factor or character columns. > >I guess that 'site_no' isn't a column of big ... typo of 'site_no'? > >anyList$notthere is NULL in R and only NULL itself is type NULL, hence > >the guess. > >> > >> > >> I also tried to merge aimjoin and big, but it was not what I > wanted. I would > >> like for the mean and p50 values -- for each site number -- to be > joined to > >> the site number in big. I figure that would make it easier to > perform the > >> subsetting. > >Please see examples of good questions on Stack Overflow. There you see > >people put examples of their input and what their desired output is for > >that input data. I really can't see what you're trying to do. > >> > >> I want to subset big based on whether or not the mean or median in > aimjoin > >> is less than the minimum value of x in big. Those mean or median > values in > >> aimjoin that are smaller than x in big will have to be grouped > together for > >> a future step & those mean or median values in aimjoin that are > equal to or > >> larger than the x in big will be grouped together for a future step. > >> > >> Can you provide me with advice on how to proceed with the subsetting? > >Try to construct a really good toy example that demonstrates what you > >want. 
Show input and desired output. In this case 2 groups of 5 rows > >each should be enough to demonstrate. > > > >> > >> Thank you. > >> > >> Irucka > >> > >> > >> > >> -- > >> View this message in context: > http://r.789695.n4.nabble.com/subset-between-data-table-list- > >and-single-data-table-object-tp4673202p4673308.html > >> Sent from the datatable-help mailing list archive at Nabble.com. > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > > > >. > > > > _______________________________________________________________ > Get the Free email that has everyone talking at http://www.mail2world.com > Unlimited Email Storage -- POP3 -- Calendar -- SMS -- Translator -- > Much More! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Aug 8 22:26:20 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 8 Aug 2013 16:26:20 -0400 Subject: [datatable-help] Discrepancy between as.data.frame & as.data.table when handling nested lists In-Reply-To: <520320CB.1050001@mdowle.plus.com> References: <520320CB.1050001@mdowle.plus.com> Message-ID: Matthew, <> The Latter. Specifically, in the last example you give > L = list(1,2,3,list("a",4L,3:10)) # the one nested list here creates one list column > as.data.table(L) V1 V2 V3 V4 1: 1 2 3 a 2: 1 2 3 4 3: 1 2 3 3,4,5,6,7,8, In the example above, we are essentially creating a tuple between each element in the sublist (what becomes V4) and all other elements in L. I am curious as to what this might reflect in real life? I often encounter the opposite need, for example, the sublist being a group of sub-properties. My intuition is that is that sublist is clearly grouped and so should most probably be treated as such. If anything, I would think of expanding the sublist horizontally into additional columns. And if I understand correctly, when the number of elements in the sublist is the same as the length of all the other elements of the main list, then the sublist's groups are preserved, correct? eg: L2 = list( 1:2 , 11:12, 31:32, list(c("a", "b", "c"), c("q","r","s")) ) as.data.table(L2) # V1 V2 V3 V4 # 1: 1 11 31 a,b,c # 2: 2 12 32 q,r,s My curiosity about use cases is mostly for learnings sake Cheers, Rick On Thu, Aug 8, 2013 at 12:38 AM, Matthew Dowle wrote: > > Agreed, intentional. > > > L = list(1,2,3) > > as.data.table(L) > V1 V2 V3 # 3 columns, not one list column > 1: 1 2 3 > > > L = list(1,2,3,list("a",4L,3:10)) # the one nested list here creates > one list column > > as.data.table(L) > V1 V2 V3 V4 > 1: 1 2 3 a > 2: 1 2 3 4 > 3: 1 2 3 3,4,5,6,7,8, > > Rick - are you asking for use cases of list columns full stop or use cases > of converting nested lists to data.table containing list columns ? > > > > On 08/08/13 04:30, Ricardo Saporta wrote: > > Hey Frank, > > Thanks for pointing out that SO link, I had missed it. > > All, > > I'm curious as to which used cases this functionality would be used in > (used for?) 
> > thanks, > Rick > > > > On Wed, Aug 7, 2013 at 8:14 PM, Frank Erickson wrote: > >> Hi Rick, >> >> I guess it's intentional: Matthew saw this SO question (since he edited >> one of the answers): >> http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list >> >> Some musings: Of course, to reproduce as.data.frame-like behavior, you >> can un-nest the list, so both functions treat it the same way. >> >> Z <- unlist(Y,recursive=FALSE) >> >> identical(as.data.table(Z),as.data.table(as.data.frame(Z))) # TRUE >> # or, equivalently (?) >> identical(do.call(data.table,Z),data.table(do.call(data.frame,Z))) # >> TRUE >> >> >> On the other hand, going back the other direction (getting >> data.table-like behavior when data.frame's is the default) is more awkward, >> as seen in that SO question (where they mention protecting each sublist >> with the I() function). Besides, I'm with @flodel, who asked the SO >> question, in expecting data.table's behavior: one top-level item in the >> list mapping to one column in the result... >> >> --Frank >> >> On Wed, Aug 7, 2013 at 4:56 PM, Ricardo Saporta < >> saporta at scarletmail.rutgers.edu> wrote: >> >>> Hi all, >>> >>> Note the following discrepancy in structure between as.data.frame & >>> as.data.table when called on a nested list. >>> as.data.frame converts the sublist into individual columns whereas >>> as.data.table stacks them into a single column and creates additional rows. >>> >>> Is this intentional? >>> -Rick >>> >>> >>> as.data.frame(X) >>> # start type end data.editDist data.second >>> # 1 start_node is_similar end_node 1 HelloWorld >>> >>> as.data.table(X) >>> # start type end data >>> # 1: start_node is_similar end_node 1 >>> # 2: start_node is_similar end_node HelloWorld >>> >>> >>> >>> >>> ### Copy+Paste'able Below ### >>> >>> # Example 1: >>> X <- structure(list(start = "start_node", type = "is_similar", end = >>> "end_node", >>> data = structure(list(editDist = 1, second = "HelloWorld"), .Names = >>> c("editDist", >>> "second"))), .Names = c("start", "type", "end", "data")) >>> >>> as.data.frame(X) >>> as.data.table(X) >>> >>> as.data.table(as.data.frame(X)) >>> >>> >>> # Example 2, with more elements: >>> Y <- structure(list(start = c("start_node", "start_node"), type = >>> c("is_similar", "is_similar"), end = c("end_node", "end_node"), data = >>> structure(list(editDist = c(1, 1), second = c("HelloWorld", "HelloWorld")), >>> .Names = c("editDist", "second"))), .Names = c("start", "type", "end", >>> "data")) >>> >>> as.data.frame(Y) >>> as.data.table(Y) >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > > > _______________________________________________ > datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iruckaE at mail2world.com Sat Aug 10 21:04:25 2013 From: iruckaE at mail2world.com (iembry) Date: Sat, 10 Aug 2013 12:04:25 -0700 (PDT) Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <5203CF9D.9040800@mdowle.plus.com> References: <1375758602714-4673202.post@n4.nabble.com> <5203CF9D.9040800@mdowle.plus.com> Message-ID: <1376161465422-4673496.post@n4.nabble.com> Hi Matthew and Steve, thank you both for your suggestions. I had not used the basename function before and that was what was keeping me from moving forward along with my coding. As Matthew has suggested I made a post at http://stackoverflow.com/questions/18165373/interpolation-of-grouped-data-using-data-table with the remaining question that I have. This is my revised code: big <- lapply(sitefiles,freadDataRatingDepotFiles) big <- rbindlist(big) setnames(big,c("y", "shift", "x", "stor", "site_no")) big[, y:=as.numeric(y)] big[, x:=as.numeric(x)] big[, shift:=as.numeric(shift)] big[, stor:=NULL] big <- na.omit(big) big[,y:=y+shift] big[,shift:=NULL] big[, site_no := sub(".exsa.rdb", "", basename(site_no), fixed=TRUE)] setkey(big, site_no) big <- big[aimjoin] Once again, thank you both for your advice and suggestions. Irucka -- View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673496.html Sent from the datatable-help mailing list archive at Nabble.com. From iruckaE at mail2world.com Mon Aug 12 15:02:01 2013 From: iruckaE at mail2world.com (iembry) Date: Mon, 12 Aug 2013 06:02:01 -0700 (PDT) Subject: [datatable-help] subset between data.table list and single data.table object In-Reply-To: <1376161465422-4673496.post@n4.nabble.com> References: <1375758602714-4673202.post@n4.nabble.com> <5203CF9D.9040800@mdowle.plus.com> <1376161465422-4673496.post@n4.nabble.com> Message-ID: <1376312521886-4673560.post@n4.nabble.com> The following code represents the answer to my questions with assistance from both this mailing list and stackoverflow: big <- lapply(sitefiles,freadDataRatingDepotFiles) big <- rbindlist(big) setnames(big,c("y", "shift", "x", "stor", "site_no")) big[, y:=as.numeric(y)] big[, x:=as.numeric(x)] big[, shift:=as.numeric(shift)] big[, stor:=NULL] big <- na.omit(big) big[,y:=y+shift] big[,shift:=NULL] big[, site_no := sub(".exsa.rdb", "", basename(site_no), fixed=TRUE)] setkey(big, site_no) meaninterp <- big[aimjoin][,if(mean[1] > min(x)){interp1(x, y, xi = mean[1], method ="linear")},by=site_no] setnames(meaninterp,c("site_no", "mean")) medianinterp <- big[aimjoin][,if(p50[1] > min(x)){interp1(x, y, xi = p50[1], method ="linear")},by=site_no] setnames(medianinterp,c("site_no", "median")) meanextrap <- big[aimjoin][,if(mean[1] < min(x)){approxExtrap(x, y, xout = mean[1], method ="linear")},by=site_no] medianextrap <- big[aimjoin][,if(p50[1] < min(x)){approxExtrap(x, y, xout = p50[1], method ="linear")},by=site_no] -- View this message in context: http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673560.html Sent from the datatable-help mailing list archive at Nabble.com. From mailinglist.honeypot at gmail.com Mon Aug 12 19:51:28 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Mon, 12 Aug 2013 10:51:28 -0700 Subject: [datatable-help] unique.data.frame should create a copy, right? 
In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Hi folks, I actually want to revisit the fix I made here. Instead of having `use.key` in the signature to unique.data.table (and duplicated.data.table) to be: function(x, incomparables=FALSE, tolerance=.Machine$double.eps ^ 0.5, use.key=TRUE, ...) How about we switch out use.key for a parameter that specifies the column names to use in the uniqueness check, which defaults to key(x) to keep backwards compatibility. For argument's sake (like that?), lets call this parameter `columns` (by.columns? with.columns? whatever) so: function(x, incomparables=FALSE, tolerance=.Machine$double.eps ^ 0.5, columns=key(x), ...) Then: (1) leaving it alone is the backward compatibile behavior; (2) Perhaps setting it to NULL will use all columns, and make it equivalent to unique.data.frame (also the same when x has no key); and (3) setting it to any other combo of columns uses those columns as the uniqueness key and filters the rows (only) out of x accordingly. What do you folks think? Personally I think this is better on all accounts then just specifying to use the key or not and the only question in my mind is the name of the argument -- happy to hear other world views, however, so don't be shy. Thanks, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From FErickson at psu.edu Mon Aug 12 20:11:41 2013 From: FErickson at psu.edu (Frank Erickson) Date: Mon, 12 Aug 2013 13:11:41 -0500 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Yeah, this sounds like a better way to do it. Thanks for doing all this! How about "compare.cols" or "compare.by"? --Frank On Mon, Aug 12, 2013 at 12:51 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi folks, > > I actually want to revisit the fix I made here. > > Instead of having `use.key` in the signature to unique.data.table (and > duplicated.data.table) to be: > > function(x, > incomparables=FALSE, > tolerance=.Machine$double.eps ^ 0.5, > use.key=TRUE, ...) > > How about we switch out use.key for a parameter that specifies the > column names to use in the uniqueness check, which defaults to key(x) > to keep backwards compatibility. > > For argument's sake (like that?), lets call this parameter `columns` > (by.columns? with.columns? whatever) so: > > function(x, > incomparables=FALSE, > tolerance=.Machine$double.eps ^ 0.5, > columns=key(x), ...) > > Then: > > (1) leaving it alone is the backward compatibile behavior; > (2) Perhaps setting it to NULL will use all columns, and make it > equivalent to unique.data.frame (also the same when x has no key); and > (3) setting it to any other combo of columns uses those columns as the > uniqueness key and filters the rows (only) out of x accordingly. > > What do you folks think? Personally I think this is better on all > accounts then just specifying to use the key or not and the only > question in my mind is the name of the argument -- happy to hear other > world views, however, so don't be shy. 
> > Thanks, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Mon Aug 12 20:12:48 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Mon, 12 Aug 2013 14:12:48 -0400 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Steve, I like your suggestion a lot. I can see putting column specification to good use. As for the argument name, perhaps 'use.columns' And where a value of NULL or FALSE will yield same results as `unique.data.frame` use.columns=key(x) # default behavior use.columns=c("col1name", "col7name") #etc use.columns=NULL Thanks as always, Rick On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi folks, > > I actually want to revisit the fix I made here. > > Instead of having `use.key` in the signature to unique.data.table (and > duplicated.data.table) to be: > > function(x, > incomparables=FALSE, > tolerance=.Machine$double.eps ^ 0.5, > use.key=TRUE, ...) > > How about we switch out use.key for a parameter that specifies the > column names to use in the uniqueness check, which defaults to key(x) > to keep backwards compatibility. > > For argument's sake (like that?), lets call this parameter `columns` > (by.columns? with.columns? whatever) so: > > function(x, > incomparables=FALSE, > tolerance=.Machine$double.eps ^ 0.5, > columns=key(x), ...) > > Then: > > (1) leaving it alone is the backward compatibile behavior; > (2) Perhaps setting it to NULL will use all columns, and make it > equivalent to unique.data.frame (also the same when x has no key); and > (3) setting it to any other combo of columns uses those columns as the > uniqueness key and filters the rows (only) out of x accordingly. > > What do you folks think? Personally I think this is better on all > accounts then just specifying to use the key or not and the only > question in my mind is the name of the argument -- happy to hear other > world views, however, so don't be shy. > > Thanks, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Tue Aug 13 22:24:04 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Tue, 13 Aug 2013 13:24:04 -0700 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Thanks for the suggestions, folks. Matthew: do you have a preference? -steve On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta wrote: > Steve, > > I like your suggestion a lot. I can see putting column specification to > good use. 
> > As for the argument name, perhaps > 'use.columns' > > And where a value of NULL or FALSE will yield same results as > `unique.data.frame` > > use.columns=key(x) # default behavior > use.columns=c("col1name", "col7name") #etc > use.columns=NULL > > > Thanks as always, > Rick > > > > On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou > wrote: >> >> Hi folks, >> >> I actually want to revisit the fix I made here. >> >> Instead of having `use.key` in the signature to unique.data.table (and >> duplicated.data.table) to be: >> >> function(x, >> incomparables=FALSE, >> tolerance=.Machine$double.eps ^ 0.5, >> use.key=TRUE, ...) >> >> How about we switch out use.key for a parameter that specifies the >> column names to use in the uniqueness check, which defaults to key(x) >> to keep backwards compatibility. >> >> For argument's sake (like that?), lets call this parameter `columns` >> (by.columns? with.columns? whatever) so: >> >> function(x, >> incomparables=FALSE, >> tolerance=.Machine$double.eps ^ 0.5, >> columns=key(x), ...) >> >> Then: >> >> (1) leaving it alone is the backward compatibile behavior; >> (2) Perhaps setting it to NULL will use all columns, and make it >> equivalent to unique.data.frame (also the same when x has no key); and >> (3) setting it to any other combo of columns uses those columns as the >> uniqueness key and filters the rows (only) out of x accordingly. >> >> What do you folks think? Personally I think this is better on all >> accounts then just specifying to use the key or not and the only >> question in my mind is the name of the argument -- happy to hear other >> world views, however, so don't be shy. >> >> Thanks, >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech > > -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From john.kerpel2 at gmail.com Wed Aug 14 16:30:29 2013 From: john.kerpel2 at gmail.com (John Kerpel) Date: Wed, 14 Aug 2013 09:30:29 -0500 Subject: [datatable-help] Writing a good "WHERE" statement Message-ID: Folks: I've been working more and more in data.table and it gets better and better as I learn to use it.... So my question is, in the following example, is the Where statement inefficient because I'm using "==" ? (I just want to use the subsets where days are equal to the values in the statement) Should I do this another way? (Btw, I get exactly the right answer using this approach...) db <- DB4[ Days==1 | Days==5 | Days==10 | Days==20 | Days==30 | Days==60 | Days==90, j = {list(v=m_func(x,y,z))}, by= "Date,Indicator,Days" ] If more detail is necessary, lmk. Date, Indicator, and Days are keys. Best, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Wed Aug 14 16:58:45 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 14 Aug 2013 09:58:45 -0500 Subject: [datatable-help] Writing a good "WHERE" statement In-Reply-To: References: Message-ID: you could use Days %in% c(1,5,10,20,30,60,90) instead, or iff Days is the first key J(c(1,5,10,20,30,60,90)) On Wed, Aug 14, 2013 at 9:30 AM, John Kerpel wrote: > Folks: > > I've been working more and more in data.table and it gets better and > better as I learn to use it.... > > So my question is, in the following example, is the Where statement > inefficient because I'm using "==" ? (I just want to use the subsets where > days are equal to the values in the statement) > > Should I do this another way? 
(Btw, I get exactly the right answer using > this approach...) > > db <- DB4[ > > Days==1 | Days==5 | Days==10 | Days==20 | Days==30 | Days==60 | Days==90, > > j = {list(v=m_func(x,y,z))}, > > by= "Date,Indicator,Days" > > ] > > If more detail is necessary, lmk. Date, Indicator, and Days are keys. > > Best, > > John > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.kerpel2 at gmail.com Wed Aug 14 17:20:42 2013 From: john.kerpel2 at gmail.com (John Kerpel) Date: Wed, 14 Aug 2013 10:20:42 -0500 Subject: [datatable-help] Writing a good "WHERE" statement In-Reply-To: References: Message-ID: Thx Eduard - it's about 5% faster your way according to rbenchmark. Hey, latency is latency....just got a need for speed... Best, John On Wed, Aug 14, 2013 at 9:58 AM, Eduard Antonyan wrote: > you could use Days %in% c(1,5,10,20,30,60,90) instead, or iff Days is the > first key J(c(1,5,10,20,30,60,90)) > > > On Wed, Aug 14, 2013 at 9:30 AM, John Kerpel wrote: > >> Folks: >> >> I've been working more and more in data.table and it gets better and >> better as I learn to use it.... >> >> So my question is, in the following example, is the Where statement >> inefficient because I'm using "==" ? (I just want to use the subsets where >> days are equal to the values in the statement) >> >> Should I do this another way? (Btw, I get exactly the right answer using >> this approach...) >> >> db <- DB4[ >> >> Days==1 | Days==5 | Days==10 | Days==20 | Days==30 | Days==60 | >> Days==90, >> >> j = {list(v=m_func(x,y,z))}, >> >> by= "Date,Indicator,Days" >> >> ] >> >> If more detail is necessary, lmk. Date, Indicator, and Days are keys. >> >> Best, >> >> John >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Wed Aug 14 18:26:05 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 14 Aug 2013 18:26:05 +0200 Subject: [datatable-help] data.table() function regarding Message-ID: Hello, This question comes from a recent SO question on Why is transform.data.table so much slower than transform.data.frame? (http://stackoverflow.com/questions/18216658/why-is-transform-data-table-so-much-slower-than-transform-data-frame) Suppose I've, DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5)) And I want to transform this data.table by adding an extra column z = 1 (I'm aware of the idiomatic way of using :=, but let's keep that aside for the moment), I'd do: transform(DT, z = 1)) However, this is terribly slow. I debugged the code and found out the reason for this slowness. To gist the issue, transform.data.table calls: ans <- do.call("data.table", c(list(`_data`), e[!matched])) which calls data.table() where, the slowness happens here: exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`! Now, the point is, exptxt is only used under one other if-statement, pasted below. 
if (any(novname) && length(exptxt)==length(vnames)) { okexptxt = exptxt[novname] == make.names(exptxt[novname]) vnames[novname][okexptxt] = exptxt[novname][okexptxt] } tt = vnames=="" And this statement is basically useful, for example, if one does: x <- 1:5 y <- 6:10 DT <- data.table(x, y) x y 1: 1 6 2: 2 7 3: 3 8 4: 4 9 5: 5 10 This gives a data.table with column names the same as input variables instead of giving V1 and V2. But, this is what is slowing down do.call("data.table", ...) function. For example, ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1) system.time(do.call("data.table", ll)) # 30 seconds on my mac But, this exptxt <- as.character(tt) and the above mentioned if-statement can be replaced with (with help from data.frame function): for (i in which(novname)) { tmp <- deparse(tt[[i]]) if (tmp == make.names(tmp)) vnames[i] <- tmp } And by replacing with this and running do.call("data.table", ...) takes 0.04 seconds. Also,data.table(x,y) gives the intended result with column names x and y. In essence, by replacing the above mentioned lines, the desired function of data.table remains unchanged while do.call("data.table", ...) is faster (and hence transform and other functions that depend on it). What do you think? To my knowledge, this doesn't seem to break anything else... Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Wed Aug 14 19:18:40 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Wed, 14 Aug 2013 10:18:40 -0700 Subject: [datatable-help] data.table() function regarding In-Reply-To: References: Message-ID: Hi Arun, Thanks for this very detailed analysis! The slowness of transform.data.table is something that's been bugging me for a while but have not had the time to dig into it myself yet, so this is really great. I quickly tried to apply your proposed fix and recompiled/reinstalled data.table. It looks like there are some errors that do pop up after running test.data.table(), but I *think* they are trivial -- I don't have time to investigate further right now, but will do so in due time if Matthew (or you :-) don't be me to it. Thanks again, -steve On Wed, Aug 14, 2013 at 9:26 AM, Arunkumar Srinivasan wrote: > Hello, > > This question comes from a recent SO question on Why is transform.data.table > so much slower than transform.data.frame? > > Suppose I've, > > DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5)) > > And I want to transform this data.table by adding an extra column z = 1 (I'm > aware of the idiomatic way of using :=, but let's keep that aside for the > moment), I'd do: > > transform(DT, z = 1)) > > However, this is terribly slow. I debugged the code and found out the reason > for this slowness. To gist the issue, transform.data.table calls: > > ans <- do.call("data.table", c(list(`_data`), e[!matched])) > > which calls data.table() where, the slowness happens here: > > exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`! > > Now, the point is, exptxt is only used under one other if-statement, pasted > below. 
> > if (any(novname) && length(exptxt)==length(vnames)) { > okexptxt = exptxt[novname] == make.names(exptxt[novname]) > vnames[novname][okexptxt] = exptxt[novname][okexptxt] > } > tt = vnames=="" > > And this statement is basically useful, for example, if one does: > > x <- 1:5 > y <- 6:10 > DT <- data.table(x, y) > x y > 1: 1 6 > 2: 2 7 > 3: 3 8 > 4: 4 9 > 5: 5 10 > > This gives a data.table with column names the same as input variables > instead of giving V1 and V2. > > But, this is what is slowing down do.call("data.table", ...) function. For > example, > > ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1) > system.time(do.call("data.table", ll)) # 30 seconds on my mac > > But, this exptxt <- as.character(tt) and the above mentioned if-statement > can be replaced with (with help from data.frame function): > > for (i in which(novname)) { > tmp <- deparse(tt[[i]]) > if (tmp == make.names(tmp)) > vnames[i] <- tmp > } > > And by replacing with this and running do.call("data.table", ...) takes 0.04 > seconds. Also,data.table(x,y) gives the intended result with column names x > and y. > > In essence, by replacing the above mentioned lines, the desired function of > data.table remains unchanged while do.call("data.table", ...) is faster (and > hence transform and other functions that depend on it). > > What do you think? To my knowledge, this doesn't seem to break anything > else... > > Arun > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From lianoglou.steve at gene.com Wed Aug 14 19:24:08 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Wed, 14 Aug 2013 10:24:08 -0700 Subject: [datatable-help] data.table() function regarding In-Reply-To: References: Message-ID: In fact, we already had a ticket on the tracker for this, so just updating this (and that) with this thread: https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=2599 On Wed, Aug 14, 2013 at 10:18 AM, Steve Lianoglou wrote: > Hi Arun, > > Thanks for this very detailed analysis! > > The slowness of transform.data.table is something that's been bugging > me for a while but have not had the time to dig into it myself yet, so > this is really great. > > I quickly tried to apply your proposed fix and recompiled/reinstalled > data.table. It looks like there are some errors that do pop up after > running test.data.table(), but I *think* they are trivial -- I don't > have time to investigate further right now, but will do so in due time > if Matthew (or you :-) don't be me to it. > > Thanks again, > -steve > > > On Wed, Aug 14, 2013 at 9:26 AM, Arunkumar Srinivasan > wrote: >> Hello, >> >> This question comes from a recent SO question on Why is transform.data.table >> so much slower than transform.data.frame? >> >> Suppose I've, >> >> DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5)) >> >> And I want to transform this data.table by adding an extra column z = 1 (I'm >> aware of the idiomatic way of using :=, but let's keep that aside for the >> moment), I'd do: >> >> transform(DT, z = 1)) >> >> However, this is terribly slow. I debugged the code and found out the reason >> for this slowness. 
To gist the issue, transform.data.table calls: >> >> ans <- do.call("data.table", c(list(`_data`), e[!matched])) >> >> which calls data.table() where, the slowness happens here: >> >> exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`! >> >> Now, the point is, exptxt is only used under one other if-statement, pasted >> below. >> >> if (any(novname) && length(exptxt)==length(vnames)) { >> okexptxt = exptxt[novname] == make.names(exptxt[novname]) >> vnames[novname][okexptxt] = exptxt[novname][okexptxt] >> } >> tt = vnames=="" >> >> And this statement is basically useful, for example, if one does: >> >> x <- 1:5 >> y <- 6:10 >> DT <- data.table(x, y) >> x y >> 1: 1 6 >> 2: 2 7 >> 3: 3 8 >> 4: 4 9 >> 5: 5 10 >> >> This gives a data.table with column names the same as input variables >> instead of giving V1 and V2. >> >> But, this is what is slowing down do.call("data.table", ...) function. For >> example, >> >> ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1) >> system.time(do.call("data.table", ll)) # 30 seconds on my mac >> >> But, this exptxt <- as.character(tt) and the above mentioned if-statement >> can be replaced with (with help from data.frame function): >> >> for (i in which(novname)) { >> tmp <- deparse(tt[[i]]) >> if (tmp == make.names(tmp)) >> vnames[i] <- tmp >> } >> >> And by replacing with this and running do.call("data.table", ...) takes 0.04 >> seconds. Also,data.table(x,y) gives the intended result with column names x >> and y. >> >> In essence, by replacing the above mentioned lines, the desired function of >> data.table remains unchanged while do.call("data.table", ...) is faster (and >> hence transform and other functions that depend on it). >> >> What do you think? To my knowledge, this doesn't seem to break anything >> else... >> >> Arun >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From aragorn168b at gmail.com Wed Aug 14 22:10:58 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 14 Aug 2013 22:10:58 +0200 Subject: [datatable-help] data.table() function regarding In-Reply-To: References: Message-ID: <6622ED171BEC461C90C4EC4DE66FAEFA@gmail.com> Dear Steve, Yes, the issue is with `deparse`. By default, it'll split the converted character string at every 60 characters? An easy fix is to just take the first element of `deparse` result (just like how `data.frame` does it): That is, the function should be: for (i in which(novname)) { tmp <- deparse(tt[[i]])[1] if (tmp == make.names(tmp)) vnames[i] <- tmp } If the `[1]` was not present then we are comparing a vector of > 1 elements to another? Hence the warning. With this change, I did a `R CMD CHECK ` and the test finished without errors. But it says at the end "there was 1 warning". Here's the last few lines from the check: * checking tests ... Running ?autoprint.R? Comparing ?autoprint.Rout? to ?autoprint.Rout.save? ... OK Running ?test-all.R? Running ?tests.R? OK * checking for unstated dependencies in vignettes ... OK * checking package vignettes in ?inst/doc? ... OK * checking running R code from vignettes ... ?datatable-faq.Rnw? ... OK ?datatable-intro.Rnw? ... OK ?datatable-timings.Rnw? ... 
OK OK * checking re-building of vignette PDFs ... OK * checking PDF version of manual ... OK WARNING: There was 1 warning. NOTE: There was 1 note. Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu Aug 15 02:30:19 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 14 Aug 2013 17:30:19 -0700 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Hi all, As I needed this sooner than I had expected, I just committed this change. It's in svn revision 889. I chose 'by.columns' as the parameter names -- seemed to make more sense to me, and using the short hand interactively saves a letter, eg: unique(dt, by=c('some', 'columns')) ;-) Here's the note from the NEWS file: o "Uniqueness" tests can now specify arbirtray combinations of columns to use to test for duplicates. `by.columns` parameter added to unique.data.table and duplicated.data.table. This allows the user to test for uniqueness using any combination of columns in the data.table, where previously the user only had the option to use the keyed columns (if keyed) or all columns (if not). The default behavior sets `by.columns=key(dt)` to maintain backward compatability. See man/duplicated.Rd and tests 986:991 for more information. Thanks to Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful discussions. Should work as advertised assuming my unit tests weren't too simplistic. Cheers, -steve On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou wrote: > Thanks for the suggestions, folks. > > Matthew: do you have a preference? > > -steve > > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta > wrote: >> Steve, >> >> I like your suggestion a lot. I can see putting column specification to >> good use. >> >> As for the argument name, perhaps >> 'use.columns' >> >> And where a value of NULL or FALSE will yield same results as >> `unique.data.frame` >> >> use.columns=key(x) # default behavior >> use.columns=c("col1name", "col7name") #etc >> use.columns=NULL >> >> >> Thanks as always, >> Rick >> >> >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou >> wrote: >>> >>> Hi folks, >>> >>> I actually want to revisit the fix I made here. >>> >>> Instead of having `use.key` in the signature to unique.data.table (and >>> duplicated.data.table) to be: >>> >>> function(x, >>> incomparables=FALSE, >>> tolerance=.Machine$double.eps ^ 0.5, >>> use.key=TRUE, ...) >>> >>> How about we switch out use.key for a parameter that specifies the >>> column names to use in the uniqueness check, which defaults to key(x) >>> to keep backwards compatibility. >>> >>> For argument's sake (like that?), lets call this parameter `columns` >>> (by.columns? with.columns? whatever) so: >>> >>> function(x, >>> incomparables=FALSE, >>> tolerance=.Machine$double.eps ^ 0.5, >>> columns=key(x), ...) >>> >>> Then: >>> >>> (1) leaving it alone is the backward compatibile behavior; >>> (2) Perhaps setting it to NULL will use all columns, and make it >>> equivalent to unique.data.frame (also the same when x has no key); and >>> (3) setting it to any other combo of columns uses those columns as the >>> uniqueness key and filters the rows (only) out of x accordingly. >>> >>> What do you folks think? 
Personally I think this is better on all >>> accounts than just specifying to use the key or not, and the only >>> question in my mind is the name of the argument -- happy to hear other >>> world views, however, so don't be shy. >>> >>> Thanks, >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Computational Biologist >>> Bioinformatics and Computational Biology >>> Genentech >> >> > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From FErickson at psu.edu Thu Aug 15 17:49:10 2013 From: FErickson at psu.edu (Frank Erickson) Date: Thu, 15 Aug 2013 10:49:10 -0500 Subject: [datatable-help] Is there a way to merge-assign on something other than the key? Message-ID: Hi, I really like the DT1[DT2,z:=...] idiom. Unfortunately, the value of a merge() on other columns is a new data.table, so modifying DT1, as in merge(DT1,DT2,by=...)[,z:=...], is not possible. Or is there actually a way to do this that I am missing? If this syntax -- DT1[DT2,z:=...,by=c(key(DT1),x)] -- behaved differently, allowing the "by" to determine which columns were merged on, that would solve my issue, I guess. By the way, when you use by=c(key(DT),x), you get a speedup from DT's being keyed, right? *Some background* (rest of email): I've coded a value function iteration rather inefficiently and am looking into a few different directions for improving it. Efficiency matters because the result will enter a likelihood I need to maximize. (Value function iteration is solving a dynamic programming problem with discrete periods and a finite horizon from the horizon/final period backwards.) I was alternating between two data tables and doing things like DT1[DT2[y==t],z:=...] and DT2[DT1[y==t-1],q:=...], changing the keys on both tables before each merge-assign. If secondary keys were implemented, I'd have just gone with that. (Tom's secondary key method, mentioned in the last link, only works for subsetting, not merge-assigning, as far as I can tell.) I think I'm getting a big slow-down because I'm rekeying four times per iteration (two tables x two merge-assigns) and because I'm rekeying the entire table when I'm only assigning to a subset. That second problem is easier to fix, I guess. Now I am considering making a single DT with key=intersect(key(DT1),key(DT2)) and using that instead, if I can figure out a way to do what I need to with it. Thanks, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Thu Aug 15 18:23:53 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Thu, 15 Aug 2013 09:23:53 -0700 Subject: [datatable-help] Is there a way to merge-assign on something other than the key? In-Reply-To: References: Message-ID: Hi Frank, I don't have time to get into the details of your entire question right now, but: On Thu, Aug 15, 2013 at 8:49 AM, Frank Erickson wrote: > Hi, > > I really like the DT1[DT2,z:=...] idiom. Unfortunately, the value of a > merge() on other columns is a new data.table, so modifying DT1, as in > merge(DT1,DT2,by=...)[,z:=...], is not possible. Or is there actually a way > to do this that I am missing? I just wanted to mention that unless I am misunderstanding what you want, this is entirely possible, and the way you suggest it might work is actually the way to do it.
Consider: R> dt1 <- data.table(a=sample(letters[1:2], 5, rep=T), b=runif(5), key='a') R> dt2 <- data.table(a=c('a', 'b'), c=rnorm(2), key='a') R> dt1 a b 1: a 0.02517147 2: a 0.85459776 3: a 0.67472168 4: a 0.89684769 5: b 0.11619613 R> dt2 a c 1: a -0.07817539 2: b -1.28897689 R> out <- merge(dt1, dt2)[, d := b + c] a b c d 1: a 0.02517147 -0.07817539 -0.05300392 2: a 0.85459776 -0.07817539 0.77642237 3: a 0.67472168 -0.07817539 0.59654629 4: a 0.89684769 -0.07817539 0.81867230 5: b 0.11619613 -1.28897689 -1.17278075 Will come back later to look through the rest of your question if still necessary. HTH, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From FErickson at psu.edu Thu Aug 15 18:29:07 2013 From: FErickson at psu.edu (Frank Erickson) Date: Thu, 15 Aug 2013 11:29:07 -0500 Subject: [datatable-help] Is there a way to merge-assign on something other than the key? In-Reply-To: References: Message-ID: Thanks, Steve. I was thinking that that approach spends too much memory or computational time by making a copy (while I was hoping to edit dt1 in place). If I'm wrong, maybe that's what I ought to do. If not, I look forward to hearing your other thoughts. --Frank On Thu, Aug 15, 2013 at 11:23 AM, Steve Lianoglou wrote: > Hi Frank, > > I don't have time to get into the details of your entire question > right now, but: > > On Thu, Aug 15, 2013 at 8:49 AM, Frank Erickson wrote: > > Hi, > > > > I really like the DT1[DT2,z:=...] idiom. Unfortunately, the value of a > > merge() on other columns is a new data.table, so modifying DT1, as in > > merge(DT1,DT2,by=...)[,z:=...], is not possible. Or is there actually a > way > > to do this that I am missing? > > I just wanted to mention that unless I am misunderstanding what you > want, this is entirely possible, and the way you suggest it might work > is actually the way to do it. > > Consider: > > R> dt1 <- data.table(a=sample(letters[1:2], 5, rep=T), b=runif(5), key='a') > R> dt2 <- data.table(a=c('a', 'b'), c=rnorm(2), key='a') > > R> dt1 > a b > 1: a 0.02517147 > 2: a 0.85459776 > 3: a 0.67472168 > 4: a 0.89684769 > 5: b 0.11619613 > > R> dt2 > a c > 1: a -0.07817539 > 2: b -1.28897689 > > R> out <- merge(dt1, dt2)[, d := b + c] > a b c d > 1: a 0.02517147 -0.07817539 -0.05300392 > 2: a 0.85459776 -0.07817539 0.77642237 > 3: a 0.67472168 -0.07817539 0.59654629 > 4: a 0.89684769 -0.07817539 0.81867230 > 5: b 0.11619613 -1.28897689 -1.17278075 > > Will come back later to look through the rest of your question if > still necessary. > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu Aug 15 22:57:08 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 15 Aug 2013 22:57:08 +0200 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: <3C8A798C16894DAC9D125E2D469E2EA6@gmail.com> Awesome! Thanks, Steve. Arun On Thursday, August 15, 2013 at 2:30 AM, Steve Lianoglou wrote: > Hi all, > > As I needed this sooner than I had expected, I just committed this > change. It's in svn revision 889.
> > I chose 'by.columns' as the parameter name -- seemed to make more > sense to me, and using the short hand interactively saves a letter, > eg: unique(dt, by=c('some', 'columns')) ;-) > > Here's the note from the NEWS file: > > o "Uniqueness" tests can now specify arbitrary combinations of > columns to use to test for duplicates. `by.columns` parameter added to > unique.data.table and duplicated.data.table. This allows the user to > test for uniqueness using any combination of columns in the > data.table, where previously the user only had the option to use the > keyed columns (if keyed) or all columns (if not). The default behavior > sets `by.columns=key(dt)` to maintain backward compatibility. See > man/duplicated.Rd and tests 986:991 for more information. Thanks to > Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful > discussions. > > Should work as advertised assuming my unit tests weren't too simplistic. > > Cheers, > > -steve > > > > > On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou > wrote: > > Thanks for the suggestions, folks. > > > > Matthew: do you have a preference? > > > > -steve > > > > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta > > wrote: > > > Steve, > > > > > > I like your suggestion a lot. I can see putting column specification to > > > good use. > > > > > > As for the argument name, perhaps > > > 'use.columns' > > > > > > And where a value of NULL or FALSE will yield the same results as > > > `unique.data.frame` > > > > > > use.columns=key(x) # default behavior > > > use.columns=c("col1name", "col7name") #etc > > > use.columns=NULL > > > > > > > > > Thanks as always, > > > Rick > > > > > > > > > > > > On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou > > > wrote: > > > > > > > > Hi folks, > > > > > > > > I actually want to revisit the fix I made here. > > > > > > > > Instead of having `use.key` in the signature of unique.data.table (and > > > > duplicated.data.table) be: > > > > > > > > function(x, > > > > incomparables=FALSE, > > > > tolerance=.Machine$double.eps ^ 0.5, > > > > use.key=TRUE, ...) > > > > > > > > How about we switch out use.key for a parameter that specifies the > > > > column names to use in the uniqueness check, which defaults to key(x) > > > > to keep backwards compatibility. > > > > > > > > For argument's sake (like that?), let's call this parameter `columns` > > > > (by.columns? with.columns? whatever) so: > > > > > > > > function(x, > > > > incomparables=FALSE, > > > > tolerance=.Machine$double.eps ^ 0.5, > > > > columns=key(x), ...) > > > > > > > > Then: > > > > > > > > (1) leaving it alone is the backward compatible behavior; > > > > (2) Perhaps setting it to NULL will use all columns, and make it > > > > equivalent to unique.data.frame (also the same when x has no key); and > > > > (3) setting it to any other combo of columns uses those columns as the > > > > uniqueness key and filters the rows (only) out of x accordingly. > > > > > > > > What do you folks think? Personally I think this is better on all > > > > accounts than just specifying to use the key or not and the only > > > > question in my mind is the name of the argument -- happy to hear other > > > > world views, however, so don't be shy.
> > > > > > > > Thanks, > > > > -steve > > > > > > > > -- > > > > Steve Lianoglou > > > > Computational Biologist > > > > Bioinformatics and Computational Biology > > > > Genentech > > > > > > > > > > > > > > > > -- > > Steve Lianoglou > > Computational Biologist > > Bioinformatics and Computational Biology > > Genentech > > > > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu Aug 15 22:57:44 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 15 Aug 2013 22:57:44 +0200 Subject: [datatable-help] data.table() function regarding In-Reply-To: <6622ED171BEC461C90C4EC4DE66FAEFA@gmail.com> References: <6622ED171BEC461C90C4EC4DE66FAEFA@gmail.com> Message-ID: <6CF64E6AF5044038B611ECC94F4DCF42@gmail.com> Steve, Did you have some time to look into the edited loop (and at the test results)? Arun On Wednesday, August 14, 2013 at 10:10 PM, Arunkumar Srinivasan wrote: > Dear Steve, > > Yes, the issue is with `deparse`. By default, it'll split the converted character string at every 60 characters. An easy fix is to just take the first element of `deparse`'s result (just like how `data.frame` does it): > That is, the function should be: > > for (i in which(novname)) { > tmp <- deparse(tt[[i]])[1] > if (tmp == make.names(tmp)) > vnames[i] <- tmp > } > > > If the `[1]` were not present, then we would be comparing a vector of > 1 elements to another; hence the warning. With this change, I did an `R CMD check` and the tests finished without errors. But it says at the end "there was 1 warning". Here are the last few lines from the check: > > * checking tests ... > Running 'autoprint.R' > Comparing 'autoprint.Rout' to 'autoprint.Rout.save' ... OK > Running 'test-all.R' > Running 'tests.R' > OK > * checking for unstated dependencies in vignettes ... OK > * checking package vignettes in 'inst/doc' ... OK > * checking running R code from vignettes ... > 'datatable-faq.Rnw' ... OK > 'datatable-intro.Rnw' ... OK > 'datatable-timings.Rnw' ... OK > OK > * checking re-building of vignette PDFs ... OK > * checking PDF version of manual ... OK > > WARNING: There was 1 warning. > NOTE: There was 1 note. > > > > Arun > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Thu Aug 15 23:11:06 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Thu, 15 Aug 2013 14:11:06 -0700 Subject: [datatable-help] data.table() function regarding In-Reply-To: <6CF64E6AF5044038B611ECC94F4DCF42@gmail.com> References: <6622ED171BEC461C90C4EC4DE66FAEFA@gmail.com> <6CF64E6AF5044038B611ECC94F4DCF42@gmail.com> Message-ID: Hi Arun, On Thu, Aug 15, 2013 at 1:57 PM, Arunkumar Srinivasan wrote: > Steve, > Did you have some time to look into the edited loop (and at the test > results)? Sorry, I haven't. Truth be told, my eyes typically tend to gloss over when I start spiraling down into code with lots of deparse, substitute, quote, etc. stuff in it :-) It's going to take me a few days before I get the time to sit down and look. If the tests run clean for you, though, things are likely very promising.
Sorry for the delay, -steve > > Arun > > On Wednesday, August 14, 2013 at 10:10 PM, Arunkumar Srinivasan wrote: > > Dear Steve, > > Yes, the issue is with `deparse`. By default, it'll split the converted > character string at every 60 characters. An easy fix is to just take the > first element of `deparse`'s result (just like how `data.frame` does it): > That is, the function should be: > > for (i in which(novname)) { > tmp <- deparse(tt[[i]])[1] > if (tmp == make.names(tmp)) > vnames[i] <- tmp > } > > If the `[1]` were not present, then we would be comparing a vector of > 1 elements > to another; hence the warning. With this change, I did an `R CMD check` > and the tests finished without errors. But it says at the end "there was 1 > warning". Here are the last few lines from the check: > > * checking tests ... > Running 'autoprint.R' > Comparing 'autoprint.Rout' to 'autoprint.Rout.save' ... OK > Running 'test-all.R' > Running 'tests.R' > OK > * checking for unstated dependencies in vignettes ... OK > * checking package vignettes in 'inst/doc' ... OK > * checking running R code from vignettes ... > 'datatable-faq.Rnw' ... OK > 'datatable-intro.Rnw' ... OK > 'datatable-timings.Rnw' ... OK > OK > * checking re-building of vignette PDFs ... OK > * checking PDF version of manual ... OK > > WARNING: There was 1 warning. > NOTE: There was 1 note. > > > Arun > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From saporta at scarletmail.rutgers.edu Fri Aug 16 06:35:27 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 16 Aug 2013 00:35:27 -0400 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Steve, great stuff!! thanks for making that happen Rick On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi all, > > As I needed this sooner than I had expected, I just committed this > change. It's in svn revision 889. > > I chose 'by.columns' as the parameter name -- seemed to make more > sense to me, and using the short hand interactively saves a letter, > eg: unique(dt, by=c('some', 'columns')) ;-) > > Here's the note from the NEWS file: > > o "Uniqueness" tests can now specify arbitrary combinations of > columns to use to test for duplicates. `by.columns` parameter added to > unique.data.table and duplicated.data.table. This allows the user to > test for uniqueness using any combination of columns in the > data.table, where previously the user only had the option to use the > keyed columns (if keyed) or all columns (if not). The default behavior > sets `by.columns=key(dt)` to maintain backward compatibility. See > man/duplicated.Rd and tests 986:991 for more information. Thanks to > Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful > discussions. > > Should work as advertised assuming my unit tests weren't too simplistic. > > Cheers, > > -steve > > > > > On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou > wrote: > > Thanks for the suggestions, folks. > > > > Matthew: do you have a preference?
> > -steve > > > > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta > > wrote: > >> Steve, > >> > >> I like your suggestion a lot. I can see putting column specification to > >> good use. > >> > >> As for the argument name, perhaps > >> 'use.columns' > >> > >> And where a value of NULL or FALSE will yield the same results as > >> `unique.data.frame` > >> > >> use.columns=key(x) # default behavior > >> use.columns=c("col1name", "col7name") #etc > >> use.columns=NULL > >> > >> > >> Thanks as always, > >> Rick > >> > >> > >> > >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou > >> wrote: > >>> > >>> Hi folks, > >>> > >>> I actually want to revisit the fix I made here. > >>> > >>> Instead of having `use.key` in the signature of unique.data.table (and > >>> duplicated.data.table) be: > >>> > >>> function(x, > >>> incomparables=FALSE, > >>> tolerance=.Machine$double.eps ^ 0.5, > >>> use.key=TRUE, ...) > >>> > >>> How about we switch out use.key for a parameter that specifies the > >>> column names to use in the uniqueness check, which defaults to key(x) > >>> to keep backwards compatibility. > >>> > >>> For argument's sake (like that?), let's call this parameter `columns` > >>> (by.columns? with.columns? whatever) so: > >>> > >>> function(x, > >>> incomparables=FALSE, > >>> tolerance=.Machine$double.eps ^ 0.5, > >>> columns=key(x), ...) > >>> > >>> Then: > >>> > >>> (1) leaving it alone is the backward compatible behavior; > >>> (2) Perhaps setting it to NULL will use all columns, and make it > >>> equivalent to unique.data.frame (also the same when x has no key); and > >>> (3) setting it to any other combo of columns uses those columns as the > >>> uniqueness key and filters the rows (only) out of x accordingly. > >>> > >>> What do you folks think? Personally I think this is better on all > >>> accounts than just specifying to use the key or not and the only > >>> question in my mind is the name of the argument -- happy to hear other > >>> world views, however, so don't be shy. > >>> > >>> Thanks, > >>> -steve > >>> > >>> -- > >>> Steve Lianoglou > >>> Computational Biologist > >>> Bioinformatics and Computational Biology > >>> Genentech > >> > >> > > > > > > > > -- > > Steve Lianoglou > > Computational Biologist > > Bioinformatics and Computational Biology > > Genentech > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... URL: From smartpink111 at yahoo.com Fri Aug 16 06:52:48 2013 From: smartpink111 at yahoo.com (arun) Date: Thu, 15 Aug 2013 21:52:48 -0700 (PDT) Subject: [datatable-help] Slow execution: Extracting last value in each group Message-ID: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> HI, This is a follow up from a post in R-help mailing list. (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html). In short, I tried the below using data.table(), but found to be slower than some of the other methods. Steve Lianoglou also tried the same and got it much faster (system.time()~ 0.070 vs. ~40 ).
###data dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010", "06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L, 1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4, 136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75, 136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75, 136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55, 136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35, 136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8, 136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L, 6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date", "Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA, -11L)) indx<- rep(1:nrow(dat1),1e5) dat2<- dat1[indx,] dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y") dat2<- dat2[order(dat2[,1],dat2[,2]),] row.names(dat2)<-1:nrow(dat2) #Some speed comparisons (more in the link): system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,]) # user system elapsed # 0.528 0.012 0.540 system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),]) # user system elapsed # 0.156 0.000 0.155 library(data.table) system.time({ dt1 <- data.table(dat2, key=c('Date', 'Time')) ans <- dt1[, .SD[.N], by='Date']}) # user system elapsed #39.860 0.020 39.952
From aragorn168b at gmail.com Fri Aug 16 08:27:34 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 16 Aug 2013 08:27:34 +0200 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> Message-ID: <26A5E42C45874B66B3E98F3A40C69916@gmail.com> Sorry, but I'm not sure what your question is here. There seems to be different timings between you and Steve. You want to get it verified as to which one is true? On my system, Steve's takes 0.003 seconds. However, a *faster* version than Steve's solution (on bigger data) would be: x[x[, .I[.N], by='Date']$V1] Arun On Friday, August 16, 2013 at 6:52 AM, arun wrote: > HI, > This is a follow up from a post in R-help mailing list. (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html). > > > In short, I tried the below using data.table(), but found to be slower than some of the other methods. Steve Lianoglou also tried the same and got it much faster (system.time()~ 0.070 vs. ~40 ). > > ###data > > dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010", > "06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010", > "06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L, > 1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4, > 136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75, > 136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75, > 136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55, > 136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35, > 136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8, > 136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L, > 6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date", > "Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA, > -11L)) > > > indx<- rep(1:nrow(dat1),1e5) > dat2<- dat1[indx,] > dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y") > dat2<- dat2[order(dat2[,1],dat2[,2]),] > row.names(dat2)<-1:nrow(dat2) > > > > #Some speed comparisons (more in the link): > system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,]) > # user system elapsed > # 0.528 0.012 0.540 > system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),]) > # user system elapsed > # 0.156 0.000 0.155 > > > library(data.table) > system.time({ > dt1 <- data.table(dat2, key=c('Date', 'Time')) > ans <- dt1[, .SD[.N], by='Date']}) > > # user system elapsed > #39.860 0.020 39.952 #############slower than many other methods > ans1<- as.data.frame(ans) > row.names(ans1)<- row.names(res7) > attr(ans1,"row.names")<- attr(res7,"row.names") > identical(ans1,res7) > #[1] TRUE > > > > > Steve Lianoglou reply is below: > ############################ > > > Amazing. 
This is what I get on my MacBook Pro, i7 @ 3GHz (very close > specs to your machine): > > R> dt1 <- data.table(dat2, key=c('Date', 'Time')) > R> system.time(ans <- dt1[, .SD[.N], by='Date']) > user system elapsed > 0.064 0.009 0.073 ########################### > > R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),]) > user system elapsed > 0.148 0.016 0.165 > > On one of our compute server running who knows what processor on some > version of linux, but shouldn't really matter as we're talking > relative time to each other here: > > R> system.time(ans <- dt1[, .SD[.N], by='Date']) > user system elapsed > 0.160 0.012 0.170 > > R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),]) > user system elapsed > 0.292 0.004 0.294 > ############################################## > > My sessionInfo####### > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) (linux mint 15) > > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.8 stringr_0.6.2 reshape2_1.2.2 > > loaded via a namespace (and not attached): > [1] plyr_1.8 tools_3.0.1 > > CPU #################### > I use Dell XPS L502X > * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core ) > * Memory 6 GB / 8 GB (max) > * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm > > Any help will be appreciated. > Thanks. > A.K. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Fri Aug 16 09:01:49 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Fri, 16 Aug 2013 00:01:49 -0700 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: <26A5E42C45874B66B3E98F3A40C69916@gmail.com> References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <26A5E42C45874B66B3E98F3A40C69916@gmail.com> Message-ID: Hi Arun, On Thu, Aug 15, 2013 at 11:27 PM, Arunkumar Srinivasan wrote: > Sorry, but I'm not sure what your question is here. There seems to be > different timings between you and Steve. You want to get it verified as to > which one is true? On my system, Steve's takes 0.003 seconds. Actually, the issue was that (as far as I could tell) his code and my code are exactly the same, but it runs orders of magnitude slower on his machine than anywhere else I could test. It doesn't make any sense -- perhaps I'm not looking close enough, but I suggested he send it here so more eyes could see it, because I'm stumped as to why/how that could happen. > However, a *faster* version than Steve's solution (on bigger data) would be: > > x[x[, .I[.N], by='Date']$V1] Hah! 
Well done ;-) -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From aragorn168b at gmail.com Fri Aug 16 09:07:03 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 16 Aug 2013 09:07:03 +0200 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <26A5E42C45874B66B3E98F3A40C69916@gmail.com> Message-ID: <02A43A8DCEAB4D5CA4BC1C61D73269A1@gmail.com> Steve, Thank you. arun, Could you run it with `microbenchmark` instead of system.time (with times = 100 or so) and paste the results here? Also, maybe you could use debugonce(data.table:::`[.data.table`) and then run x[, .SD[.N], by='Date'] to go step by step to find out the line that causes the lag, perhaps? Arun On Friday, August 16, 2013 at 9:01 AM, Steve Lianoglou wrote: > Hi Arun, > > On Thu, Aug 15, 2013 at 11:27 PM, Arunkumar Srinivasan > wrote: > > Sorry, but I'm not sure what your question is here. There seems to be > > different timings between you and Steve. You want to get it verified as to > > which one is true? On my system, Steve's takes 0.003 seconds. > > > > > Actually, the issue was that (as far as I could tell) his code and my > code are exactly the same, but it runs orders of magnitude slower on > his machine than anywhere else I could test. > > It doesn't make any sense -- perhaps I'm not looking close enough, but > I suggested he send it here so more eyes could see it, because I'm > stumped as to why/how that could happen. > > > However, a *faster* version than Steve's solution (on bigger data) would be: > > > > x[x[, .I[.N], by='Date']$V1] > > Hah! Well done ;-) > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri Aug 16 12:34:44 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 16 Aug 2013 05:34:44 -0500 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: <02A43A8DCEAB4D5CA4BC1C61D73269A1@gmail.com> References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <26A5E42C45874B66B3E98F3A40C69916@gmail.com> <02A43A8DCEAB4D5CA4BC1C61D73269A1@gmail.com> Message-ID: I get similar timings to arun, with the data.table call being a lot slower than the other timings. If data.table is not optimized for that .SD expression, perhaps that is okay because, as Arun pointed out, there are alternatives.. I can't guess why it would perform differently on different hardware, though... # alternatives: a <- dt1[dt1[, .I[.N], by='Date']$V1] b <- dt1[J(unique(Date)),,mult='last'] # a little slower d <- dt1[, .SD[.N], by='Date'] # 600x slower; it would take ages to benchmark identical(a,b) # true identical(a,d) # false identical(as.data.frame(d),as.data.frame(a)) # true --Frank -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aragorn168b at gmail.com Fri Aug 16 12:37:21 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 16 Aug 2013 12:37:21 +0200 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <26A5E42C45874B66B3E98F3A40C69916@gmail.com> <02A43A8DCEAB4D5CA4BC1C61D73269A1@gmail.com> Message-ID: <9E4F228B7B0041BE9AD2428DCA0E5D80@gmail.com> Frank, Is it a windows machine as well? And could you try to use `debugonce` to find out the line(s) where it's slow? Arun On Friday, August 16, 2013 at 12:34 PM, Frank Erickson wrote: > I get similar timings to arun, with the data.table call being a lot slower than the other timings. If data.table is not optimized for that .SD expression, perhaps that is okay because, as Arun pointed out, there are alternatives.. I can't guess why it would perform differently on different hardware, though... > > # alternatives: > a <- dt1[dt1[, .I[.N], by='Date']$V1] > b <- dt1[J(unique(Date)),,mult='last'] # a little slower > d <- dt1[, .SD[.N], by='Date'] # 600x slower; it would take ages to benchmark > identical(a,b) # true > identical(a,d) # false > identical(as.data.frame(d),as.data.frame(a)) # true > > --Frank > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri Aug 16 15:43:52 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 16 Aug 2013 08:43:52 -0500 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: <9E4F228B7B0041BE9AD2428DCA0E5D80@gmail.com> References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <26A5E42C45874B66B3E98F3A40C69916@gmail.com> <02A43A8DCEAB4D5CA4BC1C61D73269A1@gmail.com> <9E4F228B7B0041BE9AD2428DCA0E5D80@gmail.com> Message-ID: Hi Arun, Yup, windows (see below). I tried debugonce, but didn't really know what I was looking for. Every step was instantaneous except this one: debug: ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, grporder, o__, f__, len__, jsub, SDenv, cols, newnames, verbose) --Frank sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rbenchmark_1.0.0 data.table_1.8.8 loaded via a namespace (and not attached): [1] tools_3.0.1 On Fri, Aug 16, 2013 at 5:37 AM, Arunkumar Srinivasan wrote: > Frank, > Is it a windows machine as well? > And could you try to use `debugonce` to find out the line(s) where it's > slow? > > Arun > > On Friday, August 16, 2013 at 12:34 PM, Frank Erickson wrote: > > I get similar timings to arun, with the data.table call being a lot slower > than the other timings. If data.table is not optimized for that .SD > expression, perhaps that is okay because, as Arun pointed out, there are > alternatives.. I can't guess why it would perform differently on different > hardware, though... 
> > # alternatives: > a <- dt1[dt1[, .I[.N], by='Date']$V1] > b <- dt1[J(unique(Date)),,mult='last'] # a little slower > d <- dt1[, .SD[.N], by='Date'] # 600x slower; it would take ages to > benchmark > identical(a,b) # true > identical(a,d) # false > identical(as.data.frame(d),as.data.frame(a)) # true > > --Frank > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Aug 16 15:47:07 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 16 Aug 2013 15:47:07 +0200 Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <26A5E42C45874B66B3E98F3A40C69916@gmail.com> <02A43A8DCEAB4D5CA4BC1C61D73269A1@gmail.com> <9E4F228B7B0041BE9AD2428DCA0E5D80@gmail.com> Message-ID: <00D3C25C596D4437BE92322A1B5D1ACE@gmail.com> Frank, Great, thank you. So, basically it's the call to "C" that's taking the time.. Probably version of C? I still have trouble using gdb with R. Can't help much to debug there. Hopefully someone else could lend a hand. Arun On Friday, August 16, 2013 at 3:43 PM, Frank Erickson wrote: > Hi Arun, > > Yup, windows (see below). > > I tried debugonce, but didn't really know what I was looking for. Every step was instantaneous except this one: > > debug: ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, grporder, > o__, f__, len__, jsub, SDenv, cols, newnames, verbose) > > > --Frank > > sessionInfo() > > R version 3.0.1 (2013-05-16) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] rbenchmark_1.0.0 data.table_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_3.0.1 > > > > > On Fri, Aug 16, 2013 at 5:37 AM, Arunkumar Srinivasan wrote: > > Frank, > > Is it a windows machine as well? > > And could you try to use `debugonce` to find out the line(s) where it's slow? > > > > Arun > > > > > > On Friday, August 16, 2013 at 12:34 PM, Frank Erickson wrote: > > > > > I get similar timings to arun, with the data.table call being a lot slower than the other timings. If data.table is not optimized for that .SD expression, perhaps that is okay because, as Arun pointed out, there are alternatives.. I can't guess why it would perform differently on different hardware, though... > > > > > > # alternatives: > > > a <- dt1[dt1[, .I[.N], by='Date']$V1] > > > b <- dt1[J(unique(Date)),,mult='last'] # a little slower > > > d <- dt1[, .SD[.N], by='Date'] # 600x slower; it would take ages to benchmark > > > identical(a,b) # true > > > identical(a,d) # false > > > identical(as.data.frame(d),as.data.frame(a)) # true > > > > > > --Frank > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From smartpink111 at yahoo.com Fri Aug 16 17:02:16 2013 From: smartpink111 at yahoo.com (arun) Date: Fri, 16 Aug 2013 08:02:16 -0700 (PDT) Subject: [datatable-help] Slow execution: Extracting last value in each group In-Reply-To: <2A8B8745EDD4482D8FFFBE7AEE68B633@gmail.com> References: <1376628768.74473.YahooMailNeo@web142602.mail.bf1.yahoo.com> <2A8B8745EDD4482D8FFFBE7AEE68B633@gmail.com> Message-ID: <1376665336.19445.YahooMailNeo@web142605.mail.bf1.yahoo.com> Hi Arun, Frank & Steve, Thanks for responding to my post. I did the 'microbenchmark' and 'debugonce(....)' using the same dataset 'dat2'. f1 <- function (dataFrame) { dataFrame[unlist(with(dataFrame, tapply(Time, list(Date), FUN = function(x) x == max(x)))), ] } f2 <- function (dataFrame) { dataFrame[cumsum(with(dataFrame, tapply(Time, list(Date), FUN = which.max))), ] } isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE) f3 <- function(dataFrame) { dataFrame[ isLastInRun(dataFrame$Date), ] } f4<- function(dataFrame){ dataFrame[as.logical(with(dataFrame,ave(Time,Date,FUN=function(x) x==max(x)))),] } f5<- function(dataFrame){ dataFrame[cumsum(rle(dataFrame[,1])$lengths),] } library(data.table) dt1 <- data.table(dat2, key=c('Date', 'Time')) f6<- function(dataTable){ dataTable[, .SD[.N], by='Date']} f7<- function(dataTable){ dataTable[dataTable[, .I[.N], by='Date']$V1] } f8<- function(dataTable){ dataTable[J(unique(Date)),,mult='last'] } f9<- function(dataTable){ dataTable[dataTable[, .I[.N], by='Date']$V1] } library(microbenchmark) microbenchmark(f1(dat2), f2(dat2), f3(dat2), f4(dat2), f5(dat2), f6(dt1), f7(dt1), f8(dt1), f9(dt1), times=100) #Unit: milliseconds # expr min lq median uq max neval # f1(dat2) 2046.59313 2318.57397 2414.21020 2533.28214 2842.9609 100 # f2(dat2) 940.97742 1000.56395 1027.53096 1100.67961 1705.4570 100 # f3(dat2) 315.06253 325.02696 341.21953 364.85656 533.9347 100 # f4(dat2) 804.89703 858.14888 899.55182 964.39989 1129.9311 100 # f5(dat2) 149.55682 153.67846 167.23934 176.56643 292.3134 100 # f6(dt1) 46665.61046 48234.78637 48818.88141 49366.46810 51112.7930 100 ###############################slowest # f7(dt1) 71.02789 76.97008 85.09989 97.82982 387.3801 100 # f8(dt1) 77.74961 78.94773 80.00620 89.00892 205.2492 100 # f9(dt1) 71.76817 76.40184 79.89194 100.57348 282.8359 100 #Comparing the fastest among data.table with f5() system.time(res8<- f8(dt1)) # user system elapsed # 0.08 0.00 0.08 system.time(res5<- f5(dat2)) # user system elapsed # 0.156 0.000 0.153 res8New<- as.data.frame(res8) row.names(res8New)<- row.names(res5) attr(res8New,"row.names")<- attr(res5,"row.names") identical(res8New,res5) #[1] TRUE #During debugging, the step that took a long time to execute is (same as Frank reported): debugonce(data.table:::`[.data.table`) dt1[, .SD[.N], by='Date'] debug: ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, grporder, o__, f__, len__, jsub, SDenv, cols, newnames, verbose) #I use Linux mint 15. sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.8 microbenchmark_1.3-0 stringr_0.6.2 [4] reshape2_1.2.2 loaded via a namespace (and not attached): [1] plyr_1.8 tcltk_3.0.1 tools_3.0.1 A.K. Steve, Thank you. arun, Could you run it with `microbenchmark` instead of system.time (with times = 100 or so) and paste the results here? Also, maybe you could use debugonce(data.table:::`[.data.table`) and then run x[, .SD[.N], by='Date'] to go step by step to find out the line that causes the lag, perhaps? Arun ________________________________ From: Arunkumar Srinivasan To: arun Sent: Friday, August 16, 2013 2:27 AM Subject: Re: [datatable-help] Slow execution: Extracting last value in each group Sorry, but I'm not sure what your question is here. There seems to be different timings between you and Steve. You want to get it verified as to which one is true? On my system, Steve's takes 0.003 seconds. However, a *faster* version than Steve's solution (on bigger data) would be: x[x[, .I[.N], by='Date']$V1] Arun On Friday, August 16, 2013 at 6:52 AM, arun wrote: >HI, >This is a follow up from a post in R-help mailing list. (http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html). > >[snip -- original message quoted in full above] From aragorn168b at gmail.com Fri Aug 23 11:49:08 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 23 Aug 2013 11:49:08 +0200 Subject: [datatable-help] Faster "CJ" Message-ID: Hi everybody, I think there's a faster version of the "CJ" function that's possible. The issue currently is that the sort is done at the very end using `setkey`, which works on the data *after* all the combinations have been generated, and therefore sorts a huge number of entries. However, a faster way would be to sort first (even before working out all combinations) and then use the hack: setattr(l, 'sorted', names(l)) Basically there are just 2 lines that need change (see bottom of the post). --------- First, here are some benchmarks on `CJ_fast` (see below) and `CJ` on relatively big data: w <- sample(1e4, 1e3) x <- sample(letters, 12) y <- sample(letters, 12) z <- sample(letters, 12) system.time(t1 <- do.call(CJ, list(w,x,y,z))) user system elapsed 0.775 0.052 0.835 system.time(t2 <- do.call(CJ_fast, list(w,x,y,z))) user system elapsed 0.220 0.001 0.221 identical(t1, t2) [1] TRUE --------- The function: (there are only two changes) CJ_fast <- function (...) { l = list(...)
if (length(l) > 1) { n = sapply(l, length) nrow = prod(n) x = c(rev(data.table:::take(cumprod(rev(n)))), 1L) # 1) SORT HERE for (i in seq(along = x)) l[[i]] = rep(sort(l[[i]], na.last = TRUE), each = x[i], length = nrow) } setattr(l, "row.names", .set_row_names(length(l[[1]]))) setattr(l, "class", c("data.table", "data.frame")) vnames = names(l) if (is.null(vnames)) vnames = rep("", length(l)) tt = vnames == "" if (any(tt)) { vnames[tt] = paste("V", which(tt), sep = "") setattr(l, "names", vnames) } data.table:::settruelength(l, 0L) l = alloc.col(l) # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED" setattr(l, 'sorted', names(l)) l } Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Aug 23 12:21:59 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 23 Aug 2013 12:21:59 +0200 Subject: [datatable-help] Faster "CJ" In-Reply-To: References: Message-ID: <6F6195ED3A204CE196D5190DF514AAD9@gmail.com> Filed this as FR #4849 here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4849&group_id=240&atid=978 Arun On Friday, August 23, 2013 at 11:49 AM, Arunkumar Srinivasan wrote: > Hi everybody, > > I think there's a faster version of the "CJ" function that's possible. The issue currently is that the sort is done at the very end using `setkey`, which works on the data *after* all the combinations have been generated, and therefore sorts a huge number of entries. > > However, a faster way would be to sort first (even before working out all combinations) and then use the hack: > > setattr(l, 'sorted', names(l)) > > Basically there are just 2 lines that need change (see bottom of the post). > > --------- > First, here are some benchmarks on `CJ_fast` (see below) and `CJ` on relatively big data: > > w <- sample(1e4, 1e3) > x <- sample(letters, 12) > y <- sample(letters, 12) > z <- sample(letters, 12) > > system.time(t1 <- do.call(CJ, list(w,x,y,z))) > user system elapsed > 0.775 0.052 0.835 > > system.time(t2 <- do.call(CJ_fast, list(w,x,y,z))) > user system elapsed > 0.220 0.001 0.221 > > > identical(t1, t2) > [1] TRUE > --------- > > The function: (there are only two changes) > > CJ_fast <- function (...) > { > l = list(...) > if (length(l) > 1) { > n = sapply(l, length) > nrow = prod(n) > x = c(rev(data.table:::take(cumprod(rev(n)))), 1L) > # 1) SORT HERE > for (i in seq(along = x)) l[[i]] = rep(sort(l[[i]], na.last = TRUE), each = x[i], > length = nrow) > } > setattr(l, "row.names", .set_row_names(length(l[[1]]))) > setattr(l, "class", c("data.table", "data.frame")) > vnames = names(l) > if (is.null(vnames)) > vnames = rep("", length(l)) > tt = vnames == "" > if (any(tt)) { > vnames[tt] = paste("V", which(tt), sep = "") > setattr(l, "names", vnames) > } > data.table:::settruelength(l, 0L) > l = alloc.col(l) > # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED" > setattr(l, 'sorted', names(l)) > l > } > > > Arun > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat Aug 24 09:57:53 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 24 Aug 2013 09:57:53 +0200 Subject: [datatable-help] column of named vectors in data.table and possible bug Message-ID: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com> Dear all, Suppose we've constructed a data.table in this manner: x <- c(1,1,1,2,2) y <- 6:10 setattr(y, 'names', letters[1:5]) DT<- data.table(A = x, B = y) DT$B a b c d e 6 7 8 9 10 You see that DT maintains the names of vector B.
But if we do: DT[, names(B), by=A] A V1 1: 1 a 2: 1 b 3: 1 c 4: 2 a 5: 2 b 6: 2 c There are two things here: First, you see that only the names for the first group (A = 1) are correct. Second, the rest of the result carries those same names, and the result is recycled to fit the length: instead of 5 rows, we get 6 rows. A way to get around it would be: DT[, names(DT$B)[.I], by=A] A V1 1: 1 a 2: 1 b 3: 1 c 4: 2 d 5: 2 e However, if one wants to do: DT[, list(list(B)), by=A]$V1 [[1]] a b c 6 7 8 [[2]] a b 9 10 You see that the names are once again wrong (for A = 2); just the first one remains right. My question is: is having names on column vectors supported usage? If so, then this should be a bug. If not, it'd be a great feature to have. Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From markyt23 at gmail.com Sun Aug 25 06:42:45 2013 From: markyt23 at gmail.com (marcussr) Date: Sat, 24 Aug 2013 21:42:45 -0700 (PDT) Subject: [datatable-help] Column retrieval Message-ID: <1377405765500-4674472.post@n4.nabble.com> I have a data table with columns: Case Number, Created by Analyst, Met FCR, and Date. I want to show all the rows that were created by a certain analyst. I can't seem to find the function to interpret text. Please Help -- View this message in context: http://r.789695.n4.nabble.com/Column-retrieval-tp4674472.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Sun Aug 25 10:35:10 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 25 Aug 2013 09:35:10 +0100 Subject: [datatable-help] Column retrieval In-Reply-To: <1377405765500-4674472.post@n4.nabble.com> References: <1377405765500-4674472.post@n4.nabble.com> Message-ID: <5219C1BE.7090309@mdowle.plus.com> Hi. Welcome to the list. Assuming you really do have a data.table and not a data.frame and have joined the right mailing list, then the term "interpret text" is very vague (!) but it's possible to guess ... DT[ `Created by Analyst` == "certain analyst" ] maybe. That's a vector scan, but see the intro vignette for the faster binary search way. The backquotes are needed if you have spaces in the column names. Please browse and search questions on the data.table tag on Stack Overflow: http://stackoverflow.com/questions/tagged/data.table Matthew On 25/08/13 05:42, marcussr wrote: > I have a data table with columns: Case Number, Created by Analyst, Met FCR, > and Date. > I want to show all the rows that were created by a certain analyst. I can't > seem to find the function to interpret text. Please Help > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Column-retrieval-tp4674472.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From markyt23 at gmail.com Sun Aug 25 17:29:19 2013 From: markyt23 at gmail.com (marcussr) Date: Sun, 25 Aug 2013 08:29:19 -0700 (PDT) Subject: [datatable-help] Column retrieval In-Reply-To: <5219C1BE.7090309@mdowle.plus.com> References: <1377405765500-4674472.post@n4.nabble.com> <5219C1BE.7090309@mdowle.plus.com> Message-ID: <1377444559070-4674495.post@n4.nabble.com> Thank you, this worked.
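[A quick sketch of the keyed (binary search) alternative Matthew mentions above, assuming a data.table DT with the same column names as in the question; setkeyv() sorts once, after which the lookup does not scan every row:]

library(data.table)
setkeyv(DT, "Created by Analyst")  # one-off sort by that column
DT[J("certain analyst")]           # keyed subset via binary search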
-- View this message in context: http://r.789695.n4.nabble.com/Column-retrieval-tp4674472p4674495.html Sent from the datatable-help mailing list archive at Nabble.com. From smartpink111 at yahoo.com Sun Aug 25 22:27:13 2013 From: smartpink111 at yahoo.com (arun) Date: Sun, 25 Aug 2013 13:27:13 -0700 (PDT) Subject: [datatable-help] Looking for a faster method Message-ID: <1377462433.85205.YahooMailNeo@web142603.mail.bf1.yahoo.com> Hi, I tried a data.table() method to solve the problem in the link below: http://r.789695.n4.nabble.com/how-to-combine-apply-and-which-or-alternative-ways-to-do-so-td4674424.html#a4674434 But it was not that fast. set.seed(24) vec1<- sample(1e5,1e3,replace=FALSE) set.seed(48) vec2<- sample(1e3,1e6,replace=TRUE) system.time({res1<- tapply(vec1,1:1e3,FUN=function(i) {which(vec2==i)})}) # user system elapsed # 3.912 0.000 3.880 system.time(res2<- sapply(vec1,function(x) which(vec2%in%x))) # user system elapsed # 24.368 0.000 23.247 vecR1<-unlist(res1) names(vecR1)<-NULL vecR2<- unlist(res2) identical(vecR1,vecR2) #[1]TRUE library(data.table) dt1<- data.table(vec1,Group=1:1e3,key='Group') system.time({res3<- dt1[,list(list(which(vec1==vec2))),by=Group]})##Not that fast # user system elapsed # 3.756 0.120 3.886 ###### identical(vecR1,unlist(res3$V1)) #[1] TRUE Is there a faster way? Thanks. A.K. From aragorn168b at gmail.com Sun Aug 25 22:58:56 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 25 Aug 2013 22:58:56 +0200 Subject: [datatable-help] Looking for a faster method In-Reply-To: <1377462433.85205.YahooMailNeo@web142603.mail.bf1.yahoo.com> References: <1377462433.85205.YahooMailNeo@web142603.mail.bf1.yahoo.com> Message-ID: <9697748E77C14BBB8FFF82B5A905F2DC@gmail.com> How about this? system.time(out <- data.table(id=seq(vec2), val=vec2, key="val")[J(vec1)][, list(list(id)), by=val]$V1) user system elapsed 0.098 0.004 0.103 Arun On Sunday, August 25, 2013 at 10:27 PM, arun wrote: > Hi, > I tried a data.table() method to solve the problem in the link below: > > http://r.789695.n4.nabble.com/how-to-combine-apply-and-which-or-alternative-ways-to-do-so-td4674424.html#a4674434 > > But it was not that fast. > > set.seed(24) > vec1<- sample(1e5,1e3,replace=FALSE) > set.seed(48) > vec2<- sample(1e3,1e6,replace=TRUE) > system.time({res1<- tapply(vec1,1:1e3,FUN=function(i) {which(vec2==i)})}) # user system elapsed > # 3.912 0.000 3.880 > > system.time(res2<- sapply(vec1,function(x) which(vec2%in%x))) > # user system elapsed > # 24.368 0.000 23.247 > vecR1<-unlist(res1) > names(vecR1)<-NULL > vecR2<- unlist(res2) > identical(vecR1,vecR2) > #[1]TRUE > > library(data.table) > dt1<- data.table(vec1,Group=1:1e3,key='Group') > system.time({res3<- dt1[,list(list(which(vec1==vec2))),by=Group]})##Not that fast > # user system elapsed > # 3.756 0.120 3.886 ###### > identical(vecR1,unlist(res3$V1)) > #[1] TRUE > > > > Is there a faster way? > > Thanks. > > A.K. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
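[To unpack the idiom above for readers new to it: instead of scanning vec2 once per element of vec1 (as the sapply/which versions do), it builds a keyed table of positions once, so the lookups become a single binary-search join. A tiny sketch with toy vectors; the vector values here are illustrative only:]

library(data.table)
vec1 <- c(3L, 1L)              # values to look up
vec2 <- c(1L, 3L, 3L, 2L, 1L)  # vector to search in
pos <- data.table(id = seq_along(vec2), val = vec2, key = "val")
pos[J(vec1)][, list(list(id)), by = val]$V1
# [[1]] 2 3   (positions of 3 in vec2)
# [[2]] 1 5   (positions of 1 in vec2)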
From mdowle at mdowle.plus.com  Mon Aug 26 01:42:56 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Mon, 26 Aug 2013 00:42:56 +0100
Subject: [datatable-help] Looking for a faster method
In-Reply-To: <9697748E77C14BBB8FFF82B5A905F2DC@gmail.com>
References: <1377462433.85205.YahooMailNeo@web142603.mail.bf1.yahoo.com>
 <9697748E77C14BBB8FFF82B5A905F2DC@gmail.com>
Message-ID: <521A9680.80402@mdowle.plus.com>

Or a slight refinement of that:

system.time(out2 <- data.table(id=seq(vec2), val=vec2,
    key="val")[J(vec1), list(list(id))]$V1)

Matthew

On 25/08/13 21:58, Arunkumar Srinivasan wrote:
> How about this?
>
> system.time(out <- data.table(id=seq(vec2), val=vec2,
>     key="val")[J(vec1)][, list(list(id)), by=val]$V1)

From mailinglist.honeypot at gmail.com  Tue Aug 27 19:23:15 2013
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Tue, 27 Aug 2013 10:23:15 -0700
Subject: [datatable-help] unique.data.frame should create a copy, right?
In-Reply-To:
References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com>
 <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com>
 <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com>
Message-ID:

Last update here :-)

After more hemming and hawing, I've changed the name of the new
parameter added to duplicated.data.table and unique.data.table from
`by.columns` to just `by`, as it is (more or less) the same idea as the
`by` in dt[i, j, by, ...]

Sorry for any inconvenience caused if you've been working off of the
development version.

-steve

On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta wrote:
> Steve, great stuff!!
> thanks for making that happen
>
> Rick
>
> On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou wrote:
>> Hi all,
>>
>> As I needed this sooner than I had expected, I just committed this
>> change. It's in svn revision 889.
>> I chose 'by.columns' as the parameter name -- seemed to make more
>> sense to me, and using the shorthand interactively saves a letter,
>> e.g.: unique(dt, by=c('some', 'columns')) ;-)
>>
>> Here's the note from the NEWS file:
>>
>> o "Uniqueness" tests can now specify arbitrary combinations of
>> columns to use to test for duplicates. `by.columns` parameter added to
>> unique.data.table and duplicated.data.table. This allows the user to
>> test for uniqueness using any combination of columns in the
>> data.table, where previously the user only had the option to use the
>> keyed columns (if keyed) or all columns (if not). The default behavior
>> sets `by.columns=key(dt)` to maintain backward compatibility. See
>> man/duplicated.Rd and tests 986:991 for more information. Thanks to
>> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful
>> discussions.
>>
>> Should work as advertised assuming my unit tests weren't too simplistic.
>>
>> Cheers,
>> -steve
>>
>> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou wrote:
>> > Thanks for the suggestions, folks.
>> >
>> > Matthew: do you have a preference?
>> >
>> > -steve
>> >
>> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta wrote:
>> >> Steve,
>> >>
>> >> I like your suggestion a lot. I can see putting column specification
>> >> to good use.
>> >>
>> >> As for the argument name, perhaps 'use.columns'
>> >>
>> >> And where a value of NULL or FALSE will yield the same results as
>> >> `unique.data.frame`:
>> >>
>> >> use.columns=key(x) # default behavior
>> >> use.columns=c("col1name", "col7name") # etc
>> >> use.columns=NULL
>> >>
>> >> Thanks as always,
>> >> Rick
>> >>
>> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou wrote:
>> >>> Hi folks,
>> >>>
>> >>> I actually want to revisit the fix I made here.
>> >>>
>> >>> Instead of having `use.key` in the signature of unique.data.table
>> >>> (and duplicated.data.table), like so:
>> >>>
>> >>> function(x,
>> >>>          incomparables=FALSE,
>> >>>          tolerance=.Machine$double.eps ^ 0.5,
>> >>>          use.key=TRUE, ...)
>> >>>
>> >>> how about we switch out use.key for a parameter that specifies the
>> >>> column names to use in the uniqueness check, which defaults to key(x)
>> >>> to keep backwards compatibility?
>> >>>
>> >>> For argument's sake (like that?), let's call this parameter `columns`
>> >>> (by.columns? with.columns? whatever) so:
>> >>>
>> >>> function(x,
>> >>>          incomparables=FALSE,
>> >>>          tolerance=.Machine$double.eps ^ 0.5,
>> >>>          columns=key(x), ...)
>> >>>
>> >>> Then:
>> >>>
>> >>> (1) leaving it alone is the backward compatible behavior;
>> >>> (2) perhaps setting it to NULL will use all columns, and make it
>> >>> equivalent to unique.data.frame (also the same when x has no key); and
>> >>> (3) setting it to any other combo of columns uses those columns as the
>> >>> uniqueness key and filters the rows (only) out of x accordingly.
>> >>>
>> >>> What do you folks think? Personally I think this is better on all
>> >>> accounts than just specifying whether or not to use the key, and the
>> >>> only question in my mind is the name of the argument -- happy to hear
>> >>> other world views, however, so don't be shy.
>> >>>
>> >>> Thanks,
>> >>> -steve
>> >>>
>> >>> --
>> >>> Steve Lianoglou
>> >>> Computational Biologist
>> >>> Bioinformatics and Computational Biology
>> >>> Genentech

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
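A small sketch of the renamed argument in use, on hypothetical example data
(per the NEWS note above, the default by=key(dt) keeps the old behavior in
the version discussed here):

library(data.table)

dt <- data.table(a = c(1, 1, 2, 2), b = c(1, 2, 1, 2), d = c(1, 1, 1, 2))

unique(dt, by = c("a", "d"))  # uniqueness judged on columns a and d only
duplicated(dt, by = "a")      # TRUE for rows whose 'a' value appeared before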
From kofmank at gmail.com  Thu Aug 29 16:36:51 2013
From: kofmank at gmail.com (Kostia Kofman)
Date: Thu, 29 Aug 2013 17:36:51 +0300
Subject: [datatable-help] Error in eval(expr, envir, enclos) : object 'BlaBla' not found
Message-ID:

Hi,

I'm new to R and data.table and am having some difficulty with a function
that I wrote. The function code:

freq_per = function(data, byWhat, month) {
  for (i in 1:length(month)) {
    data = data.table(data)
    data[, percent := sum(freq), by = byWhat]
    data[, percent := (freq/percent)*100]
    data = data.frame(data)
    data$percent = format(round(data$percent), nsmall = 2)
  }
  data
}

I have tried to call the function with freq_per(data, list(BlaBla), month)
and I get the error message in the subject line.

I have tried to create a global variable with byWhat = list(BlaBla); it
didn't work because the object 'BlaBla' was not found.

I also tried to use keyby instead of by with c(colnames(data)[1]), but the
results that I got are not right.

Does anybody have an idea how to overcome the problem?

Thanks.

From FErickson at psu.edu  Thu Aug 29 16:44:53 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Thu, 29 Aug 2013 09:44:53 -0500
Subject: [datatable-help] Error in eval(expr, envir, enclos) : object 'BlaBla' not found
In-Reply-To:
References:
Message-ID:

Try "BlaBla" instead of list(BlaBla)

(oops, forgot to reply to the mailing list the first time I sent this.
Sorry for the double email, Kostia.)

On Thu, Aug 29, 2013 at 9:36 AM, Kostia Kofman wrote:
> Does anybody have an idea how to overcome the problem?
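A minimal sketch of why the quoted-string form works, on hypothetical data:
`by` accepts a character vector of column names, which is exactly what a
function argument like byWhat can pass through:

library(data.table)

d <- data.table(BlaBla = c("x", "x", "y"), freq = c(1L, 2L, 3L))
byWhat <- "BlaBla"                      # a column name, as a string
d[, percent := sum(freq), by = byWhat]  # groups by the BlaBla column
d
#    BlaBla freq percent
# 1:      x    1       3
# 2:      x    2       3
# 3:      y    3       3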
From kofmank at gmail.com  Thu Aug 29 16:57:01 2013
From: kofmank at gmail.com (Kostia Kofman)
Date: Thu, 29 Aug 2013 17:57:01 +0300
Subject: [datatable-help] Error in eval(expr, envir, enclos) : object 'BlaBla' not found
In-Reply-To:
References:
Message-ID:

This is the error I receive trying "BlaBla" instead of list(BlaBla):

Error in `[.data.table`(data, , `:=`(percent, sum(freq)), by = byWhat) :
  Type of RHS ('integer') must match LHS ('character'). To check and coerce
  would impact performance too much for the fastest cases. Either change the
  type of the target column, or coerce the RHS of := yourself (e.g. by using
  1L instead of 1)

On Thu, Aug 29, 2013 at 5:44 PM, Frank Erickson wrote:
> Try "BlaBla" instead of list(BlaBla)

From FErickson at psu.edu  Thu Aug 29 17:04:10 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Thu, 29 Aug 2013 10:04:10 -0500
Subject: [datatable-help] Error in eval(expr, envir, enclos) : object 'BlaBla' not found
In-Reply-To:
References:
Message-ID:

Hi Kostia,

I think that you already have a column named "percent" that has type
"character" -- when you use "format", the result is a character, for
example. If so -- use str(data) to check -- you'll need to delete it
first, using data[, percent := NULL].

If you're confused about how data.tables work and what they are useful
for (which I think might be the case, based on your conversion to a
data.frame and use of format), you might want to go through the
vignettes and other resources here:
http://datatable.r-forge.r-project.org/

Best,
Frank

On Thu, Aug 29, 2013 at 9:57 AM, Kostia Kofman wrote:
> This is the error I receive trying "BlaBla" instead of list(BlaBla)
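A sketch reproducing the clash Frank describes, on hypothetical data:
format() leaves percent as a character column, after which a grouped
integer := into it fails until the column is dropped:

library(data.table)

d <- data.table(BlaBla = c("x", "x", "y"), freq = c(1L, 2L, 3L))
d[, percent := sum(freq), by = BlaBla]              # percent is integer here
d[, percent := format(round(percent), nsmall = 2)]  # whole-column replace: now character
# d[, percent := sum(freq), by = BlaBla]            # would now error: RHS integer, LHS character
d[, percent := NULL]                                # drop the character column first
d[, percent := sum(freq), by = BlaBla]              # the grouped assignment works again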
From serdar.akin at prosales.com  Fri Aug 30 09:04:35 2013
From: serdar.akin at prosales.com (Serdar Akin)
Date: Fri, 30 Aug 2013 09:04:35 +0200
Subject: [datatable-help] Row wise operation in data.table
Message-ID:

Hi,

Currently I'm trying to find a way to do a row-wise operation within a
data.table: find values that match a certain pattern, and count them.
For instance, find the number of 3s for each Respid and count them for
each row.

set.seed(1)
DT <- data.table(Respid = seq(1, 100, 1),
                 Q1 = rep(1:5, each = 20),
                 Q2 = as.integer(runif(100, min = 1, max = 5)),
                 Q3 = sample(1:5, 100, replace = TRUE))

DT1 <- DT[, lapply(.SD, function(x) length(grep(3, x))), by = 'Respid']

I do get 1 for each column that has a 3 in it, but no column that counts
them.

Regards, Serdar

From aragorn168b at gmail.com  Fri Aug 30 09:13:55 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 30 Aug 2013 09:13:55 +0200
Subject: [datatable-help] Row wise operation in data.table
In-Reply-To:
References:
Message-ID: <338DB2FD29D44806902550FB138A6BE8@gmail.com>

It'd be easier if you just `melt` your `data.table` by `id = Respid`.
Then you get a long format with which you can use fast grouping to count
the number of 3s.

require(reshape2)
require(data.table)

setkey(as.data.table(melt(DT, id="Respid")), Respid)[value == 3, .N,
    by=Respid][DT]

Arun

On Friday, August 30, 2013 at 9:04 AM, Serdar Akin wrote:
> For instance, find the number of 3s for each Respid and count them for
> each row.
From serdar.akin at prosales.com  Fri Aug 30 09:30:20 2013
From: serdar.akin at prosales.com (Serdar Akin)
Date: Fri, 30 Aug 2013 09:30:20 +0200
Subject: [datatable-help] Row wise operation in data.table
In-Reply-To: <338DB2FD29D44806902550FB138A6BE8@gmail.com>
References: <338DB2FD29D44806902550FB138A6BE8@gmail.com>
Message-ID:

Thanks Arunkumar,

Very good and fast :)

/Serdar

2013/8/30 Arunkumar Srinivasan
> It'd be easier if you just `melt` your `data.table` by `id = Respid`.
> Then you get a long format with which you can use fast grouping to count
> the number of 3s.
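The melt-based answer above, split into steps using the DT from the
question, with a row-wise cross-check at the end (the rowSums line is a
plain alternative, not part of the posted answer):

library(data.table)
library(reshape2)

long <- as.data.table(melt(DT, id = "Respid"))  # one row per Respid x question
setkey(long, Respid)
counts <- long[value == 3, .N, by = Respid]     # number of 3s per respondent
counts[DT]                                      # join back onto DT; N is NA where a row has no 3s

# Row-wise alternative: count the 3s across the question columns directly
DT[, n3 := rowSums(.SD == 3), .SDcols = c("Q1", "Q2", "Q3")]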