From martin.dunelm at gmail.com Thu Sep 4 15:09:00 2014
From: martin.dunelm at gmail.com (Martin Watts)
Date: Thu, 4 Sep 2014 14:09:00 +0100
Subject: [datatable-help] Unexpected Result Reading in Data File using fread
Message-ID:

All,

I am trying to read in a data file using fread(). I am getting several warnings indicating that a non-numeric entry was found in a numeric field and that, as a result, the column is being converted to a character vector. However, the non-numeric entry is one of the declared na.strings, and indeed the specific entry is returned as NA.

I expected that the "?" entry would be recognised as NA and the column read as a numeric vector. I have tried the same action with read.table() and it works as I was expecting.

I am using:
R version 3.1.1 (pre-compiled)
RStudio Version 0.98.983
data.table package v1.9.2
locale is: en_GB.UTF-8
on: OS-X Version 10.9.4

The code I am using is:

library("data.table")
column.class <- c(rep("character",2), rep("numeric",7))
data2 <- fread("./data/household_power_consumption.txt",
               sep=";",
               na.strings=c("?",""),
               colClasses=column.class,
               header=TRUE,
               nrows=7000,
               verbose=TRUE
)

The first line in the data file causing the problem, plus the one before it, are:

21/12/2006;11:22:00;0.244;0.000;242.290;1.000;0.000;0.000;0.000
21/12/2006;11:23:00;?;?;?;?;?;?;

The first warning is:

1: In fread("./data/household_power_consumption.txt", na.strings = "?") :
  Bumped column 3 to type character on data row 6840, field contains '?'.
  Coercing previously read values in this column from integer or numeric back
  to character which may not be lossless; e.g., if '00' and '000' occurred
  before they will now be just '0', and there may be inconsistencies with
  treatment of ',,' and ',NA,' too (if they occurred in this column before
  the bump). If this matters please rerun and set 'colClasses' to 'character'
  for this column. Please note that column type detection uses the first 5
  rows, the middle 5 rows and the last 5 rows, so hopefully this message
  should be very rare. If reporting to datatable-help, please rerun and
  include the output from verbose=TRUE.

Martin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Sat Sep 6 01:20:39 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 6 Sep 2014 01:20:39 +0200
Subject: [datatable-help] Unexpected Result Reading in Data File using fread
In-Reply-To:
References:
Message-ID:

Hi Martin,

I'd recommend first trying the current development version to see whether this has already been fixed. Matt has already fixed some recurring fread bugs. You can get it from here: https://github.com/Rdatatable/data.table (please scroll down to see the installation instructions).

If you still get the error, could you please file a bug report at https://github.com/Rdatatable/data.table/issues with a *reproducible example*? If necessary, you can also link to a *minimal* file that reproduces the issue; that would be very helpful.

Thanks,
Arun

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jmtruppia at gmail.com Thu Sep 11 22:16:04 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Thu, 11 Sep 2014 17:16:04 -0300
Subject: [datatable-help] Update table from other table
Message-ID:

What is the best data.table way of doing something similar to UPDATE FROM in SQL?

I used to do something like

dta = data.table(idx = c(1, 2, 3), a = runif(3), key = "idx")
dtb = data.table(idx = c(1, 3), b = runif(3), key = "idx")
dta[dtb, b := b]

However, since 1.9.3 and the explicit .EACHI, it fails sometimes, but I can't determine when.

So, just to be sure, I do

dta[dtb, b := b, .EACHI = TRUE, nomatch = 0]

Are the .EACHI and the nomatch necessary?

In this case, I want the rows with idx 1 and 3 (the matching ones) to end up with a b value from the matching b column in dtb, and the row with idx 2 (the one that isn't in dtb) to end up with NA in column b.

From aragorn168b at gmail.com Fri Sep 12 17:14:26 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 12 Sep 2014 17:14:26 +0200
Subject: [datatable-help] Update table from other table
In-Reply-To:
References:
Message-ID:

I think you mean:

dta[dtb, b:=b, by=.EACHI]

and not .EACHI = TRUE. Not sure what the use of nomatch=0L is along with :=.

by=.EACHI does exactly what it says, really. It evaluates j for each i match. Let's first see the matches:

dta[dtb, which=TRUE]
# [1] 1 1 3

So, the first row of dtb matches the first row of dta. The second row of dtb also matches the 1st row of dta, and so on.

When you add by=.EACHI, as shown at the top, the j-expression is evaluated on each of these matches. So it'll be evaluated 3 times here. Without it, on the other hand, j is evaluated once. In this case it doesn't make a difference either way, so you should avoid by=.EACHI, as it'll be slower with it.

It's particularly useful when you'd like to perform operations in j that depend on the values in that group. For example, consider these data.tables dt1 and dt2:

dt1 = data.table(x=rep(1:4, each=2), y=1:8, key="x")
dt2 = data.table(x=3:5, z=10, key="x")

And you'd like to get sum(y)*z while joining. If not for the by=.EACHI feature, you'd approach the problem like this:

dt1[dt2][, list(agg = sum(y)*z[1]), by=x]

With by=.EACHI, this is simply:

dt1[dt2, list(agg=sum(y)*z), by=.EACHI]

Here, your expression is evaluated on each i.

Another interesting use case is, say, you'd like to create a lagged vector of y:

dt1[dt2, list(y=y, lagy = c(NA, head(y,-1)), z=z), by=.EACHI]

It's that simple, really. Basically, if the result of the operation you're performing in j depends on whether j is executed for each group or once as a whole, then you're most likely looking for by=.EACHI. If not, by=.EACHI has no effect on the result, and you want a normal join there.

This is not a textbook definition, rather my understanding of this awesome feature!

Hope this helps.
Arun
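A minimal sketch of the comparison described above, reusing the dt1 and dt2 tables from the message. The names two_step and each_i are only illustrative and do not appear in the thread, and the sketch assumes a data.table version that supports by=.EACHI.

```r
library(data.table)

dt1 <- data.table(x = rep(1:4, each = 2), y = 1:8, key = "x")
dt2 <- data.table(x = 3:5, z = 10, key = "x")

# Two steps: materialise the join first, then aggregate by x.
two_step <- dt1[dt2][, list(agg = sum(y) * z[1]), by = x]

# One step: j is evaluated once for each row of dt2 (the i table).
each_i <- dt1[dt2, list(agg = sum(y) * z), by = .EACHI]

# Both give agg = 110, 150, NA for x = 3, 4, 5
# (x = 5 has no match in dt1, so sum(y) is NA there).
print(two_step)
print(each_i)
```

The one-step form avoids building the full joined table before aggregating, which is the motivation given in the message.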
From jmtruppia at gmail.com Fri Sep 12 17:46:39 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Fri, 12 Sep 2014 12:46:39 -0300
Subject: [datatable-help] Update table from other table
In-Reply-To:
References:
Message-ID:

Great! (Sorry, .EACHI = TRUE was an old definition.)

It's also good to know that nomatch = 0 is irrelevant when using :=. I always used it to avoid the rows in dtb creeping into dta as NAs.

Also, it's really useful to know that by = .EACHI should be used when the calculations you are performing depend on the group. This came in really handy yesterday, and should be emphasized in the .EACHI description. Should I submit a pull request?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Sat Sep 13 01:08:44 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 13 Sep 2014 01:08:44 +0200
Subject: [datatable-help] Update table from other table
In-Reply-To:
References:
Message-ID:

Glad it helped.
Always welcome "pull requests" :).

Arun

From f_j_rod at hotmail.com Sun Sep 14 14:01:48 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Sun, 14 Sep 2014 14:01:48 +0200
Subject: [datatable-help] To apply different instructions in data table
Message-ID:

Hello everyone,

I have a rather difficult question about the following data table with grouped data (I really don't know how to do it). The data set is very long; as an example, here are the first 10 rows:

DT <- data.table(ID=c(1,1,2,2,2,3,3,4,5,5),
  start=as.Date(c("1985-01-01","1993-07-15","1993-05-17","1998-02-25","1997-10-28",
                  "2000-05-25","1995-09-02","1998-03-01","1992-02-26","1994-07-22")),
  end=as.Date(c("1992-05-01","1997-02-01","1997-10-20","1999-10-15","2003-08-25",
                "2000-01-27","2002-04-15","2003-10-02","1997-03-17","2002-08-19")),
  reason=(c("Q2","Vacancy","R3","Vacancy","Vacancy","Vacancy","Q3","R2","S2","R1")))

    ID      start        end  reason
 1:  1 1985-01-01 1992-05-01      Q2
 2:  1 1993-07-15 1997-02-01 Vacancy
 3:  2 1993-05-17 1997-10-20      R3
 4:  2 1998-02-25 1999-10-15 Vacancy
 5:  2 1997-10-28 2003-08-25 Vacancy
 6:  3 2000-05-25 2000-01-27 Vacancy
 7:  3 1995-09-02 2002-04-15      Q3
 8:  4 1998-03-01 2003-10-02      R2
 9:  5 1992-02-26 1997-03-17      S2
10:  5 1994-07-22 2002-08-19      R1

I would like to construct a small function that keeps ONLY ONE OBSERVATION PER subject, in the following way:

1) Keep the subject's oldest date in the "start" column.
2) With regard to the "end" column, we can have the following situations:
   a) The ID does NOT CONTAIN "Vacancy" in any of its observations: then I must keep the date closest to the present time, and its corresponding reason (ID=4 and ID=5).
   b) The ID contains "Vacancy" in some of its observations: if "Vacancy" appears only once, I will keep its corresponding end date, and the reason will be "Vacancy" (ID=1 and ID=3). If "Vacancy" appears two or more times, then I will keep as "end" the oldest date among the rows which contain "Vacancy" (ID=2).

The expected result is:

   ID      start        end  reason
1:  1 1985-01-01 1997-02-01 Vacancy
2:  2 1993-05-17 1999-10-15 Vacancy
3:  3 1995-09-02 2000-01-27 Vacancy
4:  4 1998-03-01 2003-10-02      R2
5:  5 1992-02-26 2002-08-19      R1

Thanks in advance for any help!!
From jholtman at gmail.com Mon Sep 15 14:49:05 2014
From: jholtman at gmail.com (jholtman)
Date: Mon, 15 Sep 2014 05:49:05 -0700 (PDT)
Subject: [datatable-help] To apply different instructions in data table
In-Reply-To:
References:
Message-ID: <1410785345188-4696950.post@n4.nabble.com>

try this:

require(data.table)
DT <- data.table(ID=c(1,1,2,2,2,3,3,4,5,5),
  start=as.Date(c("1985-01-01","1993-07-15","1993-05-17","1998-02-25","1997-10-28",
                  "2000-05-25","1995-09-02","1998-03-01","1992-02-26","1994-07-22")),
  end=as.Date(c("1992-05-01","1997-02-01","1997-10-20","1999-10-15","2003-08-25",
                "2000-01-27","2002-04-15","2003-10-02","1997-03-17","2002-08-19")),
  reason=(c("Q2","Vacancy","R3","Vacancy","Vacancy","Vacancy","Q3","R2","S2","R1")))

DT[
  , {
      if (all(reason != "Vacancy")){
          # no Vacancy rows: keep the latest end date and its reason
          indx <- which.max(end)
          result <- list(start = min(start)
                       , end = end[indx]
                       , reason = reason[indx]
                       )
      } else {
          if (sum(reason == "Vacancy") == 1){
              # exactly one Vacancy row: keep its end date
              indx <- which(reason == "Vacancy")
              result <- list(start = min(start)
                           , end = end[indx]
                           , reason = reason[indx]
                           )
          } else {
              # several Vacancy rows: keep the oldest end date among them
              indx <- which(reason == "Vacancy" & end == min(end[reason == "Vacancy"]))
              result <- list(start = min(start)
                           , end = end[indx]
                           , reason = reason[indx]
                           )
          }
      }
      result
    }
  , by = ID
  ]

--
View this message in context: http://r.789695.n4.nabble.com/To-apply-different-instructions-in-data-table-tp4696925p4696950.html
Sent from the datatable-help mailing list archive at Nabble.com.

From f_j_rod at hotmail.com Mon Sep 15 16:30:59 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 15 Sep 2014 16:30:59 +0200
Subject: [datatable-help] To apply different instructions in data table
In-Reply-To: <1410785345188-4696950.post@n4.nabble.com>
References: , <1410785345188-4696950.post@n4.nabble.com>
Message-ID:

JIM, thank you very much!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jmtruppia at gmail.com Mon Sep 15 19:52:06 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Mon, 15 Sep 2014 14:52:06 -0300
Subject: [datatable-help] Update table from other table
In-Reply-To:
References:
Message-ID:

Arun, sometimes it helps to have nomatch = 0 when using :=

dta = data.table(idx = c(1,1), key = "idx")
dtb = data.table(idx = c(1,2), val = c("a", "b"), key = "idx")

This fails because of cartesian join not allowed by default

dta[dtb, val := i.val]

but this doesn't

dta[dtb, val := i.val, nomatch = 0]

This is the same as doing

dta[dtb, val := i.val, allow.cartesian = TRUE]

From my.r.help at gmail.com Tue Sep 16 05:51:43 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Tue, 16 Sep 2014 11:51:43 +0800
Subject: [datatable-help] Update table from other table
In-Reply-To:
References:
Message-ID: <5417B3CF.8040409@gmail.com>

That's interesting. Internally, which join is more efficient? The first one (using nomatch=0) I suppose?

M

From f_j_rod at hotmail.com Tue Sep 16 19:04:10 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Tue, 16 Sep 2014 19:04:10 +0200
Subject: [datatable-help] Replace missings by dates in data table
Message-ID:

Hi to all members of the list,

Let's say I have the following data (small example):

DT <- data.table(ID=c(1,1,2),
                 start=c("1985-01-01","1993-07-15","1993-05-17"),
                 end=c("1992-05-01","1997-02-01",NA))

I would like to replace missing values with "01-01-2000" in the "end" variable, and convert both the "start" and "end" columns to class Date. I tried the code:

DT[ , c("start", "end"):=list(as.Date(start,format="%d/%m/%Y",origin="1900-10-01"),
      as.Date(ifelse(is.na(end),"01/01/2000",end),format="%d/%m/%Y",origin="1900-10-01")),
    by=ID]

Error in `[.data.table`(DT, , `:=`(c("start", "end"), list(as.Date(start, :
  Type of RHS ('double') must match LHS ('character').

What do I have to change?
Am I doing it in a too complicated way? Thanks in advance for any help!! -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Tue Sep 16 19:27:13 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Tue, 16 Sep 2014 13:27:13 -0400 Subject: [datatable-help] Replace missings by dates in data table In-Reply-To: References: Message-ID: On Tue, Sep 16, 2014 at 1:04 PM, Frank S. wrote: > Hi to all members of the list, > > > > Let's say I have the following data (small example): > > > > DT <- data.table(ID=c(1,1,2), > start=c("1985-01-01","1993-07-15","1993-05-17"), > end=c("1992-05-01","1997-02-01",NA)) > > I would want to replace missing values by "01-01-2000" in "end" variable, > and convert both "start" and "end" columns in as.Date class. > > I tried the code: > > > > DT[ , c("start", > "end"):=list(as.Date(start,format="%d/%m/%Y",origin="1900-10-01"), > > as.Date(ifelse(is.na(end),"01/01/2000",end),format="%d/%m/%Y",origin="1900-10-01")), > by=ID] > > > > Error in `[.data.table`(DT, , `:=`(c("start", "end"), list(as.Date(start, : > Type of RHS ('double') must match LHS ('character'). > > > > What I have to change? Am I doing it in a too complicated way? DT[, list(ID, start = as.Date(start), end = as.Date(replace(end, is.na(end), "2000-01-01")))] From f_j_rod at hotmail.com Wed Sep 17 09:58:47 2014 From: f_j_rod at hotmail.com (Frank S.) Date: Wed, 17 Sep 2014 09:58:47 +0200 Subject: [datatable-help] Replace missings by dates in data table In-Reply-To: References: , Message-ID: Gabor, many thanks for your answer!! -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu Sep 18 00:04:21 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 18 Sep 2014 00:04:21 +0200 Subject: [datatable-help] Update table from other table In-Reply-To: References:

Message-ID:

Arun, sometimes it helps to have nomatch = 0 when using :=

dta = data.table(idx = c(1,1), key = "idx")
dtb = data.table(idx = c(1,2), val = c("a", "b"), key = "idx")

This fails because of cartesian join not allowed by default

That's because of a bug. `allow.cartesian` error shouldn't occur with `:=` at all, as the number of rows will *never* exceed `x`. IIRC, the allow.cartesian bugs are scheduled to be fixed for the next-next release (after 1.9.4).

Arun

From: Juan Manuel Truppia
Reply: Juan Manuel Truppia
Date: September 15, 2014 at 7:52:26 PM
To: Arunkumar Srinivasan
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Update table from other table

Arun, sometimes it helps to have nomatch = 0 when using :=

dta = data.table(idx = c(1,1), key = "idx")
dtb = data.table(idx = c(1,2), val = c("a", "b"), key = "idx")

This fails because of cartesian join not allowed by default

dta[dtb, val := i.val]

but this doesn't

dta[dtb, val := i.val, nomatch = 0]

This is the same as doing

dta[dtb, val := i.val, allow.cartesian = TRUE]

On Fri, Sep 12, 2014 at 8:08 PM, Arunkumar Srinivasan wrote:

Glad it helped. Always welcome "pull requests" :).

Arun

From: Juan Manuel Truppia
Reply: Juan Manuel Truppia
Date: September 12, 2014 at 5:46:59 PM
To: Arunkumar Srinivasan
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Update table from other table

Great! (sorry, .EACHI = TRUE was an old definition). It's good to know also that nomatch = 0 is irrelevant when using :=, I always used it to avoid the rows in dtb creeping in dtb as NAs. Also, it's really useful to know that by = .EACHI should be used when the calculations you are performing depend on the group or not. This came in really handy yesterday, and should be emphasized in the .EACHI description. Should I perform a pull request?

On Fri, Sep 12, 2014 at 12:14 PM, Arunkumar Srinivasan wrote:

I think you mean: dta[dtb, b:=b, by=.EACHI] and not .EACHI = TRUE. Not sure what's the use of nomatch=0L along with :=.

by=.EACHI does exactly what it means, really. It evaluates j for each i match. Let's first see the matches:

dta[dtb, which=TRUE]
# [1] 1 1 3

So, first row of dtb matches with first of dta. The second of dtb matches with 1st of dta and so on.

When you add by=.EACHI, as shown on the top, j-expression is evaluated on each of these matches. So, it'll be evaluated 3-times here. On the other hand, without it, j is evaluated once. In this case, it doesn't make a difference either way. So you should avoid by=.EACHI, as it'll be slower with it.

It's particularly useful when you'd like to perform operations in j, that depends on the values in j on that group. For example, consider these data.tables dt1 and dt2:

dt1 = data.table(x=rep(1:4, each=2), y=1:8, key="x")
dt2 = data.table(x=3:5, z=10, key="x")

And, you'd like to get sum(y)*z while joining.. If not for the by=.EACHI feature.. you'd approach the problem like this:

dt1[dt2][, list(agg = sum(y)*z[1]), by=x]

With by=.EACHI, this is simply:

dt1[dt2, list(agg=sum(y)*z), by=.EACHI]

Here, your expression is evaluated on each i.

Another interesting use case is, say, you'd like to create a lagged vector of y:

dt1[dt2, list(y=y, lagy = c(NA, head(y,-1)), z=z), by=.EACHI]

It's that simple.. really. Basically, as long as the operation you're performing in j affects it depending on whether j is executed for that group or as a whole, then you're most likely looking for by=.EACHI. If not, by=.EACHI has no effect, and therefore you're wanting to use a normal join there..

This is not a text book definition, rather my understanding of this awesome feature!

Hope this helps.

Arun

From: Juan Manuel Truppia
Reply: Juan Manuel Truppia
Date: September 11, 2014 at 10:16:41 PM
To: datatable-help at lists.r-forge.r-project.org
Subject:
[datatable-help] Update table from other table What is the best data.table way of doing something similar to UPDATE FROM in SQL? I used to do something like dta = data.table(idx = c(1, 2, 3), a = runif(3), key = "idx") dtb = data.table(idx = c(1, 3), b = runif(3), key = "idx") dta[dtb, b := b] However, after the 1.9.3 and the explicit .EACHI, it fails sometimes, but I can't determine when. So, just to be sure, I do? dta[dtb, b := b, .EACHI = TRUE, nomatch = 0] Is the .EACHI and the nomatch necessary? In this case, I want the row with idx 1 and 3 (the matching ones) to end with a b value from the matching b column in dtb, and the row with idx 2 (the one that isn't in dtb) to end up with NA in column b. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtruppia at gmail.com Thu Sep 18 18:14:19 2014 From: jmtruppia at gmail.com (Juan Manuel Truppia) Date: Thu, 18 Sep 2014 13:14:19 -0300 Subject: [datatable-help] NA in joins Message-ID: Hi, this must have been discussed before, but I couldn't find anything. In my opinion, NA shouldn't join with anything, including other NA (as to mirror what we expect from SQL, where NULL doesn't join with NULL). However, with data.table, NA matches other NA. I.e, this should return an empty data.table data.table(idx = NA_real_, key = "idx")[data.table(idx = NA_real_, val = "a", key = "idx"), nomatch = 0] Let's assume that we can't change this behavior, would it be possible to add a parameter to avoid NA matching NA in [.data.table and merge? -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From aragorn168b at gmail.com Thu Sep 18 21:00:56 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 18 Sep 2014 21:00:56 +0200
Subject: [datatable-help] NA in joins
In-Reply-To:
References:
Message-ID:

In base R `NA` matches `NA` alone, and `NaN` matches `NaN` alone:

match(NA, c(1:5, NA))
# [1] 6

data.table matches, through binary search, by design, in the same way. And in `?match`, there's this line: "Exactly what matches what is to some extent a matter of definition." In some operations it may not make sense. But, by design, we do consider Inf = Inf, -Inf = -Inf, NaN = NaN and NA = NA always. Do you think it'd help to state this explicitly in `?data.table`?

Arun

From: Juan Manuel Truppia
Reply: Juan Manuel Truppia
Date: September 18, 2014 at 6:14:56 PM
To: datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] NA in joins

Hi, this must have been discussed before, but I couldn't find anything.

In my opinion, NA shouldn't join with anything, including other NA (as to mirror what we expect from SQL, where NULL doesn't join with NULL).

However, with data.table, NA matches other NA.

I.e, this should return an empty data.table

data.table(idx = NA_real_, key = "idx")[data.table(idx = NA_real_, val = "a", key = "idx"), nomatch = 0]

Let's assume that we can't change this behavior, would it be possible to add a parameter to avoid NA matching NA in [.data.table and merge?

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jmtruppia at gmail.com Thu Sep 18 21:14:22 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Thu, 18 Sep 2014 16:14:22 -0300
Subject: [datatable-help] NA in joins
In-Reply-To:
References:

Message-ID: It might help, specially where data.table is compared to SQL. However, I think that having merge (and maybe [.data.table) have an argument to avoid NA matching. Is there a FR already created for this? I can create it otherwise On Thu, Sep 18, 2014 at 4:00 PM, Arunkumar Srinivasan wrote: > In base R `NA` matches `NA` alone, and `NaN` matches `NaN` alone: > match(NA, c(1:5, NA)) > # [1] 6 > > data.table matches, through binary search, by design, in the same way. And > in `?match`, there's this line: "Exactly what matches what is to some > extent a matter of definition." In some operations it may not make sense. > But, by design, we do consider Inf = Inf, -Inf = -Inf, NaN = NaN and NA = > NA always. Do you think it'd help tp state this explicitly in `?data.table`? > > > Arun > > From: Juan Manuel Truppia > Reply: Juan Manuel Truppia > > Date: September 18, 2014 at 6:14:56 PM > To: datatable-help at lists.r-forge.r-project.org > > > > Subject: [datatable-help] NA in joins > > Hi, this must have been discussed before, but I couldn't find anything. > > In my opinion, NA shouldn't join with anything, including other NA (as to > mirror what we expect from SQL, where NULL doesn't join with NULL). > > However, with data.table, NA matches other NA. > > I.e, this should return an empty data.table > > data.table(idx = NA_real_, key = "idx")[data.table(idx = NA_real_, val = > "a", key = "idx"), nomatch = 0] > > Let's assume that we can't change this behavior, would it be possible to > add a parameter to avoid NA matching NA in [.data.table and merge? > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aragorn168b at gmail.com Thu Sep 18 21:34:00 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 18 Sep 2014 21:34:00 +0200 Subject: [datatable-help] NA in joins In-Reply-To: References:

Message-ID: Thanks. It'd also be great if you could add an issue for adding the documentation. On NA non-matching, yes you could add an FR, there isn't one to my recollection. However much of this year has been spent on internal order and binary search in tweaking quite a lot of things. So I'd not be surprised if it is not attended to anytime soon. Arun From:?Juan Manuel Truppia Reply:?Juan Manuel Truppia > Date:?September 18, 2014 at 9:14:42 PM To:?Arunkumar Srinivasan > Cc:?datatable-help at lists.r-forge.r-project.org > Subject:? Re: [datatable-help] NA in joins It might help, specially where data.table is compared to SQL. However, I think that having merge (and maybe [.data.table) have an argument to avoid NA matching. Is there a FR already created for this? I can create it otherwise On Thu, Sep 18, 2014 at 4:00 PM, Arunkumar Srinivasan wrote: In base R `NA` matches `NA` alone, and `NaN` matches `NaN` alone: match(NA, c(1:5, NA)) # [1] 6 data.table?matches, through binary search, by design, in the same way.?And in `?match`, there's this line: "Exactly what matches what is to some extent a matter of definition." In some operations it may not make sense. But, by design, we do consider Inf = Inf, -Inf = -Inf, NaN = NaN and NA = NA always. Do you think it'd help tp state this explicitly in `?data.table`? Arun From:?Juan Manuel Truppia Reply:?Juan Manuel Truppia > Date:?September 18, 2014 at 6:14:56 PM To:?datatable-help at lists.r-forge.r-project.org > Subject:? [datatable-help] NA in joins Hi, this must have been discussed before, but I couldn't find anything. In my opinion, NA shouldn't join with anything, including other NA (as to mirror what we expect from SQL, where NULL doesn't join with NULL). However, with data.table, NA matches other NA. 
I.e, this should return an empty data.table data.table(idx = NA_real_, key = "idx")[data.table(idx = NA_real_, val = "a", key = "idx"), nomatch = 0] Let's assume that we can't change this behavior, would it be possible to add a parameter to avoid NA matching NA in [.data.table and merge? _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtruppia at gmail.com Thu Sep 18 22:00:46 2014 From: jmtruppia at gmail.com (Juan Manuel Truppia) Date: Thu, 18 Sep 2014 17:00:46 -0300 Subject: [datatable-help] NA in joins In-Reply-To: References:

Message-ID: 818 and 819 created On Thu, Sep 18, 2014 at 4:34 PM, Arunkumar Srinivasan wrote: > Thanks. It'd also be great if you could add an issue for adding the > documentation. > On NA non-matching, yes you could add an FR, there isn't one to my > recollection. However much of this year has been spent on internal order > and binary search in tweaking quite a lot of things. So I'd not be > surprised if it is not attended to anytime soon. > > Arun > > From: Juan Manuel Truppia > Reply: Juan Manuel Truppia > > Date: September 18, 2014 at 9:14:42 PM > To: Arunkumar Srinivasan > > Cc: datatable-help at lists.r-forge.r-project.org > > > > Subject: Re: [datatable-help] NA in joins > > It might help, specially where data.table is compared to SQL. However, I > think that having merge (and maybe [.data.table) have an argument to avoid > NA matching. Is there a FR already created for this? I can create it > otherwise > > On Thu, Sep 18, 2014 at 4:00 PM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> In base R `NA` matches `NA` alone, and `NaN` matches `NaN` alone: >> match(NA, c(1:5, NA)) >> # [1] 6 >> >> data.table matches, through binary search, by design, in the same way. And >> in `?match`, there's this line: "Exactly what matches what is to some >> extent a matter of definition." In some operations it may not make sense. >> But, by design, we do consider Inf = Inf, -Inf = -Inf, NaN = NaN and NA = >> NA always. Do you think it'd help tp state this explicitly in `?data.table`? >> >> >> Arun >> >> From: Juan Manuel Truppia >> Reply: Juan Manuel Truppia > >> Date: September 18, 2014 at 6:14:56 PM >> To: datatable-help at lists.r-forge.r-project.org >> > >> >> Subject: [datatable-help] NA in joins >> >> Hi, this must have been discussed before, but I couldn't find >> anything. >> >> In my opinion, NA shouldn't join with anything, including other NA (as to >> mirror what we expect from SQL, where NULL doesn't join with NULL). 
>> >> However, with data.table, NA matches other NA. >> >> I.e, this should return an empty data.table >> >> data.table(idx = NA_real_, key = "idx")[data.table(idx = NA_real_, val = >> "a", key = "idx"), nomatch = 0] >> >> Let's assume that we can't change this behavior, would it be possible to >> add a parameter to avoid NA matching NA in [.data.table and merge? >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu Sep 18 22:01:41 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 18 Sep 2014 22:01:41 +0200 Subject: [datatable-help] NA in joins In-Reply-To: References:

Message-ID: Awesome, thanks! Have added tags to them. Arun From:?Juan Manuel Truppia Reply:?Juan Manuel Truppia > Date:?September 18, 2014 at 10:01:06 PM To:?Arunkumar Srinivasan > Cc:?datatable-help at lists.r-forge.r-project.org > Subject:? Re: [datatable-help] NA in joins 818 and 819 created On Thu, Sep 18, 2014 at 4:34 PM, Arunkumar Srinivasan wrote: Thanks. It'd also be great if you could add an issue for adding the documentation. On NA non-matching, yes you could add an FR, there isn't one to my recollection. However much of this year has been spent on internal order and binary search in tweaking quite a lot of things. So I'd not be surprised if it is not attended to anytime soon. Arun From:?Juan Manuel Truppia Reply:?Juan Manuel Truppia > Date:?September 18, 2014 at 9:14:42 PM To:?Arunkumar Srinivasan > Cc:?datatable-help at lists.r-forge.r-project.org > Subject:? Re: [datatable-help] NA in joins It might help, specially where data.table is compared to SQL. However, I think that having merge (and maybe [.data.table) have an argument to avoid NA matching. Is there a FR already created for this? I can create it otherwise On Thu, Sep 18, 2014 at 4:00 PM, Arunkumar Srinivasan wrote: In base R `NA` matches `NA` alone, and `NaN` matches `NaN` alone: match(NA, c(1:5, NA)) # [1] 6 data.table?matches, through binary search, by design, in the same way.?And in `?match`, there's this line: "Exactly what matches what is to some extent a matter of definition." In some operations it may not make sense. But, by design, we do consider Inf = Inf, -Inf = -Inf, NaN = NaN and NA = NA always. Do you think it'd help tp state this explicitly in `?data.table`? Arun From:?Juan Manuel Truppia Reply:?Juan Manuel Truppia > Date:?September 18, 2014 at 6:14:56 PM To:?datatable-help at lists.r-forge.r-project.org > Subject:? [datatable-help] NA in joins Hi, this must have been discussed before, but I couldn't find anything. 
In my opinion, NA shouldn't join with anything, including other NA (as to mirror what we expect from SQL, where NULL doesn't join with NULL). However, with data.table, NA matches other NA. I.e, this should return an empty data.table data.table(idx = NA_real_, key = "idx")[data.table(idx = NA_real_, val = "a", key = "idx"), nomatch = 0] Let's assume that we can't change this behavior, would it be possible to add a parameter to avoid NA matching NA in [.data.table and merge? _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin.dunelm at gmail.com Fri Sep 19 09:32:02 2014 From: martin.dunelm at gmail.com (Martin Watts) Date: Fri, 19 Sep 2014 08:32:02 +0100 Subject: [datatable-help] Replace missings by dates in data table In-Reply-To: References:

Message-ID:

You can also try:

DT[,start:=as.Date(start)][,end:=as.Date(replace(end,is.na(end), "2000-01-01"))]

If you want to modify the original data table.

On 17 September 2014 08:58, Frank S. wrote:

> Gabor, many thanks for your answer!!
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From f_j_rod at hotmail.com Mon Sep 22 12:50:00 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 22 Sep 2014 12:50:00 +0200
Subject: [datatable-help] Replace missings by dates in data table
In-Reply-To:
References:
Message-ID:

Thank you Martin!!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From f_j_rod at hotmail.com Mon Sep 22 13:16:06 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 22 Sep 2014 13:16:06 +0200
Subject: [datatable-help] Two short questions about the operation of data table
Message-ID:

Hello to everyone,

Let's consider, just by way of example, the following date and data table:

opening <- as.Date("1990-01-01")
DT <- data.table(ID=c(1,2,3),
    start=c("1985-01-01","1993-07-15","1993-05-17"),
    end=c("1992-05-01","1997-02-25","2002-01-01"),
    value=c(7.8, 3.2, 20.0))

FIRST QUESTION:

If I execute:

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result}, by=ID]

Why can I not keep the column names?

   ID                       t.dif
1:  1 1985-01-01 1992-05-01   7.3  7.8
2:  2 1993-07-15 1997-02-25  20.0  3.2
3:  3 1993-05-17 2002-01-01  20.0 20.0

SECOND QUESTION:

I would want to remove rows where t.dif=value in the final result.
Then, I tried:

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result[!(t.dif == value)]}, by=ID]

But R does not find the variable t.dif !!

Thank you for your time to all of the members of the list!!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From f_j_rod at hotmail.com Mon Sep 22 13:28:30 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 22 Sep 2014 13:28:30 +0200
Subject: [datatable-help] Two short questions about the operation of data table
In-Reply-To:
References:
Message-ID:

Excuse me,

One comment on my question: start and end variables are dates:

DT <- data.table(ID=c(1,2,3),
    start=as.Date(c("1985-01-01","1993-07-15","1993-05-17")),
    end=as.Date(c("1992-05-01","1997-02-25","2002-01-01")),
    value=c(7.8, 3.2, 20))

Thanks!!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From aragorn168b at gmail.com Mon Sep 22 13:32:02 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 22 Sep 2014 13:32:02 +0200
Subject: [datatable-help] Two short questions about the operation of data table
In-Reply-To:
References:
Message-ID:

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result}, by=ID]

Why can I not keep the column names?

When we do:

DT[, list(x, y), by=z]

the j-expression returns an unnamed list. But we understand it as a straightforward list() scenario and extract the symbols and assign them as names in output/result. But what happens within { ... } is more complicated and therefore is hard to extract the names to set names in result.
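
To illustrate the point about names inside { ... }, here is a minimal sketch of one possible workaround, not something stated in the thread: name every element of the returned list explicitly, so that nothing has to be inferred. It assumes data.table is installed and uses the Date version of DT; passing units = "days" to difftime() and wrapping the result in as.numeric() are additions made here to keep the column type the same in every group.

```r
library(data.table)

opening <- as.Date("1990-01-01")
DT <- data.table(ID    = c(1, 2, 3),
                 start = as.Date(c("1985-01-01", "1993-07-15", "1993-05-17")),
                 end   = as.Date(c("1992-05-01", "1997-02-25", "2002-01-01")),
                 value = c(7.8, 3.2, 20.0))

DTNEW <- DT[, {
  if (all(start <= opening)) {
    # elapsed time in years, rounded to one decimal
    t.dif <- round(difftime(end, start, units = "days") / 365.25, 1)
  } else {
    t.dif <- 20
  }
  # every element is named, so the result keeps the column names
  list(start = start, end = end, t.dif = as.numeric(t.dif), value = value)
}, by = ID]

names(DTNEW)
```

With the names written out, the result should carry the columns ID, start, end, t.dif and value rather than blank or auto-generated names.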

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result[!(t.dif == value)]}, by=ID]

But R does not find the variable t.dif !!

But result is a list. You should be doing result[!result$t.dif == value]. That is:

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result[!result$t.dif == value]}, by=ID]

Arun

From: Frank S.
Reply: Frank S.
Date: September 22, 2014 at 1:16:48 PM
To: datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] Two short questions about the operation of data table

Hello to everyone,

Let's consider, just by way of example, the following date and data table:

opening <- as.Date("1990-01-01")
DT <- data.table(ID=c(1,2,3),
    start=c("1985-01-01","1993-07-15","1993-05-17"),
    end=c("1992-05-01","1997-02-25","2002-01-01"),
    value=c(7.8, 3.2, 20.0))

FIRST QUESTION:

If I execute:

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result}, by=ID]

Why can I not keep the column names?

   ID                       t.dif
1:  1 1985-01-01 1992-05-01   7.3  7.8
2:  2 1993-07-15 1997-02-25  20.0  3.2
3:  3 1993-05-17 2002-01-01  20.0 20.0

SECOND QUESTION:

I would want to remove rows where t.dif=value in the final result. Then, I tried:

DTNEW <- DT[ , {
   if (all(start <= opening)){
     result <- list(start, end, t.dif= unclass(round(difftime(end, start)/365.25,1)), value)
     } else {
     result <- list(start, end, t.dif= 20, value)
     }
   result[!(t.dif == value)]}, by=ID]

But R does not find the variable t.dif !!

Thank you for your time to all of the members of the list!!

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From lists at sigg-iten.ch Mon Sep 22 14:01:02 2014
From: lists at sigg-iten.ch (Christian Sigg)
Date: Mon, 22 Sep 2014 14:01:02 +0200
Subject: [datatable-help] Stability of Duplicate Column Renaming in Version 1.9.3
Message-ID: <54200F7E.1070404@sigg-iten.ch>

We are evaluating data.table version 1.9.3 to benefit from a number of bug fixes. But one significant change for us is the new feature 5 in the changelog:

"X[Y] now names non-join columns from i (...) with an i. prefix (...)"

The new naming scheme implies a substantial number of changes in our code. Can this feature be considered "stable", i.e. can we already migrate our code? Or is there the possibility that the naming scheme might change again until the release of 1.9.4?

Thanks,
Christian

From f_j_rod at hotmail.com Mon Sep 22 15:20:01 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 22 Sep 2014 15:20:01 +0200
Subject: [datatable-help] Two short questions about the operation of data table
In-Reply-To:
References:
Message-ID:

Thank you Arunkumar for your quick answer!

Best regards,

Frank S.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From ozias_hounkpatin at yahoo.fr Wed Sep 24 08:55:20 2014
From: ozias_hounkpatin at yahoo.fr (Hounkpatin Ozias)
Date: Wed, 24 Sep 2014 07:55:20 +0100
Subject: [datatable-help] Random Forest
Message-ID: <1411541720.64934.YahooMailNeo@web172406.mail.ir2.yahoo.com>

Hello to everyone,

I am interested in doing some iteration in using Random Forest (classification purpose) with different value of mtry= 2, 4, 6, 9, 12. I want to repeat each run 100 times.
that is with mtry=2 for example, run it 100 times. As output, I would like to have the aggregate out of bag errors (total means over the 100 runs) as well as well as the variable importance based on this 100 runs aggregated over their OBB errors. One could try it one by one, report each value but it is very laborious. Is there anyway to have R run the Random Forest 100 times, and give me as output the resulting (aggregated means) OOB errors and variable importance. here was my code. r2 <- randomForest(Factor ~ ., data=tr, nodesize = 1,ntree=1000, importance=TRUE, proximity=TRUE, mtry=2) #I want a 100 run of this #get average OOB errors #get variable importance based on these aggregated OBB errors. Thank you very much. -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Wed Sep 24 09:53:55 2014 From: my.r.help at gmail.com (Michael Smith) Date: Wed, 24 Sep 2014 15:53:55 +0800 Subject: [datatable-help] Random Forest In-Reply-To: <1411541720.64934.YahooMailNeo@web172406.mail.ir2.yahoo.com> References: <1411541720.64934.YahooMailNeo@web172406.mail.ir2.yahoo.com> Message-ID: <54227893.6060700@gmail.com> Not sure whether I understand you correctly (and whether this is even a data.table question), but maybe you are looking for `replicate`? M On 09/24/2014 02:55 PM, Hounkpatin Ozias wrote: > Hello to everyone, > > I am interested in doing some iteration in using Random Forest > (classification purpose) with different value of mtry= 2, 4, 6, 9, 12. I > want to repeat each run 100 times. that is with mtry=2 for example, run > it 100 times. As output, I would like to have the aggregate out of bag > errors (total means over the 100 runs) as well as well as the variable > importance based on this 100 runs aggregated over their OBB errors. One > could try it one by one, report each value but it is very laborious. 
Is > there anyway to have R run the Random Forest 100 times, and give me as > output the resulting (aggregated means) OOB errors and variable > importance. here was my code. > > r2 <- randomForest(Factor ~ ., data=tr, nodesize = 1,ntree=1000, > importance=TRUE, proximity=TRUE, mtry=2) > #I want a 100 run of this > #get average OOB errors > #get variable importance based on these aggregated OBB errors. > > Thank you very much. > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From my.r.help at gmail.com Wed Sep 24 14:34:56 2014 From: my.r.help at gmail.com (Michael Smith) Date: Wed, 24 Sep 2014 20:34:56 +0800 Subject: [datatable-help] Random Forest In-Reply-To: <1411548144.93879.YahooMailNeo@web172402.mail.ir2.yahoo.com> References: <1411541720.64934.YahooMailNeo@web172406.mail.ir2.yahoo.com> <54227893.6060700@gmail.com> <1411548144.93879.YahooMailNeo@web172402.mail.ir2.yahoo.com> Message-ID: <5422BA70.2050607@gmail.com> Since you're asking where else to post, I would suggest the R-help mailing list for your particular case. (And be sure to read the posting guide before sending.) If you have a question about a specific package, you could also contact the author/maintainer of that package (but only after you have "done your homework" and searched around without finding a solution). And please reply cc to the list, even if you're replying to my email, otherwise it won't get archived and other people cannot contribute. M On 09/24/2014 04:42 PM, Hounkpatin Ozias wrote: > Hi Smith, > I looked through the replicate function. you got my question right > actually. I have just tried it using the code: > >>rep<-replicate(100, randomForest(RSG ~ ., data=tr,nodesize = > 1,ntree=1000,importance=TRUE, proximity=TRUE, mtry=2)) >>rep > I got the result below 100 times. 
> > [,1]......................................................................................... > [,100] > call Expression > type "classification" > predicted factor,890 > err.rate Numeric,6000 > confusion Numeric,30 > votes Numeric,4450 > oob.times Numeric,890 > classes Character,5 > importance Numeric,126 > importanceSD Numeric,108 > localImportance NULL > proximity Numeric,792100 > ntree 1000 > mtry 2 > forest List,14 > y factor,890 > test NULL > inbag NULL > terms Expression > Now the question remains for me to get the OBB errors and variable > importance, not each run, but considering the means of the 100 runs. > When I call the following output after replicate, I got only NULL. >> rep1$err.rate > NULL >> > rep1$ntree > NULL >> rep1$mtry > NULL >> rep1$y > NULL > I am new in R and new also in this data list. if my post does not fit > the purpose of the data list, I will appreciate if anyone could direct > me to a better platform dealing with this issue. > Thanks. > > > Le Mercredi 24 septembre 2014 8h53, Michael Smith > a ?crit : > > > Not sure whether I understand you correctly (and whether this is even a > data.table question), but maybe you are looking for `replicate`? > > M > > > On 09/24/2014 02:55 PM, Hounkpatin Ozias wrote: >> Hello to everyone, >> >> I am interested in doing some iteration in using Random Forest >> (classification purpose) with different value of mtry= 2, 4, 6, 9, 12. I >> want to repeat each run 100 times. that is with mtry=2 for example, run >> it 100 times. As output, I would like to have the aggregate out of bag >> errors (total means over the 100 runs) as well as well as the variable >> importance based on this 100 runs aggregated over their OBB errors. One >> could try it one by one, report each value but it is very laborious. Is >> there anyway to have R run the Random Forest 100 times, and give me as >> output the resulting (aggregated means) OOB errors and variable >> importance. here was my code. 
>> >> r2 <- randomForest(Factor ~ ., data=tr, nodesize = 1,ntree=1000, >> importance=TRUE, proximity=TRUE, mtry=2) >> #I want a 100 run of this >> #get average OOB errors >> #get variable importance based on these aggregated OBB errors. >> >> Thank you very much. >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >
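
A note on the `replicate` output quoted above: by default `replicate` simplifies its result, which here yields a list-matrix with one column per run, and `$err.rate` on such a matrix returns NULL. Keeping the fits in a plain list avoids this. The following is only a minimal sketch under stated assumptions: it needs the randomForest package, uses the built-in iris data as a stand-in for `tr`, and the names n_runs, fits, oob_final, mean_oob and mean_imp are illustrative rather than taken from the thread.

```r
library(randomForest)

set.seed(42)
n_runs <- 5   # the thread asks for 100 runs; kept small so the sketch runs quickly

# simplify = FALSE keeps each fit as a randomForest object in a plain list,
# so components such as $err.rate stay accessible
fits <- replicate(
  n_runs,
  randomForest(Species ~ ., data = iris, ntree = 200,
               importance = TRUE, mtry = 2),
  simplify = FALSE
)

# OOB error of each run, read off after the last tree
oob_final <- sapply(fits, function(rf) rf$err.rate[rf$ntree, "OOB"])
mean_oob  <- mean(oob_final)

# element-wise mean of the importance matrices over all runs
mean_imp <- Reduce(`+`, lapply(fits, importance)) / n_runs
```

To cover several `mtry` values, the same steps could be wrapped in a function of `mtry` and called once per value; whether a plain average of the importance matrices is the aggregation wanted in the original question is left open here.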