From mdowle at mdowle.plus.com Thu Oct 2 22:53:50 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Thu, 02 Oct 2014 21:53:50 +0100
Subject: [datatable-help] v1.9.4 is now on CRAN
Message-ID: <542DBB5E.5030905@mdowle.plus.com>

Available on Linux now or a few hours/tomorrow for Windows and Mac
binaries to make their way to all mirrors.

NEWS is the README on CRAN :
http://cran.r-project.org/web/packages/data.table/README.html
or the formatting may be easier to read on GitHub :
https://github.com/Rdatatable/data.table

The first two points deal with by=.EACHI and the option to revert to the
old behaviour should you need to. Of the 66 packages on CRAN or
Bioconductor, this only affected 3. So either by-without-by wasn't
getting much usage or it is but those packages don't have tests covering
that usage.

Have also updated FAQs 1.13 and 1.14 (a lot) regarding by=.EACHI :
http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf

Main new features: i) overlap joins and ii) automatic indexing; i.e.
DT[column==values] is now ok (just for one ==, currently).

New homepage :
https://github.com/Rdatatable/data.table/wiki
The old homepage on R-Forge will soon redirect to the new one (still
some links/content to move over).

Fingers crossed!
Matt

From harishv_99 at yahoo.com Tue Oct 14 14:13:54 2014
From: harishv_99 at yahoo.com (Harish)
Date: Tue, 14 Oct 2014 12:13:54 +0000 (UTC)
Subject: [datatable-help] Error in row filtering
Message-ID: <1652895759.38733.1413288834416.JavaMail.yahoo@jws10082.mail.ne1.yahoo.com>

I have a very strange row-filtering issue in front of me that I can only
reproduce on a very large data set. Let me start off by giving you the
end symptoms and then I will talk through some hacks which will avoid
the bug.
I have two fields of interest -- pred_bad_t_f and weight.
- pred_bad_t_f is of class "integer" with two unique values, 0 and 1
- weight is of class "numeric"

> dt[pred_bad_t_f == 1, sum(weight)]
[1] 6580818130
> dt[pred_bad_t_f == 1L, sum(weight)]
[1] 5414941720

As you can see, there is no reason for the second value to be any
different. I believe the first value is correct because slight changes
to the filtering logic generate that value repeatedly. Below are some
examples:

> dt[1:nrow(dt)][pred_bad_t_f == 1L, sum(weight)]
[1] 6580818130
> dt[TRUE & pred_bad_t_f == 1L, sum(weight)]
[1] 6580818130

From harishv_99 at yahoo.com Tue Oct 14 14:28:08 2014
From: harishv_99 at yahoo.com (Harish)
Date: Tue, 14 Oct 2014 12:28:08 +0000 (UTC)
Subject: [datatable-help] Error in row filtering
In-Reply-To: <1652895759.38733.1413288834416.JavaMail.yahoo@jws10082.mail.ne1.yahoo.com>
References: <1652895759.38733.1413288834416.JavaMail.yahoo@jws10082.mail.ne1.yahoo.com>
Message-ID: <1310139015.43023.1413289688423.JavaMail.yahoo@jws100150.mail.ne1.yahoo.com>

My sent-mail seems to show only a truncated version of my original
request. So let me summarize whatever got truncated.

My suspicion is that there is some issue with an optimization used when
there is an integer comparison, and that the optimization is being turned
off when the logic is more complex.

It would be great if someone could help me understand what the root cause
is so I can check where else this could be happening in my code. My fear
is that I do not know which other numbers I am getting might be incorrect.

Thanks a lot for your help.

Regards,
Harish

On Tuesday, October 14, 2014 5:13 AM, Harish wrote:

I have a very strange row-filtering issue in front of me that I can only
reproduce on a very large data set. Let me start off by giving you the
end symptoms and then I will talk through some hacks which will avoid
the bug.
I have two fields of interest -- pred_bad_t_f and weight.
- pred_bad_t_f is of class "integer" with two unique values, 0 and 1
- weight is of class "numeric"

> dt[pred_bad_t_f == 1, sum(weight)]
[1] 6580818130
> dt[pred_bad_t_f == 1L, sum(weight)]
[1] 5414941720

As you can see, there is no reason for the second value to be any
different. I believe the first value is correct because slight changes
to the filtering logic generate that value repeatedly. Below are some
examples:

> dt[1:nrow(dt)][pred_bad_t_f == 1L, sum(weight)]
[1] 6580818130
> dt[TRUE & pred_bad_t_f == 1L, sum(weight)]
[1] 6580818130

From f_j_rod at hotmail.com Wed Oct 22 19:19:05 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Wed, 22 Oct 2014 19:19:05 +0200
Subject: [datatable-help] Remove some data table rows based on three conditions
Message-ID:

Dear all,

I'm working with a large database in which some rows have identical id
and datep variables. Of these duplicated rows, I only want to keep the
row associated with the maximum value of the marker variable. As an
example:

DT <- data.table(
  id = rep(c(2,5),c(3,2)),
  datep = as.Date(c('1995-04-20','1995-04-20',
                    '1997-02-19', '1998-01-15','1998-01-15')),
  marker = c(2,8,5,7,5),
  group=rep(c("A","B"),c(3,2))
)

First, I sort by key variables: id, marker

DT[order(id,marker)]

But afterwards I've tried different things and I'm not able to get what
I want:

DT[!duplicated(DT[c('id', 'datep')])]
DT[ !(duplicated %chin% c('id','datep'))]
DT[ !(duplicated %in% c('id','datep'))]
DT[,!(duplicated(DT[c("id","datep")])), by=list(id,datep)]
unique(DT[c('id','datep')])

Please, does anyone know how to do it?

From caneff at gmail.com Wed Oct 22 19:38:23 2014
From: caneff at gmail.com (Chris Neff)
Date: Wed, 22 Oct 2014 17:38:23 +0000
Subject: [datatable-help] Remove some data table rows based on three conditions
References:
Message-ID:

DT[, .SD[which.max(marker),], by=.(id, datep)]

is what I would do.

On Wed Oct 22 2014 at 1:19:17 PM Frank S. wrote:

> Dear all,
> I'm working with a large database in which some rows have
> identical id and datep variables. Of these
> duplicated rows, I only want to keep the row associated with the maximum
> value of the marker variable. As an example:
> DT <- data.table(
>   id = rep(c(2,5),c(3,2)),
>   datep = as.Date(c('1995-04-20','1995-04-20',
>                     '1997-02-19', '1998-01-15','1998-01-15')),
>   marker = c(2,8,5,7,5),
>   group=rep(c("A","B"),c(3,2))
> )
> First, I sort by key variables: id, marker
> DT[order(id,marker)]
>
> But afterwards I've tried different things and I'm not able to get what I want:
> DT[!duplicated(DT[c('id', 'datep')])]
> DT[ !(duplicated %chin% c('id','datep'))]
> DT[ !(duplicated %in% c('id','datep'))]
> DT[,!(duplicated(DT[c("id","datep")])), by=list(id,datep)]
> unique(DT[c('id','datep')])
> Please, does anyone know how to do it?
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From mel at mbacou.com Wed Oct 22 22:00:47 2014
From: mel at mbacou.com (Bacou, Melanie)
Date: Wed, 22 Oct 2014 16:00:47 -0400
Subject: [datatable-help] Remove some data table rows based on three conditions
In-Reply-To:
References:
Message-ID: <54480CEF.2070204@mbacou.com>

Frank,

My understanding is unique() will remove all duplicated records in a
keyed data.table, keeping only the first occurrence. So if you sort your
data by decreasing marker, you should achieve what you're looking for.
Note that unique() also comes with a fromLast argument (FALSE by
default); set it to TRUE to keep the last occurrence of each duplicate
instead.

dt <- data.table(
  id = rep(c(2,5),c(3,2)),
  datep = as.Date(c('1995-04-20','1995-04-20','1997-02-19','1998-01-15','1998-01-15')),
  marker = c(2,8,5,7,5),
  group=rep(c("A","B"),c(3,2))
)
# sort so that the largest marker comes first within each (id, datep) pair
setorder(dt, id, datep, -marker)
# duplicates are judged on id and datep together, so pass both in by=
dt <- unique(dt, by = c("id", "datep"))

Mel.

On 10/22/2014 1:19 PM, Frank S. wrote:
> Dear all,
> I'm working with a large database in which some rows have
> identical id and datep variables. Of these
> duplicated rows, I only want to keep the row associated with the
> maximum value of the marker variable. As an example:
> DT <- data.table(
>   id = rep(c(2,5),c(3,2)),
>   datep = as.Date(c('1995-04-20','1995-04-20',
>                     '1997-02-19', '1998-01-15','1998-01-15')),
>   marker = c(2,8,5,7,5),
>   group=rep(c("A","B"),c(3,2))
> )
> First, I sort by key variables: id, marker
> DT[order(id,marker)]
>
> But afterwards I've tried different things and I'm not able to get what I
> want:
> DT[!duplicated(DT[c('id', 'datep')])]
> DT[ !(duplicated %chin% c('id','datep'))]
> DT[ !(duplicated %in% c('id','datep'))]
> DT[,!(duplicated(DT[c("id","datep")])), by=list(id,datep)]
> unique(DT[c('id','datep')])
> Please, does anyone know how to do it?

--
Melanie BACOU
International Food Policy Research Institute
Snr. Program Manager, HarvestChoice
Work +1(202)862-5699
E-mail m.bacou at cgiar.org
Visit www.harvestchoice.org

From f_j_rod at hotmail.com Thu Oct 23 11:30:04 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Thu, 23 Oct 2014 11:30:04 +0200
Subject: [datatable-help] Remove some data table rows based on three conditions
In-Reply-To: <54480CEF.2070204@mbacou.com>
References: , <54480CEF.2070204@mbacou.com>
Message-ID:

Chris and Melanie, many thanks for your quick answers!!
It was what I needed!

From f_j_rod at hotmail.com Fri Oct 24 19:09:37 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Fri, 24 Oct 2014 19:09:37 +0200
Subject: [datatable-help] Sistematically construction of 2 variables in data table
Message-ID:

Dear all,

I'm writing to you because, in a long-format data table, I'm not able to
construct 2 new variables based on other columns for each id. Small
example with only one subject:

dt <- data.table(
  id = rep(1,7),
  bday = rep(as.Date("1960-10-29"),7),
  start = rep(as.Date("2005-02-27"),7),
  marker0 = rep(125,7),
  datep = as.Date(c('2005-04-20','2005-10-28','2005-12-31','2006-08-10','2006-12-31','2007-02-19','2007-05-15')),
  marker = c(10,2,0,5,3,7,1)
)

I would like to systematically construct a new data table that meets
three conditions (I already have the first):

1) It only keeps rows whose datep variable is 31st December or is the
last date within id (I can get it)

newdt <- unique(rbind(dt[which(month(datep)==12 & as.POSIXlt(datep)$mday==31)],
                      dt[, .SD[.N], by='id'])[,list(id,init=0,marker0,sum=0,dsum=datep)])[order(id,dsum)]

   id init marker0 sum       dsum
1:  1    0     125   0 2005-12-31
2:  1    0     125   0 2006-12-31
3:  1    0     125   0 2007-05-15

2) It has a so-called init variable, whose value is defined in the 1st
row as the difference (in years) between start and bday, in the 2nd row
as the difference between dsum of the 1st row and bday, and finally in
the 3rd row as the difference between dsum of the 2nd row and bday:

   id       bday init                                          marker0 sum       dsum
1:  1 1960-10-29 difftime(start, bday)/365.25                      125   0 2005-12-31
2:  1 1960-10-29 difftime(as.Date("2005-12-31"), bday)/365.25      125   0 2006-12-31
3:  1 1960-10-29 difftime(as.Date("2006-12-31"), bday)/365.25      125   0 2007-05-15

3) It has
also a sum variable, whose value is defined as follows:
3a) For the first row within each id: the corresponding marker0 value.
3b) For each of the following rows within id: the previous sum value plus
the sum of marker's values across the previous year.

   id init marker0          sum       dsum
1:  1 44.3     125          125 2005-12-31
2:  1 45.2     125 125+10+2=137 2006-12-31
3:  1 46.2     125  137+5+3=145 2007-05-15

Thanks to all the R data.table help community for your continuous help!

From my.r.help at gmail.com Sat Oct 25 03:12:14 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 25 Oct 2014 09:12:14 +0800
Subject: [datatable-help] Sistematically construction of 2 variables in data table
In-Reply-To:
References:
Message-ID: <544AF8EE.5050103@gmail.com>

Would help if you add dsum to your example data for reproducibility.

On 10/25/2014 01:09 AM, Frank S. wrote:
> Dear all,
> I'm writing to you because, in a long-format data table, I'm not able
> to construct 2 new variables
> based on other columns for each id.
> Small example with only one subject:
>
> dt <- data.table(
>   id = rep(1,7),
>   bday = rep(as.Date("1960-10-29"),7),
>   start = rep(as.Date("2005-02-27"),7),
>   marker0 = rep(125,7),
>   datep = as.Date(c('2005-04-20','2005-10-28','2005-12-31','2006-08-10','2006-12-31','2007-02-19','2007-05-15')),
>   marker = c(10,2,0,5,3,7,1)
> )
>
> I would like to systematically construct a new data table that meets
> three conditions (I already have the first):
>
> 1) It only keeps rows whose datep variable is 31st December or is the
> last date within id (I can get it)
> newdt <- unique(rbind(dt[which(month(datep)==12 & as.POSIXlt(datep)$mday==31)],
>                       dt[, .SD[.N], by='id'])[,list(id,init=0,marker0,sum=0,dsum=datep)])[order(id,dsum)]
>
>    id init marker0 sum       dsum
> 1:  1    0     125   0 2005-12-31
> 2:  1    0     125   0 2006-12-31
> 3:  1    0     125   0 2007-05-15
>
> 2) It has a so-called init variable, whose value is defined in the 1st row
> as the difference (in years) between
> start and bday, in the 2nd row as the difference between dsum of the 1st
> row and bday, and finally in the 3rd
> row as the difference between dsum of the 2nd row and bday:
>
>    id       bday init                                          marker0 sum       dsum
> 1:  1 1960-10-29 difftime(start, bday)/365.25                      125   0 2005-12-31
> 2:  1 1960-10-29 difftime(as.Date("2005-12-31"), bday)/365.25      125   0 2006-12-31
> 3:  1 1960-10-29 difftime(as.Date("2006-12-31"), bday)/365.25      125   0 2007-05-15
>
> 3) It has also a sum variable, whose value is defined as follows:
> 3a) For the first row within each id: the corresponding marker0 value.
> 3b) For each of the following rows within id: the previous sum value plus
> the sum of marker's values across the previous year.
>
>    id init marker0          sum       dsum
> 1:  1 44.3     125          125 2005-12-31
> 2:  1 45.2     125 125+10+2=137 2006-12-31
> 3:  1 46.2     125  137+5+3=145 2007-05-15
>
> Thanks to all the R data.table help community for your continuous help!

From f_j_rod at hotmail.com Sat Oct 25 13:43:10 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Sat, 25 Oct 2014 13:43:10 +0200
Subject: [datatable-help] Sistematically construction of 2 variables in data table
In-Reply-To: <544AF8EE.5050103@gmail.com>
References: , <544AF8EE.5050103@gmail.com>
Message-ID:

Hi Michael,

The dsum variable is derived from datep, which is contained in the
original data table I gave, called dt. Thus, dsum just keeps those dates
in the datep variable which correspond to 31st December or to the last
record within each id.

Please let me know if you have any further questions. I'm aware that the
question is not easy.

Regards

From my.r.help at gmail.com Sat Oct 25 15:22:41 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 25 Oct 2014 21:22:41 +0800
Subject: [datatable-help] Sistematically construction of 2 variables in data table
In-Reply-To:
References: , <544AF8EE.5050103@gmail.com>
Message-ID: <544BA421.3010703@gmail.com>

Sorry, I skipped that part.

On 10/25/2014 07:43 PM, Frank S. wrote:
> Hi Michael,
>
> The dsum variable is derived from datep, which is contained in the
> original data table I gave, called dt. Thus,
> dsum just keeps those dates in the datep variable which
> correspond to 31st December or to the
> last record within each id.
>
> Please let me know if you have any further questions. I'm aware that the
> question is not easy.
>
> Regards

From f_j_rod at hotmail.com Mon Oct 27 14:06:33 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 27 Oct 2014 14:06:33 +0100
Subject: [datatable-help] Sistematically construction of 2 variables in data table
In-Reply-To: <544BA421.3010703@gmail.com>
References: , <544AF8EE.5050103@gmail.com> ,<544BA421.3010703@gmail.com>
Message-ID:

Hi all,

I have been able to solve (I think) the second question:

newdt[, init := c( round(difftime(start[1], bday[1])/365.25,1) ,
                   round(difftime(dsum[1:(.N-1)],bday[1:(.N-1)])/365.25,1) )][]  # [] is to print the result