From mdowle at mdowle.plus.com  Thu Oct  2 22:53:50 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Thu, 02 Oct 2014 21:53:50 +0100
Subject: [datatable-help] v1.9.4 is now on CRAN
Message-ID: <542DBB5E.5030905@mdowle.plus.com>


Available on Linux now. The Windows and Mac binaries will take a few 
hours, or until tomorrow, to make their way to all mirrors.

NEWS is the README on CRAN :

     http://cran.r-project.org/web/packages/data.table/README.html

or the formatting may be easier to read on GitHub :

     https://github.com/Rdatatable/data.table

The first two points deal with by=.EACHI and the option to revert to the 
old behaviour should you need to.  Of the 66 packages on CRAN or 
Bioconductor, this only affected 3. So either by-without-by wasn't 
getting much usage, or it was but those packages don't have tests 
covering that usage.
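
A minimal sketch of what the by=.EACHI change means in practice. The table and values are made up for illustration and are not taken from NEWS; the comments describe the behaviour from v1.9.4 onwards.

```r
library(data.table)

# Toy keyed table (hypothetical data, for illustration only)
DT <- data.table(id = c("a", "a", "b", "b", "c"), v = 1:5)
setkey(DT, id)

# Join on the key without by: j is evaluated once over all matching rows
total <- DT[.(c("a", "b")), sum(v)]

# Same join with by = .EACHI: j is evaluated once per row of i,
# which is what by-without-by used to do implicitly
each <- DT[.(c("a", "b")), sum(v), by = .EACHI]

print(total)
print(each)
```

Code that relied on the implicit grouping needs the explicit by = .EACHI to keep one result per row of i.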

Have also updated FAQs 1.13 and 1.14 (a lot) regarding by=.EACHI :

 
http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf

Main new features: i) overlap joins and ii) automatic indexing; i.e. 
DT[column==values] is now ok (just for one ==, currently).
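
A short sketch of the two features on toy data. The column names and values are assumptions for illustration; foverlaps() is the overlap-join function, and the y table must be keyed on its interval columns.

```r
library(data.table)

# i) A single == in i, with no setkey() needed first
DT <- data.table(grp = c("x", "y", "x", "z"), val = c(10, 20, 30, 40))
res <- DT[grp == "x"]

# ii) Overlap join: which intervals in x overlap which intervals in y?
x <- data.table(start = c(1L, 5L, 10L), end = c(3L, 8L, 12L))
y <- data.table(start = c(2L, 11L), end = c(6L, 11L))
setkey(y, start, end)  # foverlaps() needs y keyed on the interval columns

# type = "any" matches any overlap; nomatch = 0L drops x rows with no match
ov <- foverlaps(x, y, type = "any", nomatch = 0L)

print(res)
print(ov)
```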

New homepage :

     https://github.com/Rdatatable/data.table/wiki

The old homepage on R-Forge will soon redirect to the new one (still 
some links/content to move over).

Fingers crossed!

Matt


From harishv_99 at yahoo.com  Tue Oct 14 14:13:54 2014
From: harishv_99 at yahoo.com (Harish)
Date: Tue, 14 Oct 2014 12:13:54 +0000 (UTC)
Subject: [datatable-help] Error in row filtering
Message-ID: <1652895759.38733.1413288834416.JavaMail.yahoo@jws10082.mail.ne1.yahoo.com>

I have a very strange row-filtering issue in front of me that I can only reproduce on a very large data set. Let me start off by giving you the end symptoms and then I will talk through some hacks which will avoid the bug.

I have two fields of interest -- pred_bad_t_f and weight.
- pred_bad_t_f is of class "integer" with two unique values, 0 and 1
- weight is of class "numeric"

> dt[pred_bad_t_f == 1, sum(weight)]
[1] 6580818130
> dt[pred_bad_t_f == 1L, sum(weight)]
[1] 5414941720

As you can see, there is no reason for the second value to be any different. I believe the first value is correct because slight changes to the filtering logic generate that value repeatedly. Below are some examples:

> dt[1:nrow(dt)][pred_bad_t_f == 1L, sum(weight)]
[1] 6580818130
> dt[TRUE & pred_bad_t_f == 1L, sum(weight)]
[1] 6580818130
s
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20141014/6d0d6157/attachment.html>

From harishv_99 at yahoo.com  Tue Oct 14 14:28:08 2014
From: harishv_99 at yahoo.com (Harish)
Date: Tue, 14 Oct 2014 12:28:08 +0000 (UTC)
Subject: [datatable-help] Error in row filtering
In-Reply-To: <1652895759.38733.1413288834416.JavaMail.yahoo@jws10082.mail.ne1.yahoo.com>
References: <1652895759.38733.1413288834416.JavaMail.yahoo@jws10082.mail.ne1.yahoo.com>
Message-ID: <1310139015.43023.1413289688423.JavaMail.yahoo@jws100150.mail.ne1.yahoo.com>

My sent-mail seems to show only a truncated version of my original request. So let me summarize whatever got truncated.

My suspicion is that there is some issue with an optimization used when there is an integer comparison and that optimization is being turned off when the logic is more complex.

It would be great if someone can help me understand what the root cause is so I can check where else this could be happening in my code. My fear is that I do not know what other numbers I am getting might be incorrect.

Thanks a lot for your help.

Regards,
Harish
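
One way to narrow down a discrepancy like this is to compare each filter form against a plain vector scan that does not go through data.table's subsetting at all. The sketch below uses a small made-up table, so it only shows the shape of the check and is not expected to reproduce the problem. Treating datatable.auto.index as the option that governs automatic indexing is an assumption here; check it against the documentation for the installed version.

```r
library(data.table)

# Toy stand-in for the real table; column names follow the post above
set.seed(1)
dt <- data.table(
  pred_bad_t_f = sample(0:1, 1e5, replace = TRUE),  # integer 0/1
  weight       = runif(1e5)
)

# Reference value: base R vector scan, bypassing [.data.table
ref <- sum(dt$weight[dt$pred_bad_t_f == 1L])

# The two forms from the post, plus one of the reported workarounds
by_double  <- dt[pred_bad_t_f == 1,  sum(weight)]
by_integer <- dt[pred_bad_t_f == 1L, sum(weight)]
by_scan    <- dt[TRUE & pred_bad_t_f == 1L, sum(weight)]

# Assumption: this option switches automatic indexing off
old <- options(datatable.auto.index = FALSE)
no_index <- dt[pred_bad_t_f == 1L, sum(weight)]
options(old)

results <- c(by_double = by_double, by_integer = by_integer,
             by_scan = by_scan, no_index = no_index)

# TRUE for every form that agrees with the reference value
ok <- vapply(results, function(x) isTRUE(all.equal(x, ref)), logical(1))
print(ok)
```

Running the same comparison on the real table would show which forms disagree with the reference, and any other filter in the code base can be checked in the same way.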
 


From f_j_rod at hotmail.com  Wed Oct 22 19:19:05 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Wed, 22 Oct 2014 19:19:05 +0200
Subject: [datatable-help] Remove some data table rows based on three
	conditions
Message-ID: <BAY168-W219E8CD5CA923B238B39A1BA950@phx.gbl>

Dear all,
I'm working with a large database in which some rows have identical id and datep values. Of these
duplicated rows, I only want to keep the row with the maximum value of the marker variable. As an example:
DT <- data.table(
 id = rep(c(2,5),c(3,2)),
 datep = as.Date(c('1995-04-20','1995-04-20', '1997-02-19', '1998-01-15','1998-01-15')),
 marker = c(2,8,5,7,5),
 group=rep(c("A","B"),c(3,2))
 )
First, I sort by key variables: id, marker
DT[order(id,marker)]
 
But afterwards I've tried different things and I'm not able to get what I want:
DT[!duplicated(DT[c('id', 'datep')])]
DT[ !(duplicated %chin% c('id','datep'))]
DT[ !(duplicated %in% c('id','datep'))]
DT[,!(duplicated(DT[c("id","datep")])), by=list(id,datep)]
unique(DT[c('id','datep')])
Please, does anyone know how to do it?
 		 	   		  

From caneff at gmail.com  Wed Oct 22 19:38:23 2014
From: caneff at gmail.com (Chris Neff)
Date: Wed, 22 Oct 2014 17:38:23 +0000
Subject: [datatable-help] Remove some data table rows based on three
	conditions
References: <BAY168-W219E8CD5CA923B238B39A1BA950@phx.gbl>
Message-ID: <CAAuY0RUN4r9OLs8YSodTp_-Oc8PBs8E80bgiTx-T6Rqu3vOUFQ@mail.gmail.com>

DT[, .SD[which.max(marker),], by=.(id, datep)]

is what I would do.
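
For reference, a runnable sketch of that one-liner on the example data from the question. The second form, which collects row numbers with .I and subsets once, is an alternative added here for illustration; it is often reported to be faster on large tables, but that has not been measured here.

```r
library(data.table)

# Example data from the original question
DT <- data.table(
  id     = rep(c(2, 5), c(3, 2)),
  datep  = as.Date(c("1995-04-20", "1995-04-20", "1997-02-19",
                     "1998-01-15", "1998-01-15")),
  marker = c(2, 8, 5, 7, 5),
  group  = rep(c("A", "B"), c(3, 2))
)

# Keep, within each (id, datep) pair, the row with the largest marker
res <- DT[, .SD[which.max(marker)], by = .(id, datep)]

# Alternative: collect the winning row numbers, then subset once
idx  <- DT[, .I[which.max(marker)], by = .(id, datep)]$V1
res2 <- DT[idx]

print(res)
```

Note that which.max() keeps only the first row when several rows in a group tie for the maximum.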


On Wed Oct 22 2014 at 1:19:17 PM Frank S. <f_j_rod at hotmail.com> wrote:

> Dear all,
> I'm working with a large database in wich I have some rows which have
> identical id and datep variables. Of these
> duplicated rows, I only want to keep those row associated to the maximum
> value in marker variable. As an example:
> DT <- data.table(
>  id = rep(c(2,5),c(3,2)),
>  datep = as.Date(c('1995-04-20','1995-04-20',
> '1997-02-19', '1998-01-15','1998-01-15')),
>  marker = c(2,8,5,7,5),
>  group=rep(c("A","B"),c(3,2))
>  )
> First, I sort by key variables: id, marker
> DT[order(id,marker)]
>
> But afterwards I've tried different things and I'm not able to what I want:
> DT[!duplicated(DT[c('id', 'datep')])]
> DT[ !(duplicated %chin% c('id','datep'))]
> DT[ !(duplicated %in% c('id','datep'))]
> DT[,!(duplicated(DT[c("id","datep")])), by=list(id,datep)]
> unique(DT[c('id','datep')])
> Please, does anyone know how to do it?
>  _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/
> listinfo/datatable-help

From mel at mbacou.com  Wed Oct 22 22:00:47 2014
From: mel at mbacou.com (Bacou, Melanie)
Date: Wed, 22 Oct 2014 16:00:47 -0400
Subject: [datatable-help] Remove some data table rows based on three
	conditions
In-Reply-To: <BAY168-W219E8CD5CA923B238B39A1BA950@phx.gbl>
References: <BAY168-W219E8CD5CA923B238B39A1BA950@phx.gbl>
Message-ID: <54480CEF.2070204@mbacou.com>

Frank,

My understanding is unique() will remove all duplicated records (as 
defined by the key, or by the columns given in by=), keeping only the 
first occurrence. So if you sort your data by decreasing marker, you 
should achieve what you're looking for. Note that unique() also comes 
with a fromLast argument (default FALSE); set it to TRUE to keep the 
last occurrence of each duplicate instead.

dt <- data.table(
  id = rep(c(2,5),c(3,2)),
  datep = as.Date(c('1995-04-20','1995-04-20','1997-02-19','1998-01-15','1998-01-15')),
  marker = c(2,8,5,7,5),
  group=rep(c("A","B"),c(3,2))
  )

# Sort so the largest marker comes first within each (id, datep) pair
setorder(dt, id, datep, -marker)
# Duplicates are defined by id and datep together, so both are passed
# via by= rather than relying on a key on id alone
dt <- unique(dt, by = c("id", "datep"))

Mel.

On 10/22/2014 1:19 PM, Frank S. wrote:

> Dear all,
> I'm working with a large database in wich I have some rows which have 
> identical id and datep variables. Of these
> duplicated rows, I only want to keep those row associated to the 
> maximum value in marker variable. As an example:
> DT <- data.table(
>  id = rep(c(2,5),c(3,2)),
>  datep = as.Date(c('1995-04-20','1995-04-20', 
> '1997-02-19', '1998-01-15','1998-01-15')),
>  marker = c(2,8,5,7,5),
>  group=rep(c("A","B"),c(3,2))
>  )
> First, I sort by key variables: id, marker
> DT[order(id,marker)]
>
> But afterwards I've tried different things and I'm not able to what I 
> want:
> DT[!duplicated(DT[c('id', 'datep')])]
> DT[ !(duplicated %chin% c('id','datep'))]
> DT[ !(duplicated %in% c('id','datep'))]
> DT[,!(duplicated(DT[c("id","datep")])), by=list(id,datep)]
> unique(DT[c('id','datep')])
> Please, does anyone know how to do it?

-- 
Melanie BACOU
International Food Policy Research Institute
Snr. Program Manager, HarvestChoice
Work +1(202)862-5699
E-mail m.bacou at cgiar.org
Visit www.harvestchoice.org


From f_j_rod at hotmail.com  Thu Oct 23 11:30:04 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Thu, 23 Oct 2014 11:30:04 +0200
Subject: [datatable-help] Remove some data table rows based on three
 conditions
In-Reply-To: <54480CEF.2070204@mbacou.com>
References: <BAY168-W219E8CD5CA923B238B39A1BA950@phx.gbl>,
 <54480CEF.2070204@mbacou.com>
Message-ID: <BAY168-W70033DF4370AEE2ED8C5CFBA920@phx.gbl>

 Chris and Melanie, many thanks for your quick answers!!
 
It was what I needed!
 		 	   		  

From f_j_rod at hotmail.com  Fri Oct 24 19:09:37 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Fri, 24 Oct 2014 19:09:37 +0200
Subject: [datatable-help] Sistematically construction of 2 variables in data
	table
Message-ID: <BAY168-W67A660C3C08DCF07DFAE9FBA930@phx.gbl>

Dear all,

I'm writing to you because I'm not able to construct, in long-format data, 2 new variables
based on other columns for each id. Small example with only one subject:

dt <- data.table(
  id = rep(1,7),
  bday = rep(as.Date("1960-10-29"),7),
  start = rep(as.Date("2005-02-27"),7),
  marker0 = rep(125,7),
  datep = as.Date(c('2005-04-20','2005-10-28','2005-12-31','2006-08-10','2006-12-31','2007-02-19','2007-05-15')),
  marker = c(10,2,0,5,3,7,1)
 )

I would like to systematically construct a new data table from three conditions (I have the first):

1) It only keeps rows whose datep variable is 31st December or is the last date within id (I can get it):

newdt <- unique(rbind(dt[which(month(datep)==12 & as.POSIXlt(datep)$mday==31)],
  dt[, .SD[.N], by='id'])[,list(id,init=0,marker0,sum=0,dsum=datep)])[order(id,dsum)]

   id init marker0 sum       dsum
1:  1    0     125   0 2005-12-31
2:  1    0     125   0 2006-12-31
3:  1    0     125   0 2007-05-15

2) It has a so-called init variable, whose value is defined in the 1st row as the difference (in years)
   between start and bday, in the 2nd row as the difference between dsum of the 1st row and bday, and
   in the 3rd row as the difference between dsum of the 2nd row and bday:

   id       bday                                         init marker0 sum       dsum
1:  1 1960-10-29                 difftime(start, bday)/365.25     125   0 2005-12-31
2:  1 1960-10-29 difftime(as.Date("2005-12-31"), bday)/365.25     125   0 2006-12-31
3:  1 1960-10-29 difftime(as.Date("2006-12-31"), bday)/365.25     125   0 2007-05-15

3) It also has a sum variable, whose value is defined as follows:
   3a) For the first row within each id: the corresponding marker0 value.
   3b) For each of the following rows within id: the previous sum value plus
       the sum of marker's values across the previous year.

   id init marker0          sum       dsum
1:  1 44.3     125          125 2005-12-31
2:  1 45.2     125 125+10+2=137 2006-12-31
3:  1 46.2     125  137+5+3=145 2007-05-15

Thanks to all R-data table help community for your continuous help!

From my.r.help at gmail.com  Sat Oct 25 03:12:14 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 25 Oct 2014 09:12:14 +0800
Subject: [datatable-help] Sistematically construction of 2 variables in
 data table
In-Reply-To: <BAY168-W67A660C3C08DCF07DFAE9FBA930@phx.gbl>
References: <BAY168-W67A660C3C08DCF07DFAE9FBA930@phx.gbl>
Message-ID: <544AF8EE.5050103@gmail.com>

Would help if you add dsum to your example data for reproducibility.

On 10/25/2014 01:09 AM, Frank S. wrote:
>  Dear all,
> I'm writing to you because I'm not able to construct in a long data 2
> new variables
> based on other columns for each id. Small example with only one subject:
> 
> dt <- data.table(
>   id = rep(1,7),
>   bday = rep(as.Date("1960-10-29"),7),
>   start = rep(as.Date("2005-02-27"),7),
>   marker0 = rep(125,7),
>   datep =
> as.Date(c('2005-04-20','2005-10-28','2005-12-31','2006-08-10','2006-12-31','2007-02-19','2007-05-15')),
>   marker = c(10,2,0,5,3,7,1)
>  )
> 
>  
> 
> I would want to construct sistematically a new data table from three
> conditions (I have the first):
> 
>  
> 
> 1) It only keeps rows whose datep variable is 31st december or is the
> last date within id (I can get it)
> newdt <- unique(rbind(dt[which(month(datep)==12 &
> as.POSIXlt(datep)$mday==31)],
>  dt[, .SD[.N],
> by='id'])[,list(id,init=0,marker0,sum=0,dsum=datep)])[order(id,dsum)]
> 
>    id init marker0   sum             dsum
> 1:  1    0     125          0     2005-12-31
> 2:  1    0     125          0     2006-12-31
> 3:  1    0     125          0     2007-05-15   
> 
>  
> 
> 2) It has a so-called init variable, whose value is defined in 1st row
> as difference (yr) between
>    start and bday, in 2nd row as difference between dsum of 1st row and
> bday and finally the 3rd
>    row as difference between dsum of 2nd row and bday:
>    id      
> bday                                                               
>  ini                         marker0    sum       dsum
> 1:  1 1960-10-29                                difftime(start,
> bday)/365.25        125            0  2005-12-31    
> 2:  1 1960-10-29   difftime(as.Date("2005-12-31"), bday)/365.25    
> 125            0  2006-12-31    
> 3:  1 1960-10-29   difftime(as.Date("2006-12-31"), bday)/365.25    
> 125            0  2007-05-15    
> 
>  
> 
> 3) It has also a sum variable, whose value is defined as follows:
>    3a) For first row within each id: The corresponding marker0 value.
>    3b) For each of the following rows within id: Previous sum value plus
>   the sum of marker's values across the previous year.
> 
>    id     init  marker0          sum                        dsum
> 1:  1   44.3      125             125                    2005-12-31    
> 2:  1   45.2      125    125+10+2=137        2006-12-31    
> 3:  1   46.2      125     137+5+3=145         2007-05-15
> 
>  
> 
> Thanks to all R-data table help community for your continuous help!
> 

From f_j_rod at hotmail.com  Sat Oct 25 13:43:10 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Sat, 25 Oct 2014 13:43:10 +0200
Subject: [datatable-help] Sistematically construction of 2 variables in
 data table
In-Reply-To: <544AF8EE.5050103@gmail.com>
References: <BAY168-W67A660C3C08DCF07DFAE9FBA930@phx.gbl>,
 <544AF8EE.5050103@gmail.com>
Message-ID: <BAY168-W81ED200B7D24B4424EADBABA900@phx.gbl>

Hi Michael,

The dsum variable results from datep, contained in the original data table I give, called dt. Thus,
dsum just keeps those dates contained in the datep variable which correspond to 31st December or to the
last record within each id.

Please let me know if you have any further questions. I'm aware that the question is not easy.

Regards

From my.r.help at gmail.com  Sat Oct 25 15:22:41 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 25 Oct 2014 21:22:41 +0800
Subject: [datatable-help] Sistematically construction of 2 variables in
 data table
In-Reply-To: <BAY168-W81ED200B7D24B4424EADBABA900@phx.gbl>
References: <BAY168-W67A660C3C08DCF07DFAE9FBA930@phx.gbl>,
 <544AF8EE.5050103@gmail.com>
 <BAY168-W81ED200B7D24B4424EADBABA900@phx.gbl>
Message-ID: <544BA421.3010703@gmail.com>

Sorry, I skipped that part.

On 10/25/2014 07:43 PM, Frank S. wrote:
> Hi Michael,
> 
>  
> 
> The dsum variable results from datep, contained in the original /data
> table /I give/, /called dt. Thus,
> 
> dsum is just to keep those dates contained in datep variable which
> correspond to 31st december or to the
> 
> last register within each id.
> 
>  
> 
> Please, tell me for any further questions. I'm aware that the question
> is not easy.
> 
>  
> 
> Regards
> 

From f_j_rod at hotmail.com  Mon Oct 27 14:06:33 2014
From: f_j_rod at hotmail.com (Frank S.)
Date: Mon, 27 Oct 2014 14:06:33 +0100
Subject: [datatable-help] Sistematically construction of 2 variables in
 data table
In-Reply-To: <544BA421.3010703@gmail.com>
References: <BAY168-W67A660C3C08DCF07DFAE9FBA930@phx.gbl>,
 <544AF8EE.5050103@gmail.com>
 <BAY168-W81ED200B7D24B4424EADBABA900@phx.gbl>,<544BA421.3010703@gmail.com>
Message-ID: <BAY168-W680D8F3DA85AF83825482EBA9E0@phx.gbl>

Hi all,

I have been able to solve (I think) the second question:

newdt[, init := c( round(difftime(start[1], bday[1])/365.25,1) ,
 round(difftime(dsum[1:(.N-1)],bday[1:(.N-1)])/365.25,1) )][] #  [] is to print the result
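
For completeness, a sketch that builds both init and sum in one pass, per id, on the example data from the start of this thread. It is one possible reading of conditions 2) and 3), not a confirmed answer from the list: it assumes "the previous year" means all marker values dated up to and including the previous kept date, which matches the 125 / 137 / 145 example. The helper column cum and the index vector keep are names introduced here for illustration.

```r
library(data.table)

# Example data from the original question
dt <- data.table(
  id      = rep(1, 7),
  bday    = rep(as.Date("1960-10-29"), 7),
  start   = rep(as.Date("2005-02-27"), 7),
  marker0 = rep(125, 7),
  datep   = as.Date(c("2005-04-20", "2005-10-28", "2005-12-31", "2006-08-10",
                      "2006-12-31", "2007-02-19", "2007-05-15")),
  marker  = c(10, 2, 0, 5, 3, 7, 1)
)

setorder(dt, id, datep)

# Running total of marker within each id (helper column, assumed name)
dt[, cum := cumsum(marker), by = id]

# 1) Keep 31 December rows and the last row within each id
keep <- dt[, .I[(month(datep) == 12 & mday(datep) == 31) |
                seq_len(.N) == .N], by = id]$V1
newdt <- dt[keep]

newdt[, `:=`(
  # 2) Age in years at 'start' for the first row, then at the previous kept date
  init = round(as.numeric(difftime(c(start[1], head(datep, -1)), bday,
                                   units = "days")) / 365.25, 1),
  # 3) marker0 plus all marker values up to the previous kept date
  sum  = marker0 + c(0, head(cum, -1))
), by = id]

setnames(newdt, "datep", "dsum")
print(newdt[, .(id, init, marker0, sum, dsum)])
```

Because both columns are computed with by = id, the same code should carry over to a table with many subjects, provided start, bday and marker0 are constant within id as in the example.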