From saporta at scarletmail.rutgers.edu Sun Mar 3 19:34:05 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Sun, 3 Mar 2013 13:34:05 -0500
Subject: [datatable-help] Benchmarks for reshaping data
Message-ID:

Hello All,

There were some questions on SO today regarding reshaping data which provided good opportunities to run benchmarks. I'm sending the links here in case others are interested:

http://bit.ly/ZZXA6X
http://bit.ly/YkdapY

Cheers,
Rick

--
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From victor.kryukov at gmail.com Sun Mar 3 23:25:52 2013
From: victor.kryukov at gmail.com (Victor Kryukov)
Date: Sun, 3 Mar 2013 14:25:52 -0800
Subject: [datatable-help] Error in a package that imports data.table
Message-ID: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com>

Hello,

I'm developing an R package which will be used internally at my company, and I'm having trouble using data.table. I'm very new to package development and I'm not really sure whether the errors I see are related to data.table or not, but here it is anyway.

I have a function that imports data from .csv files and cleans the data (subsets, converting fields to numeric etc.). At the end of the function, I convert the resulting data.frame to data.table and return the result:

ProcessData <- function(...) {
    ...
    df <- data.table(df)
    df
}

When I use this function standalone, after

library(data.table)

everything works as expected. However, when I define this function as part of a package and later call it, I get the following error:

Error in rbind(deparse.level, ...)
: could not find function ".rbind.data.table" Please note that in the package .R files, I'm not importing data.table directly with library(data.package) but rather have `import(data.table)` statement in my NAMESPACE, as recommended here https://github.com/hadley/devtools/wiki/Namespaces. When I import data.table directly with library(data.table) after importing my package, everything works as expected. I suspect there may be something going wrong with namespaces in data.table. My environment: I'm using R 2.15.3 on Mac and have tested the above on both data.table 1.8.6 and 1.8.7. Please let me know if I need to provide more info. Any help will be much appreciated! Regards, Victor From mdowle at mdowle.plus.com Mon Mar 4 00:03:01 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 03 Mar 2013 23:03:01 +0000 Subject: [datatable-help] Benchmarks for reshaping data In-Reply-To: References: Message-ID: <71673794ac7b88c1caeba6eef69edc32@imap.plus.net> Hi, Many thanks. I commented/answered there. Matthew On 03.03.2013 18:34, Ricardo Saporta wrote: > Hello All, > There were some questions on SO today regarding reshaping data which provided good opportunities to run benchmarks. > I'm sending the links here in case others are interested: > http://bit.ly/ZZXA6X [1] > http://bit.ly/YkdapY [2] > Cheers, > Rick -- > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [3] Links: ------ [1] http://bit.ly/ZZXA6X [2] http://bit.ly/YkdapY [3] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Mon Mar 4 00:26:30 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 03 Mar 2013 23:26:30 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> Message-ID: Hi, Did you include data.table in either the Imports or Depends field of your package's DESCRIPTION file? I've just improved data.table FAQ 6.9 to make that clearer. If it still doesn't work, does your package fully pass "R CMD check"? Matthew On 03.03.2013 22:25, Victor Kryukov wrote: > Hello, > > I'm developing an R package which will be used internally at my > company, and I have troubles using data.table. I'm very new to > package > development and I'm not really sure whether the errors I see are > related to data.table or not, but here it is anyway. > > I have a function that imports data from .csv files and cleans the > data (subsets, converting fields to numeric etc.). As the end of the > function definition, I convert the resulting data.frame to data.table > and return the result: > > ProcessData <- function(?) { > ... > df <- data.table(df) > df > } > > When I use this function standalone, after > > library(data.package) > > everything works as expected. However, when I'm defining this > function as a part of a package and later call it, I'm getting the > following error: > > Error in rbind(deparse.level, ...) : > could not find function ".rbind.data.table" > > Please note that in the package .R files, I'm not importing > data.table directly with library(data.package) but rather have > `import(data.table)` statement in my NAMESPACE, as recommended here > https://github.com/hadley/devtools/wiki/Namespaces. > > When I import data.table directly with library(data.table) after > importing my package, everything works as expected. > > I suspect there may be something going wrong with namespaces in > data.table. 
>
> My environment: I'm using R 2.15.3 on Mac and have tested the above
> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to
> provide more info. Any help will be much appreciated!
>
> Regards,
> Victor
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From michael.nelson at sydney.edu.au Mon Mar 4 00:47:29 2013
From: michael.nelson at sydney.edu.au (Michael Nelson)
Date: Sun, 3 Mar 2013 23:47:29 +0000
Subject: [datatable-help] which parent.frame is more correct
Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD5827DC71@EX-MBX-PRO-04.mcs.usyd.edu.au>

In answering http://stackoverflow.com/a/15102156/1385941 I confidently stated that parent.frame(3) was the correct frame to use, and I stand by that, but am slightly confused over how parent.frame(1) and parent.frame(3) differ in how they are evaluated.

More specifically, I don't understand why `parent.frame(1)` works as it does.
Take for example:

x <- 3:4

dt <- data.table(x = 1:5, y = 5:1, key = 'x')

foo <- function(){
  x <- 1:2
  for(n in 1:5) {
    print(dt[list(get('x', parent.frame(n)))])
  }
}

foo()

# n = 1
# uses parent.frame of foo
#
#    x y
# 1: 3 3
# 2: 4 2

# n = 2
# some kind of self join of data.table
# output equivalent of (dt[dt[list(x)]])
#    x y y.1
# 1: 1 5   5
# 2: 2 4   4
# 3: 3 3   3
# 4: 4 2   2
# 5: 5 1   1

# n = 3
# uses parent.frame of call to `[.data.table`
#    x y
# 1: 1 5
# 2: 2 4

# n >= 4
# uses parent.frame of foo again (makes sense I think)
#    x y
# 1: 5 1
# 2: 4 2

________________________________________
From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Matthew Dowle [mdowle at mdowle.plus.com]
Sent: Monday, 4 March 2013 10:26 AM
To: victor.kryukov at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Error in a package that imports data.table

Hi,

Did you include data.table in either the Imports or Depends field of your package's DESCRIPTION file?

I've just improved data.table FAQ 6.9 to make that clearer.

If it still doesn't work, does your package fully pass "R CMD check"?

Matthew

On 03.03.2013 22:25, Victor Kryukov wrote:
> Hello,
>
> I'm developing an R package which will be used internally at my
> company, and I have trouble using data.table. I'm very new to
> package development and I'm not really sure whether the errors I see
> are related to data.table or not, but here it is anyway.
>
> I have a function that imports data from .csv files and cleans the
> data (subsets, converting fields to numeric etc.). At the end of the
> function, I convert the resulting data.frame to data.table
> and return the result:
>
> ProcessData <- function(...) {
>     ...
>     df <- data.table(df)
>     df
> }
>
> When I use this function standalone, after
>
> library(data.table)
>
> everything works as expected.
However, when I'm defining this > function as a part of a package and later call it, I'm getting the > following error: > > Error in rbind(deparse.level, ...) : > could not find function ".rbind.data.table" > > Please note that in the package .R files, I'm not importing > data.table directly with library(data.package) but rather have > `import(data.table)` statement in my NAMESPACE, as recommended here > https://github.com/hadley/devtools/wiki/Namespaces. > > When I import data.table directly with library(data.table) after > importing my package, everything works as expected. > > I suspect there may be something going wrong with namespaces in > data.table. > > My environment: I'm using R 2.15.3 on Mac and have tested the above > on both data.table 1.8.6 and 1.8.7. Please let me know if I need to > provide more info. Any help will be much appreciated! > > Regards, > Victor > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Mon Mar 4 01:56:33 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 04 Mar 2013 00:56:33 +0000 Subject: [datatable-help] which parent.frame is more correct In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD5827DC71@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <6FB5193A6CDCDF499486A833B7AFBDCD5827DC71@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <20aad0cfa80570d56aacdf745fb14461@imap.plus.net> Hi, In general it's probably best to use a unique name like "..tmpvar" and let `i` or `j` find that via scope. Passing a specific n to parent.frame(n) might work now, but may be dependent on data.table internals not changing in the future. 
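[As a side note on the mechanics under discussion: parent.frame(n) walks n calls up the calling stack. That behaviour can be seen without data.table at all; a minimal base-R sketch (function names here are illustrative only):]

```r
outer <- function() {
  x <- "outer's x"
  inner()
}

inner <- function() {
  # parent.frame(1) is the frame of inner's caller (i.e. outer), so get()
  # restricted with inherits=FALSE finds only outer's local x there,
  # never the global x
  get("x", envir = parent.frame(1), inherits = FALSE)
}

x <- "global x"
outer()   # returns "outer's x", not the global x
```

With data.table, the extra wrinkle is that `[.data.table` itself calls eval() internally, which inserts additional frames between the user's function and the point where get() runs.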
Also, note the Notes in ?parent.frame (i.e. user beware). With those caveats out of the way, some answers inline ...

On 03.03.2013 23:47, Michael Nelson wrote:
> In answering http://stackoverflow.com/a/15102156/1385941
>
> I confidently stated that parent.frame(3) was the correct frame to use,
> and I stand by that, but am slightly confused over how parent.frame(1)
> and parent.frame(3) differ in how they are evaluated.
>
> More specifically, I don't understand why `parent.frame(1)` works as
> it does.
>
> Take for example
>
> x <- 3:4
>
> dt <- data.table(x = 1:5, y = 5:1, key = 'x')
>
> foo <- function(){
>   x <- 1:2
>   for(n in 1:5) {
>     print(dt[list(get('x', parent.frame(n)))])
>   }
> }
>
> foo()
>
> # n = 1
> # uses parent.frame of foo
> #
> #    x y
> # 1: 3 3
> # 2: 4 2

Setting get(..., inherits=FALSE) reveals x isn't in parent.frame(1). That's base::eval itself ([.data.table internals call eval directly). get(..., inherits=TRUE) then finds "x" in .GlobalEnv via search().

>
> # n = 2
> # some kind of self join of data.table
> # output equivalent of (dt[dt[list(x)]])
>
> #    x y y.1
> # 1: 1 5   5
> # 2: 2 4   4
> # 3: 3 3   3
> # 4: 4 2   2
> # 5: 5 1   1

That's picking up the "x" column name in dt, because the eval inside [.data.table passes x to it. In other words, that's the normal place variables in i are looked for.

> # n = 3
> # uses parent.frame of call to `[.data.table`

Yes, I think so. I'm not always certain myself. I don't use parent.frame(3) but it seems to work.
>
> #    x y
> # 1: 1 5
> # 2: 2 4
>
> # n >= 4
> # uses parent.frame of foo again (makes sense I think)
> #    x y
> # 1: 5 1
> # 2: 4 2

If you need to refer to a specific scope without using parent.frame(n), this might be safer :

foo <- function(){
    x <- 1:2
    ..localenv = environment()
    print(dt[list(get('x', ..localenv, inherits=FALSE))])
}

which is what ..() is intended to do in future, built in :

foo <- function(){
    x <- 1:2
    print(dt[..(x)])
}

But currently I prefer doing :

foo <- function(){
    ..x <- 1:2
    print(dt[list(..x)])
}

which is what I meant at the top: in general it's probably best to use a unique name like ..tmpvar and let `i` or `j` find that via scope.

Matthew

From victor.kryukov at gmail.com Mon Mar 4 07:32:36 2013
From: victor.kryukov at gmail.com (Victor Kryukov)
Date: Sun, 3 Mar 2013 22:32:36 -0800
Subject: [datatable-help] Error in a package that imports data.table
In-Reply-To:
References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com>
Message-ID: <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com>

Hi Matthew,

my DESCRIPTION file has the following section:

Imports:
    data.table,
    lubridate

and my (generated) NAMESPACE contains

export(ProcessTransactionSurvey)
import(data.table)
import(lubridate)

My R CMD check (run with check() from devtools) mostly runs OK but fails at the end with the following error, which is expected since I haven't created any documentation yet. I'm not sure yet how to fix this LaTeX warning (I do have LaTeX installed on my machine).

* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
* checking PDF version of manual without hyperrefs or index ... ERROR
Error: Command failed (1)

Anything else I should check?

Victor


On Mar 3, 2013, at 3:26 PM, Matthew Dowle wrote:

>
> Hi,
>
> Did you include data.table in either the Imports or Depends field of your package's DESCRIPTION file?
>
> I've just improved data.table FAQ 6.9 to make that clearer.
> > If it still doesn't work, does your package fully pass "R CMD check"? > > Matthew > > > On 03.03.2013 22:25, Victor Kryukov wrote: >> Hello, >> >> I'm developing an R package which will be used internally at my >> company, and I have troubles using data.table. I'm very new to package >> development and I'm not really sure whether the errors I see are >> related to data.table or not, but here it is anyway. >> >> I have a function that imports data from .csv files and cleans the >> data (subsets, converting fields to numeric etc.). As the end of the >> function definition, I convert the resulting data.frame to data.table >> and return the result: >> >> ProcessData <- function(?) { >> ... >> df <- data.table(df) >> df >> } >> >> When I use this function standalone, after >> >> library(data.package) >> >> everything works as expected. However, when I'm defining this >> function as a part of a package and later call it, I'm getting the >> following error: >> >> Error in rbind(deparse.level, ...) : >> could not find function ".rbind.data.table" >> >> Please note that in the package .R files, I'm not importing >> data.table directly with library(data.package) but rather have >> `import(data.table)` statement in my NAMESPACE, as recommended here >> https://github.com/hadley/devtools/wiki/Namespaces. >> >> When I import data.table directly with library(data.table) after >> importing my package, everything works as expected. >> >> I suspect there may be something going wrong with namespaces in data.table. >> >> My environment: I'm using R 2.15.3 on Mac and have tested the above >> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to >> provide more info. Any help will be much appreciated! 
>> >> Regards, >> Victor >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Mon Mar 4 08:35:15 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 04 Mar 2013 07:35:15 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> Message-ID: <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Hi, I don't see what's wrong then. Can you whittle the package down to the essential code such that you can attach it and we can reproduce? Thanks, Matthew On 04.03.2013 06:32, Victor Kryukov wrote: > Hi Matthew, > > my DESCRIPTION file has the following section: > > Imports: > data.table, > lubridate > > and my (generated) NAMESPACE contains > > export(ProcessTransactionSurvey) > import(data.table) > import(lubridate) > > My R CMD CHECK (run with check() from devtools) mostly runs OK but > fails at the end with the following error, which is expected since I > haven't created any documentation yet. I'm not sure yet have to fix > this LaTeX warning (I do have latex installed on my machine). > > * checking PDF version of manual ... WARNING > LaTeX errors when creating PDF version. > This typically indicates Rd problems. > LaTeX errors found: > * checking PDF version of manual without hyperrefs or index ... ERROR > Error: Command failed (1) > > Anything else I should check? > > Victor > > > On Mar 3, 2013, at 3:26 PM, Matthew Dowle > wrote: > >> >> Hi, >> >> Did you include data.table in either the Imports or Depends field of >> your package's DESCRIPTION file? >> >> I've just improved data.table FAQ 6.9 to make that clearer. 
>> >> If it still doesn't work, does your package fully pass "R CMD >> check"? >> >> Matthew >> >> >> On 03.03.2013 22:25, Victor Kryukov wrote: >>> Hello, >>> >>> I'm developing an R package which will be used internally at my >>> company, and I have troubles using data.table. I'm very new to >>> package >>> development and I'm not really sure whether the errors I see are >>> related to data.table or not, but here it is anyway. >>> >>> I have a function that imports data from .csv files and cleans the >>> data (subsets, converting fields to numeric etc.). As the end of >>> the >>> function definition, I convert the resulting data.frame to >>> data.table >>> and return the result: >>> >>> ProcessData <- function(?) { >>> ... >>> df <- data.table(df) >>> df >>> } >>> >>> When I use this function standalone, after >>> >>> library(data.package) >>> >>> everything works as expected. However, when I'm defining this >>> function as a part of a package and later call it, I'm getting the >>> following error: >>> >>> Error in rbind(deparse.level, ...) : >>> could not find function ".rbind.data.table" >>> >>> Please note that in the package .R files, I'm not importing >>> data.table directly with library(data.package) but rather have >>> `import(data.table)` statement in my NAMESPACE, as recommended here >>> https://github.com/hadley/devtools/wiki/Namespaces. >>> >>> When I import data.table directly with library(data.table) after >>> importing my package, everything works as expected. >>> >>> I suspect there may be something going wrong with namespaces in >>> data.table. >>> >>> My environment: I'm using R 2.15.3 on Mac and have tested the above >>> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to >>> provide more info. Any help will be much appreciated! 
>>> >>> Regards, >>> Victor >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mailinglist.honeypot at gmail.com Mon Mar 4 22:39:22 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Mon, 4 Mar 2013 16:39:22 -0500 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: I'm not sure if order matters in the NAMESPACE, but maybe you could try to write it manually and put the import statements up top? I haven't come across this problem, and I've got several packages that use data.table via importing it as you show here ... On Mon, Mar 4, 2013 at 2:35 AM, Matthew Dowle wrote: > > Hi, > > I don't see what's wrong then. > > Can you whittle the package down to the essential code such that you can > attach it and we can reproduce? > > Thanks, > Matthew > > > > On 04.03.2013 06:32, Victor Kryukov wrote: >> >> Hi Matthew, >> >> my DESCRIPTION file has the following section: >> >> Imports: >> data.table, >> lubridate >> >> and my (generated) NAMESPACE contains >> >> export(ProcessTransactionSurvey) >> import(data.table) >> import(lubridate) >> >> My R CMD CHECK (run with check() from devtools) mostly runs OK but >> fails at the end with the following error, which is expected since I >> haven't created any documentation yet. I'm not sure yet have to fix >> this LaTeX warning (I do have latex installed on my machine). 
>> >> * checking PDF version of manual ... WARNING >> LaTeX errors when creating PDF version. >> This typically indicates Rd problems. >> LaTeX errors found: >> * checking PDF version of manual without hyperrefs or index ... ERROR >> Error: Command failed (1) >> >> Anything else I should check? >> >> Victor >> >> >> On Mar 3, 2013, at 3:26 PM, Matthew Dowle wrote: >> >>> >>> Hi, >>> >>> Did you include data.table in either the Imports or Depends field of your >>> package's DESCRIPTION file? >>> >>> I've just improved data.table FAQ 6.9 to make that clearer. >>> >>> If it still doesn't work, does your package fully pass "R CMD check"? >>> >>> Matthew >>> >>> >>> On 03.03.2013 22:25, Victor Kryukov wrote: >>>> >>>> Hello, >>>> >>>> I'm developing an R package which will be used internally at my >>>> company, and I have troubles using data.table. I'm very new to package >>>> development and I'm not really sure whether the errors I see are >>>> related to data.table or not, but here it is anyway. >>>> >>>> I have a function that imports data from .csv files and cleans the >>>> data (subsets, converting fields to numeric etc.). As the end of the >>>> function definition, I convert the resulting data.frame to data.table >>>> and return the result: >>>> >>>> ProcessData <- function(?) { >>>> ... >>>> df <- data.table(df) >>>> df >>>> } >>>> >>>> When I use this function standalone, after >>>> >>>> library(data.package) >>>> >>>> everything works as expected. However, when I'm defining this >>>> function as a part of a package and later call it, I'm getting the >>>> following error: >>>> >>>> Error in rbind(deparse.level, ...) : >>>> could not find function ".rbind.data.table" >>>> >>>> Please note that in the package .R files, I'm not importing >>>> data.table directly with library(data.package) but rather have >>>> `import(data.table)` statement in my NAMESPACE, as recommended here >>>> https://github.com/hadley/devtools/wiki/Namespaces. 
>>>> >>>> When I import data.table directly with library(data.table) after >>>> importing my package, everything works as expected. >>>> >>>> I suspect there may be something going wrong with namespaces in >>>> data.table. >>>> >>>> My environment: I'm using R 2.15.3 on Mac and have tested the above >>>> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to >>>> provide more info. Any help will be much appreciated! >>>> >>>> Regards, >>>> Victor >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From mdowle at mdowle.plus.com Wed Mar 6 09:47:08 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 06 Mar 2013 08:47:08 +0000 Subject: [datatable-help] v1.8.8 is now on CRAN Message-ID: <303fa2c18913ee4f367c1521e97117f0@imap.plus.net> Please see NEWS : https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable and the new paragraphs at the top of ?fread : https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable As normal it will take a few days to reach all mirrors. R-Forge has now bumped to 1.8.9. 
The idea of even numbers on CRAN is to make it impossible for anyone to be running a slightly different version of 1.8.8 other than the one on CRAN (1.8.8 has never been available from R-Forge, even fleetingly). Matthew From statquant at outlook.com Wed Mar 6 13:37:51 2013 From: statquant at outlook.com (stat quant) Date: Wed, 6 Mar 2013 13:37:51 +0100 Subject: [datatable-help] datatable-help Digest, Vol 37, Issue 4 In-Reply-To: References: Message-ID: Hello Matthew, many thanks for all the work and all the improvements on data.table. Just a practical question : looking on http://cran.r-project.org/web/packages/data.table/index.html I see that mac/win versions are still 1.8.6 unlike the sources to be built (tar.gz), is it an error or is it expected (I am not aware of what is requested by cran to package devs) Again many thanks 2013/3/6 > Send datatable-help mailing list submissions to > datatable-help at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > or, via email, send a message with subject or body 'help' to > datatable-help-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > datatable-help-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of datatable-help digest..." > > > Today's Topics: > > 1. 
v1.8.8 is now on CRAN (Matthew Dowle) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 06 Mar 2013 08:47:08 +0000 > From: Matthew Dowle > To: > Subject: [datatable-help] v1.8.8 is now on CRAN > Message-ID: <303fa2c18913ee4f367c1521e97117f0 at imap.plus.net> > Content-Type: text/plain; charset=UTF-8; format=flowed > > > Please see NEWS : > > > > https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable > > and the new paragraphs at the top of ?fread : > > > > https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable > > As normal it will take a few days to reach all mirrors. > > R-Forge has now bumped to 1.8.9. The idea of even numbers on CRAN is to > make it > impossible for anyone to be running a slightly different version of > 1.8.8 other than > the one on CRAN (1.8.8 has never been available from R-Forge, even > fleetingly). > > Matthew > > > > > ------------------------------ > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > End of datatable-help Digest, Vol 37, Issue 4 > ********************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Mar 6 14:02:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 06 Mar 2013 13:02:44 +0000 Subject: [datatable-help] datatable-help Digest, Vol 37, Issue 4 In-Reply-To: References: Message-ID: No problem. Yes that's normal. It just takes a few days for all CRAN web links to update. Some parts will update faster than others, too. Sometimes it might be our browsers as well, so a Ctrl+F5 to flush the browser's cache sometimes helps (or sometimes that's not even enough and full cache purge is needed). 
On the "CRAN checks:" page you'll see tests now passing OK for r-release (and r-oldrel) for Windows. That's the page I watch. Once those update to 1.8.8 (as they have) and say "OK" (as they have) it's usually not too long (within 24 hours) to update the .zip link. Those red ERRORs turning to black OK is (sadly) the exciting bit for me! On 06.03.2013 12:37, stat quant wrote: > Hello Matthew, > many thanks for all the work and all the improvements on data.table. > Just a practical question : > looking on http://cran.r-project.org/web/packages/data.table/index.html [12] I see that mac/win versions are still 1.8.6 unlike the sources to be built (tar.gz), is it an error or is it expected (I am not aware of what is requested by cran to package devs) > > Again many thanks > > 2013/3/6 > >> Send datatable-help mailing list submissions to >> datatable-help at lists.r-forge.r-project.org [1] >> >> To subscribe or unsubscribe via the World Wide Web, visit >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >> >> or, via email, send a message with subject or body 'help' to >> datatable-help-request at lists.r-forge.r-project.org [3] >> >> You can reach the person managing the list at >> datatable-help-owner at lists.r-forge.r-project.org [4] >> >> When replying, please edit your Subject line so it is more specific >> than "Re: Contents of datatable-help digest..." >> >> Today's Topics: >> >> 1. 
v1.8.8 is now on CRAN (Matthew Dowle) >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Wed, 06 Mar 2013 08:47:08 +0000 >> From: Matthew Dowle >> To: >> Subject: [datatable-help] v1.8.8 is now on CRAN >> Message-ID: <303fa2c18913ee4f367c1521e97117f0 at imap.plus.net [7]> >> Content-Type: text/plain; charset=UTF-8; format=flowed >> >> Please see NEWS : >> >> https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable [8] >> >> and the new paragraphs at the top of ?fread : >> >> https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable [9] >> >> As normal it will take a few days to reach all mirrors. >> >> R-Forge has now bumped to 1.8.9. The idea of even numbers on CRAN is to >> make it >> impossible for anyone to be running a slightly different version of >> 1.8.8 other than >> the one on CRAN (1.8.8 has never been available from R-Forge, even >> fleetingly). >> >> Matthew >> >> ------------------------------ >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [10] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [11] >> >> End of datatable-help Digest, Vol 37, Issue 4 >> ********************************************* Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:datatable-help-request at lists.r-forge.r-project.org [4] mailto:datatable-help-owner at lists.r-forge.r-project.org [5] mailto:mdowle at mdowle.plus.com [6] mailto:datatable-help at lists.r-forge.r-project.org [7] mailto:303fa2c18913ee4f367c1521e97117f0 at imap.plus.net [8] https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable [9] https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable [10] mailto:datatable-help at 
lists.r-forge.r-project.org
[11] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[12] http://cran.r-project.org/web/packages/data.table/index.html
[13] mailto:datatable-help-request at lists.r-forge.r-project.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From victor.kryukov at gmail.com Thu Mar 7 04:47:18 2013
From: victor.kryukov at gmail.com (Victor Kryukov)
Date: Wed, 6 Mar 2013 19:47:18 -0800
Subject: [datatable-help] Error in a package that imports data.table
In-Reply-To:
References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net>
Message-ID:

Hello everyone, and thanks for your replies.

It looks like the problem was in my use of lubridate with data.table. Removing lubridate from imports fixes it. Every time I load *both* of these packages in an R session, I get the following:

> library(lubridate)
> library(data.table)
data.table 1.8.8  For help type: help("data.table")

Attaching package: ‘data.table’

The following object(s) are masked from ‘package:lubridate’:

    hour, mday, month, quarter, wday, week, yday, year

I hadn't really paid attention to it, but now that I've started investigating, I noticed that data.table also defines all these functions as helpers to work with IDateTime. So there should be a name conflict somewhere. I'm puzzled about why data.table would include these functions/classes (isn't it better to leave date handling to specialized classes?), but I understand that there may be a good reason for that.

Unfortunately, my code uses lubridate heavily (it's just too good...), which leaves me in a tough spot - I would like to use both. If I had to choose, I would be forced to replace all lubridate code with standard R, which is not fun, but I guess I have to bite the bullet.
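[The masking reported above happens because both packages export functions named hour, month, year, etc., and whichever package is attached last wins on the search path. One workaround, sketched below under the assumption that both packages are installed: the masked versions stay reachable through explicit `::` qualification, so lubridate-heavy code can keep working even with data.table attached last.]

```r
library(lubridate)
library(data.table)   # attached last, so its month(), year(), ... mask lubridate's

d <- as.Date("2013-03-06")

# Explicit namespace qualification side-steps the search-path masking:
lubridate::month(d)    # lubridate's month(), regardless of attach order
data.table::month(d)   # data.table's IDateTime helper
```

Inside a package the same idea applies: calls written as lubridate::month() resolve in lubridate's namespace no matter what the user has attached.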
Regards, Victor Yours Sincerely, Victor Kryukov US cell: +1-650-733-6510 On Mon, Mar 4, 2013 at 1:39 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > I'm not sure if order matters in the NAMESPACE, but maybe you could > try to write it manually and put the import statements up top? > > I haven't come across this problem, and I've got several packages that > use data.table via importing it as you show here ... > > On Mon, Mar 4, 2013 at 2:35 AM, Matthew Dowle > wrote: > > > > Hi, > > > > I don't see what's wrong then. > > > > Can you whittle the package down to the essential code such that you can > > attach it and we can reproduce? > > > > Thanks, > > Matthew > > > > > > > > On 04.03.2013 06:32, Victor Kryukov wrote: > >> > >> Hi Matthew, > >> > >> my DESCRIPTION file has the following section: > >> > >> Imports: > >> data.table, > >> lubridate > >> > >> and my (generated) NAMESPACE contains > >> > >> export(ProcessTransactionSurvey) > >> import(data.table) > >> import(lubridate) > >> > >> My R CMD CHECK (run with check() from devtools) mostly runs OK but > >> fails at the end with the following error, which is expected since I > >> haven't created any documentation yet. I'm not sure yet have to fix > >> this LaTeX warning (I do have latex installed on my machine). > >> > >> * checking PDF version of manual ... WARNING > >> LaTeX errors when creating PDF version. > >> This typically indicates Rd problems. > >> LaTeX errors found: > >> * checking PDF version of manual without hyperrefs or index ... ERROR > >> Error: Command failed (1) > >> > >> Anything else I should check? > >> > >> Victor > >> > >> > >> On Mar 3, 2013, at 3:26 PM, Matthew Dowle > wrote: > >> > >>> > >>> Hi, > >>> > >>> Did you include data.table in either the Imports or Depends field of > your > >>> package's DESCRIPTION file? > >>> > >>> I've just improved data.table FAQ 6.9 to make that clearer. > >>> > >>> If it still doesn't work, does your package fully pass "R CMD check"? 
> >>> > >>> Matthew > >>> > >>> > >>> On 03.03.2013 22:25, Victor Kryukov wrote: > >>>> > >>>> Hello, > >>>> > >>>> I'm developing an R package which will be used internally at my > >>>> company, and I have troubles using data.table. I'm very new to package > >>>> development and I'm not really sure whether the errors I see are > >>>> related to data.table or not, but here it is anyway. > >>>> > >>>> I have a function that imports data from .csv files and cleans the > >>>> data (subsets, converting fields to numeric etc.). As the end of the > >>>> function definition, I convert the resulting data.frame to data.table > >>>> and return the result: > >>>> > >>>> ProcessData <- function(?) { > >>>> ... > >>>> df <- data.table(df) > >>>> df > >>>> } > >>>> > >>>> When I use this function standalone, after > >>>> > >>>> library(data.package) > >>>> > >>>> everything works as expected. However, when I'm defining this > >>>> function as a part of a package and later call it, I'm getting the > >>>> following error: > >>>> > >>>> Error in rbind(deparse.level, ...) : > >>>> could not find function ".rbind.data.table" > >>>> > >>>> Please note that in the package .R files, I'm not importing > >>>> data.table directly with library(data.package) but rather have > >>>> `import(data.table)` statement in my NAMESPACE, as recommended here > >>>> https://github.com/hadley/devtools/wiki/Namespaces. > >>>> > >>>> When I import data.table directly with library(data.table) after > >>>> importing my package, everything works as expected. > >>>> > >>>> I suspect there may be something going wrong with namespaces in > >>>> data.table. > >>>> > >>>> My environment: I'm using R 2.15.3 on Mac and have tested the above > >>>> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to > >>>> provide more info. Any help will be much appreciated! 
> >>>> > >>>> Regards, > >>>> Victor > >>>> > >>>> _______________________________________________ > >>>> datatable-help mailing list > >>>> datatable-help at lists.r-forge.r-project.org > >>>> > >>>> > >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >>> > >>> > >> > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu Mar 7 05:09:29 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 6 Mar 2013 23:09:29 -0500 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Hi, On Wed, Mar 6, 2013 at 10:47 PM, Victor Kryukov wrote: [snip] > I'm puzzled about why data table would include this function/classes (isn't > it better to leave data handling to specialized classes?), but I understand > that there may be a good reason for that. I became a data.table user after IDateTime was in there (and I don't ever use it, actually), but my *guess* would be that it's there to use dates as keys for data.table ... 
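[Editor's note: Steve's guess can be illustrated directly. The sketch below (column names invented) shows why integer-backed IDate columns make convenient, fast-sorting keys, which is the point of the "I" in IDate.]

```r
library(data.table)

# IDate stores a date as an integer day count, unlike base R's double-backed Date.
dt <- data.table(day   = as.IDate(c("2013-03-07", "2013-03-01", "2013-03-04")),
                 value = c(10, 20, 30))
setkey(dt, day)                 # integer storage keeps the sort/key cheap

dt[J(as.IDate("2013-03-04"))]   # keyed (binary search) lookup by date

stopifnot(is.integer(unclass(dt$day)))  # the underlying vector really is integer
```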
> Unfortunately, my code is using > lubridate heavily (it's just too good...), which leaves me in a tough spot - > I would like to use both. If I had to choose, I would be forced to replace > all lubridate code with standard R, which is not fun, but I guess I have to > bite the bullet. You don't have to choose one over the other. I suspect import order could do the trick. Perhaps import()-ing data.table first, then lubridate might be all you have to do. If not, I *think* if you define hour, mday, month, etc. in your package code as: mday <- lubridate::mday hour <- lubridate::hour And ensure that those functions are loaded first (either by using Collate: and specifying that file first, or putting them in a file called aaa.R or something), perhaps your code will recover "just like that" If that doesn't work either, another option is that you just prefix every lubridate call in your package code with the lubridate package name, e.g. instead of `year(whenever)` you do `lubridate::year(whenever)`. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From victor.kryukov at gmail.com Thu Mar 7 05:16:33 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 20:16:33 -0800 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Thanks Steve. That's a good suggestion regarding the order, or fully specifying lubridate names. I actually downloaded the data.table code and looked through it, and unless I'm missing something, IDateTime is totally separate from everything else.
At least if you search for 'IDate' or 'hour' or 'minute', you don't find them mentioned in other .R or .c files besides IDateTime.R And yes - lubridate IS mentioned in IDateTime.R code :) ################################################################### # Date - time extraction functions # Adapted from Hadley Wickham's routines cited below to ensure # integer results. # http://gist.github.com/10238 # See also Hadley's more advanced and complex lubridate package: # http://github.com/hadley/lubridate # lubridate routines do not return integer values. ################################################################### On Wed, Mar 6, 2013 at 8:09 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi, > > On Wed, Mar 6, 2013 at 10:47 PM, Victor Kryukov > wrote: > [snip] > > I'm puzzled about why data table would include this function/classes > (isn't > > it better to leave data handling to specialized classes?), but I > understand > > that there may be a good reason for that. > > I became a data.table user after IDateTime was in there (and I don't > ever use it, actually), but my *guess* would be that it's there to use > dates as keys for data.table ... > > > Unfortunately, my code is using > > lubridate heavily (it's just too good...), which leaves me in a tough > spot - > > I would like to use both. If I had to choose, I would be forced to > replace > > all lubridate code with standard R, which is not fun, but I guess I have > to > > bite the bullet. > > You don't have to choose one over the other. > > I suspect import order could do the trick. Perhaps import()-ing > data.table first, then lubridate might be all you have to do. > > If not, I *think* if you define hour, mday, mont, etc. 
in your package > code as: > > mday <- lubridate::mday > hour <- lubridate::hour > > And ensure that those functions are loaded first (either by using > Collate: and specifying that file first, or putting that in a function > called aaa.R or something), perhaps your code will recover "just like > that" > > If that doesn't work either, another option is that you just prefix > every lubridate call in your package code with the lubridate package > name, eg. instead of `year(whenever)` you do > `lubridate::year(whenever)`. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.kryukov at gmail.com Thu Mar 7 06:22:08 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 21:22:08 -0800 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Update: it looks like the order in NAMESPACE doesn't matter for that particular problem. I can confirm that when I change it, the order of package loading changes, as it's either data.table or lubridate that warns about overwriting the other's functions, but the problem exists in either case. I think my next step will be to perform surgery on data.table by removing all IDateTime from my local copy - I'll see if it helps :). On Wed, Mar 6, 2013 at 8:16 PM, Victor Kryukov wrote: > Thanks Steve. That's a good suggestion regarding the order or fully > specifying lubridate names. > > I actually downloaded data.table code and looked through it, and unless > I'm missing something, IDateTime is totally separate from everything else.
> At least if you search for 'IDate' or 'hour' or 'minute', you don't find > them mentioned in other .R or .c files besides IDateTime.R > > And yes - lubridate IS mentioned in IDateTime.R code :) > > ################################################################### > # Date - time extraction functions > # Adapted from Hadley Wickham's routines cited below to ensure > # integer results. > # http://gist.github.com/10238 > # See also Hadley's more advanced and complex lubridate package: > # http://github.com/hadley/lubridate > # lubridate routines do not return integer values. > ################################################################### > > > On Wed, Mar 6, 2013 at 8:09 PM, Steve Lianoglou < > mailinglist.honeypot at gmail.com> wrote: > >> Hi, >> >> On Wed, Mar 6, 2013 at 10:47 PM, Victor Kryukov >> wrote: >> [snip] >> > I'm puzzled about why data table would include this function/classes >> (isn't >> > it better to leave data handling to specialized classes?), but I >> understand >> > that there may be a good reason for that. >> >> I became a data.table user after IDateTime was in there (and I don't >> ever use it, actually), but my *guess* would be that it's there to use >> dates as keys for data.table ... >> >> > Unfortunately, my code is using >> > lubridate heavily (it's just too good...), which leaves me in a tough >> spot - >> > I would like to use both. If I had to choose, I would be forced to >> replace >> > all lubridate code with standard R, which is not fun, but I guess I >> have to >> > bite the bullet. >> >> You don't have to choose one over the other. >> >> I suspect import order could do the trick. Perhaps import()-ing >> data.table first, then lubridate might be all you have to do. >> >> If not, I *think* if you define hour, mday, mont, etc. 
in your package >> code as: >> >> mday <- lubridate::mday >> hour <- lubridate::hour >> >> And ensure that those functions are loaded first (either by using >> Collate: and specifying that file first, or putting that in a function >> called aaa.R or something), perhaps your code will recover "just like >> that" >> >> If that doesn't work either, another option is that you just prefix >> every lubridate call in your package code with the lubridate package >> name, eg. instead of `year(whenever)` you do >> `lubridate::year(whenever)`. >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu Mar 7 06:40:11 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Thu, 7 Mar 2013 00:40:11 -0500 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Hi, On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov wrote: > Update: it looks the order in NAMESPACE doesn't matter for that particular > problem. I can confirm that when I change it the order of package loading > changes, as it's either data.table or lubridate that warns about > overwritting each other's functions, but the problem exists in either case. > > I think my next steps will be to perform a surgery on data.table by removing > all IDateTime from my local copy - will see if it helps :). It's your prerogative to do what you like, but I feel like the other two alternatives I gave are a bit less intense than what you are proposing, no? 
It also has the bonus feature of not requiring a non-standard data.table install, which is good if you expect anybody else to use your package. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From victor.kryukov at gmail.com Thu Mar 7 06:49:56 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 21:49:56 -0800 Subject: [datatable-help] Building data.table 1.8.9 from sources fails Message-ID: Hello, when I'm trying to build freshly cloned data.table 1.8.9 from source via either 'R CMD build .' in pkg/ directory or calling devtools build(), it fails with the following error. I don't really understand what's going on, as in lines 16-17 in datatable-faq.Rnw we have if (!exists("data.table",.GlobalEnv)) library(data.table) # see Intro.Rnw for comments on these two lines rm(list=as.character(tables()$NAME),envir=.GlobalEnv) and first line should load data.table if not loaded. Even when I load it explicitly in line 16 via library(data.table), i.e. removing if(), it fails with the same error. Any ideas why? 'R CMD check .' finishes without any problems. My systemInfo() is below just in case. Regards, Victor ==== > build() '/usr/local/Cellar/r/2.15.3/R.framework/Resources/bin/R' --vanilla CMD build '/Users/victor/Documents/R/datatable/pkg' --no-manual --no-resave-data * checking for file '/Users/victor/Documents/R/datatable/pkg/DESCRIPTION' ... OK * preparing 'data.table': * checking DESCRIPTION meta-information ... OK * cleaning src * installing the package to re-build vignettes * creating vignettes ... 
ERROR Error: processing vignette 'datatable-faq.Rnw' failed with diagnostics: chunk 1 Error in rm(list = as.character(tables()$NAME), envir = .GlobalEnv) : could not find function "tables" Execution halted Error: Command failed (1) ==== > sessionInfo() R version 2.15.3 (2013-03-01) Platform: x86_64-apple-darwin12.2.1 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.kryukov at gmail.com Thu Mar 7 06:55:26 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 21:55:26 -0800 Subject: [datatable-help] Building data.table 1.8.9 from sources fails In-Reply-To: References: Message-ID: Please ignore this. I've accidentally replaced NAMESPACE by running document(). Everything builds fine from fresh source. I should really go home now... Yours Sincerely, Victor Kryukov On Wed, Mar 6, 2013 at 9:49 PM, Victor Kryukov wrote: > Hello, > > when I'm trying to build freshly cloned data.table 1.8.9 from source via > either 'R CMD build .' in pkg/ directory or calling devtools build(), it > fails with the following error. I don't really understand what's going on, > as in lines 16-17 in datatable-faq.Rnw we have > > if (!exists("data.table",.GlobalEnv)) library(data.table) # see Intro.Rnw > for comments on these two lines > rm(list=as.character(tables()$NAME),envir=.GlobalEnv) > > and first line should load data.table if not loaded. Even when I load it > explicitly in line 16 via library(data.table), i.e. removing if(), it fails > with the same error. > > Any ideas why? > > 'R CMD check .' finishes without any problems. My systemInfo() is below > just in case. 
> > Regards, > Victor > > ==== > > > build() > '/usr/local/Cellar/r/2.15.3/R.framework/Resources/bin/R' --vanilla CMD > build '/Users/victor/Documents/R/datatable/pkg' --no-manual > --no-resave-data > > * checking for file '/Users/victor/Documents/R/datatable/pkg/DESCRIPTION' > ... OK > * preparing 'data.table': > * checking DESCRIPTION meta-information ... OK > * cleaning src > * installing the package to re-build vignettes > * creating vignettes ... ERROR > > Error: processing vignette 'datatable-faq.Rnw' failed with diagnostics: > chunk 1 > Error in rm(list = as.character(tables()$NAME), envir = .GlobalEnv) : > could not find function "tables" > Execution halted > Error: Command failed (1) > > ==== > > > sessionInfo() > R version 2.15.3 (2013-03-01) > Platform: x86_64-apple-darwin12.2.1 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 7 09:55:25 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 08:55:25 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Victor, As Steve says you shouldn't need to do that. If it's just the mask warnings you're trying to suppress have you tried : suppressPackageStartupMessages({ library(...) library(...) }) I haven't used lubdridate before. I tried : > install.packages("lubdridate") Warning message: package ?lubdridate? is not available (for R version 2.15.3) > Seems odd. Anyway: is lubridate fast? As the code comment you pasted said, it stores Date as numeric (type double) doesn't it, as base R does? 
Won't that mean sorting won't be as fast on it? That's the reason IDate exists and what the I stands for. Matthew On 07.03.2013 05:40, Steve Lianoglou wrote: > Hi, > > On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov > wrote: >> Update: it looks the order in NAMESPACE doesn't matter for that >> particular >> problem. I can confirm that when I change it the order of package >> loading >> changes, as it's either data.table or lubridate that warns about >> overwritting each other's functions, but the problem exists in >> either case. >> >> I think my next steps will be to perform a surgery on data.table by >> removing >> all IDateTime from my local copy - will see if it helps :). > > It's your prerogative to do what you like, but I feel like the other > two alternatives I gave are a bit less intense than what you are > proposing, no? > > It also has the bonus feature of not requiring a non-standard > data.table install, which is good if you expect anybody else to use > your package. > > -steve From statquant at outlook.com Thu Mar 7 13:49:51 2013 From: statquant at outlook.com (statquant3) Date: Thu, 7 Mar 2013 04:49:51 -0800 (PST) Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> References: <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Message-ID: <1362660591788-4660598.post@n4.nabble.com> Hello, I do not think lubridate is fast, It just acts as a syntax sweetener. Here is the description of the package by its author : Quote: Lubridate makes it easier to work with dates and times by providing functions to identify and parse date-time data,extract and modify components of a datetime (years, months,days, hours, minutes, and seconds), perform accurate math on date-times, handle time zones and Daylight Savings Time. 
Lubridate has a consistent, memorable syntax, that makes working with dates fun instead of frustrating. As far as I know, no package provides faster datetime handling, even if data.table proposes IDate, ITime... which store integers for fast sorting. (The problem for me is that it does not support sub-seconds, but we spoke about that already.) If any package provides a faster implementation for datetimes I'll be glad to hear about it. -- View this message in context: http://r.789695.n4.nabble.com/Error-in-a-package-that-imports-data-table-tp4660173p4660598.html Sent from the datatable-help mailing list archive at Nabble.com. From saporta at scarletmail.rutgers.edu Thu Mar 7 17:45:02 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 7 Mar 2013 11:45:02 -0500 Subject: [datatable-help] unique, by full row (not just key) Message-ID: I have a keyed data.table, DT, with 800k rows, of which about 0.5% are duplicates that need to be removed. Using unique(DT) of course whittles down the whole table to one row per key. I would like to get results similar to unique.data.frame(DT). Two problems with using unique.data.frame: (1) speed, and (2) loss of key(DT). So instead I'm using a wrapper that (1) caches key(DT), (2) removes the key, (3) calls unique on DT, and (4) then reapplies the key. However, this is convoluted (and also requires modifying setkey(.) and getdots(.)). It occurs to me that I might be overlooking a simpler alternative. Any thoughts? Thanks, Rick _Here is what I am using_: uniqueRows <- function(DT) { # If not keyed (or not a data.table), regular unique(DT) already compares full rows if (!haskey(DT) || !is.data.table(DT) ) return(unique(DT)) .key <- key(DT) setkey(DT, NULL) setkeyE(unique(DT), eval(.key)) } getdotsWithEval <- function () { dots <- as.character(match.call(sys.function(-1), call = sys.call(-1), expand.dots = FALSE)$...)
if (grepl("^eval\\(", dots) && grepl("\\)$", dots)) return(eval(parse(text=dots))) return(dots) } setkeyE <- function (x, ..., verbose = getOption("datatable.verbose")) { # SAME AS setkey(.) WITH ADDITION THAT # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED if (is.character(x)) stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.") #** THIS IS THE MODIFIED LINE **# # OLD**: cols = getdots() cols <- getdotsWithEval() if (!length(cols)) cols = colnames(x) else if (identical(cols, "NULL")) cols = NULL setkeyv(x, cols, verbose = verbose) } -- Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 7 18:03:00 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 17:03:00 +0000 Subject: [datatable-help] unique, by full row (not just key) In-Reply-To: References: Message-ID: <0e76e44156d43f0cef683fd5ec170873@imap.plus.net> Hi, Are the duplicates next to each other in the table? Or could duplicates be within each key, separated by other rows? If duplicates are together, calling data.table:::duplist directly should do it. (see source of data.table:::unique.data.table). It loops through the rows by column and works like diff(x)==0 would i.e. looking at the previous row only, but does compare all columns. If a subset of columns are needed, then maybe a data.table:::shallow followed by column removal of the ones you don't need on that shallow copy (the shallow copy and column removal being instant). Just because duplist doesn't accept a subset of the list of columns it is passed. shallow() is on the agenda to be exported for user use (so suggesting it is an excuse to get you to test it!). Hadn't thought about duplist but could do, too. They are both relied on internally, so should be reliable. 
But as soon as they're exported we can't make non-backwards compatible changes to them. Matthew On 07.03.2013 16:45, Ricardo Saporta wrote: > I have a keyed data.table, DT, with 800k rows, of which about 0.5% are duplicates that need to removed. > Using unique(DT) of course widdles down the whole table to one row per key. > I would like to get results similar to unique.data.frame(DT) > Two problems with using unique.data.frame: (1) Speed (2) loss of key(DT) > So instead Im using a wrapper that > (1) caches key(DT) (2) removes the key (3) calls unique on DT (4) then repplies the key > However, this is convoluted (and also requires modifying setkey(.) and getdots(.)). > It occurs to me that I might be overlooking a simpler alternative. > anythoughts? > Thanks, > Rick > _Here is what I am using_: > uniqueRows > # If already keyed (or not a DT), use regular unique(DT) > if (!haskey(DT) || !is.data.table(x) ) > return(unique(DT)) > .key > setkey(DT, NULL) > setkeyE(unique(DT), eval(.key)) > } > getdotsWithEval > dots > as.character(match.call(sys.function(-1), call = sys.call(-1), > expand.dots = FALSE)$...) > if (grepl("^eval\(", dots) && grepl("\)$", dots)) > return(eval(parse(text=dots))) > return(dots) > } > setkeyE > # SAME AS setkey(.) WITH ADDITION THAT > # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED > if (is.character(x)) > stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.") > #** THIS IS THE MODIFIED LINE **# > # OLD**: cols = getdots() > cols > if (!length(cols)) > cols = colnames(x) > else if (identical(cols, "NULL")) > cols = NULL > setkeyv(x, cols, verbose = verbose) > } -- > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [1] Links: ------ [1] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Thu Mar 7 18:07:28 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 17:07:28 +0000 Subject: [datatable-help] unique, by full row (not just key) In-Reply-To: <0e76e44156d43f0cef683fd5ec170873@imap.plus.net> References: <0e76e44156d43f0cef683fd5ec170873@imap.plus.net> Message-ID: <62cd21531cb6e8953cd952a2283e8693@imap.plus.net> Which means that unique.data.table itself can be improved internally, in the way I just suggested using shallow() ... Most of the time the key will be small so that copy of the key columns to pass to duplist won't be huge, but, still a copy. And could slow down key only tables most, relatively. On 07.03.2013 17:03, Matthew Dowle wrote: > Hi, > > Are the duplicates next to each other in the table? Or could duplicates be within each key, separated by other rows? > > If duplicates are together, calling data.table:::duplist directly should do it. (see source of data.table:::unique.data.table). It loops through the rows by column and works like diff(x)==0 would i.e. looking at the previous row only, but does compare all columns. If a subset of columns are needed, then maybe a data.table:::shallow followed by column removal of the ones you don't need on that shallow copy (the shallow copy and column removal being instant). Just because duplist doesn't accept a subset of the list of columns it is passed. > > shallow() is on the agenda to be exported for user use (so suggesting it is an excuse to get you to test it!). Hadn't thought about duplist but could do, too. They are both relied on internally, so should be reliable. But as soon as they're exported we can't make non-backwards compatible changes to them. > > Matthew > > On 07.03.2013 16:45, Ricardo Saporta wrote: > >> I have a keyed data.table, DT, with 800k rows, of which about 0.5% are duplicates that need to removed. >> Using unique(DT) of course widdles down the whole table to one row per key. 
>> I would like to get results similar to unique.data.frame(DT) >> Two problems with using unique.data.frame: (1) Speed (2) loss of key(DT) >> So instead Im using a wrapper that >> (1) caches key(DT) (2) removes the key (3) calls unique on DT (4) then repplies the key >> However, this is convoluted (and also requires modifying setkey(.) and getdots(.)). >> It occurs to me that I might be overlooking a simpler alternative. >> anythoughts? >> Thanks, >> Rick >> _Here is what I am using_: >> uniqueRows >> # If already keyed (or not a DT), use regular unique(DT) >> if (!haskey(DT) || !is.data.table(x) ) >> return(unique(DT)) >> .key >> setkey(DT, NULL) >> setkeyE(unique(DT), eval(.key)) >> } >> getdotsWithEval >> dots >> as.character(match.call(sys.function(-1), call = sys.call(-1), >> expand.dots = FALSE)$...) >> if (grepl("^eval\(", dots) && grepl("\)$", dots)) >> return(eval(parse(text=dots))) >> return(dots) >> } >> setkeyE >> # SAME AS setkey(.) WITH ADDITION THAT >> # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED >> if (is.character(x)) >> stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.") >> #** THIS IS THE MODIFIED LINE **# >> # OLD**: cols = getdots() >> cols >> if (!length(cols)) >> cols = colnames(x) >> else if (identical(cols, "NULL")) >> cols = NULL >> setkeyv(x, cols, verbose = verbose) >> } -- >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu [1] Links: ------ [1] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From victor.kryukov at gmail.com Thu Mar 7 18:25:31 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Thu, 7 Mar 2013 09:25:31 -0800 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Message-ID: OK, I think I have solved it. The problem seemed to be related to FAQ 2.23. When I was *importing* data.table with 'Imports:', I think what was going on is that R was making functions from data.table's namespace available to my package, but the data.table package itself was not loaded. As a consequence, .onLoad was never called and hence FAQ 2.23's magic never happened. Now my depends section in DESCRIPTION looks like this: Depends: data.table, lubridate and everything seems to work - no error messages about .rbind.data.table not being available, and lubridate's hour, minute etc. mask data.table's, which is what's expected. The order does matter in that case. Thanks to Matthew and Steve for providing support. At least I had a reason to download data.table and poke around its sources; I wish it were available on github... Regards, Victor On Thu, Mar 7, 2013 at 12:55 AM, Matthew Dowle wrote: > > Victor, > > As Steve says you shouldn't need to do that. > > If it's just the mask warnings you're trying to suppress have you tried : > > suppressPackageStartupMessages({ > library(...) > library(...) > }) > > I haven't used lubdridate before. I tried : > > install.packages("lubdridate") >> > Warning message: > package 'lubdridate' is not available (for R version 2.15.3) > >> >> > Seems odd. Anyway: is lubridate fast? As the code comment you pasted > said, it stores Date as numeric (type double) doesn't it, as base R does? > Won't that mean sorting won't be as fast on it?
That's the reason IDate > exists and what the I stands for. > > Matthew > > > > On 07.03.2013 05:40, Steve Lianoglou wrote: > >> Hi, >> >> On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov >> wrote: >> >>> Update: it looks the order in NAMESPACE doesn't matter for that >>> particular >>> problem. I can confirm that when I change it the order of package loading >>> changes, as it's either data.table or lubridate that warns about >>> overwritting each other's functions, but the problem exists in either >>> case. >>> >>> I think my next steps will be to perform a surgery on data.table by >>> removing >>> all IDateTime from my local copy - will see if it helps :). >>> >> >> It's your prerogative to do what you like, but I feel like the other >> two alternatives I gave are a bit less intense than what you are >> proposing, no? >> >> It also has the bonus feature of not requiring a non-standard >> data.table install, which is good if you expect anybody else to use >> your package. >> >> -steve >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 7 18:57:55 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 17:57:55 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Message-ID: Interesting, thanks for update. That's news to me. But then how do the datatable options get set if it's just Imported ? .onLoad sets those options too. Does any other function get run when a package is imported. Is there a .onImport ? That can't be right, otherwise how do datatable options get set for the 3 packages on CRAN that Import data.table? Hm... 
Just to check, you know you can poke around the source (updated in real time) online too : https://r-forge.r-project.org/scm/viewvc.php/?root=datatable [3] Not as pretty as github but just checking you know you can browse there. Matthew On 07.03.2013 17:25, Victor Kryukov wrote: > OK, I think I have solved it. The problem seemed to be related to FAQ 2.23. > > When I was *importing* data.table with 'Imports:', I think what was going on is that R was making functions from data.table's namespace available to my package, but the data.table package itself was not loaded. As a consequence, .onLoad was never called and hense FAQ 2.23's magic never happened. > > Now my depends section in DESCRIPTION looks like this: > > Depends: > data.table, > lubridate > > and everything seems to work - no error messages about .rbind.data.table not available, and lubridate's hour, minute etc. mask data.table's, which is what expected. The order does matter in that case. > > Thanks for Matthew and Steve for providing support. At least I had a reason to downloaded data.table and poke around its sources; wish it was available on github... > > Regards, > Victor > > On Thu, Mar 7, 2013 at 12:55 AM, Matthew Dowle wrote: > >> Victor, >> >> As Steve says you shouldn't need to do that. >> >> If it's just the mask warnings you're trying to suppress have you tried : >> >> suppressPackageStartupMessages({ >> library(...) >> library(...) >> }) >> >> I haven't used lubdridate before. I tried : >> >>> install.packages("lubdridate") >> Warning message: >> package 'lubdridate' is not available (for R version 2.15.3) >> >> Seems odd. Anyway: is lubridate fast? As the code comment you pasted said, it stores Date as numeric (type double) doesn't it, as base R does? Won't that mean sorting won't be as fast on it? That's the reason IDate exists and what the I stands for. 
>> >> Matthew >> >> On 07.03.2013 05:40, Steve Lianoglou wrote: >> >>> Hi, >>> >>> On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov >>> wrote: >>> >>>> Update: it looks the order in NAMESPACE doesn't matter for that particular >>>> problem. I can confirm that when I change it the order of package loading >>>> changes, as it's either data.table or lubridate that warns about >>>> overwritting each other's functions, but the problem exists in either case. >>>> >>>> I think my next steps will be to perform a surgery on data.table by removing >>>> all IDateTime from my local copy - will see if it helps :). >>> >>> It's your prerogative to do what you like, but I feel like the other >>> two alternatives I gave are a bit less intense than what you are >>> proposing, no? >>> >>> It also has the bonus feature of not requiring a non-standard >>> data.table install, which is good if you expect anybody else to use >>> your package. >>> >>> -steve Links: ------ [1] mailto:victor.kryukov at gmail.com [2] mailto:mdowle at mdowle.plus.com [3] https://r-forge.r-project.org/scm/viewvc.php/?root=datatable -------------- next part -------------- An HTML attachment was scrubbed... URL: From lambandme at gmail.com Fri Mar 8 06:25:12 2013 From: lambandme at gmail.com (Yi Yuan) Date: Fri, 8 Mar 2013 00:25:12 -0500 Subject: [datatable-help] How to do binary search on integer key? Message-ID: Hi, all: I know someone has asked exactly the same question and there's even an answer. But I think the answer is wrong. Following is the url of that question http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html so if the key is integer and I would like to select all records where the key=654, how do I do that? suppose the data table is named table, key variable's name is id I know you can do it by writing: table[id==645,], but R will conduct vector search this way and is a lot slower than binary search. So how can I do binary search on integer key?? Thanks!! 
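(An illustrative sketch of the keyed lookup being asked about, added here for reference and not part of the original post; the toy table and sizes below are made up, and the data.table package is assumed to be installed:)

```r
library(data.table)

# Toy table: each integer id appears 18 times; key the table on id
DT <- data.table(id = rep(1:1000, each = 18), val = 1L)
setkey(DT, id)

scan <- DT[id == 645L]   # vector scan: tests id == 645 on every row
bins <- DT[J(645L)]      # binary search: value passed as a one-row join table

nrow(scan)  # 18
nrow(bins)  # 18
```

J(), list() and .() are interchangeable here; passing the value as a join table is what makes data.table use a binary search on the key instead of scanning the whole column.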
-------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.nelson at sydney.edu.au Fri Mar 8 06:42:00 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Fri, 8 Mar 2013 05:42:00 +0000 Subject: [datatable-help] How to do binary search on integer key? In-Reply-To: References: Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au>

You just need to wrap the values in `list()` or `.()`, e.g.

table[list(645)]
table[.(645)]

________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Yi Yuan [lambandme at gmail.com] Sent: Friday, 8 March 2013 4:25 PM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] How to do binary search on integer key?

Hi, all:

I know someone has asked exactly the same question and there's even an answer. But I think the answer is wrong. Following is the url of that question http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html

so if the key is integer and I would like to select all records where the key=654, how do I do that? suppose the data table is named table, key variable's name is id

I know you can do it by writing: table[id==645,], but R will conduct vector search this way and is a lot slower than binary search.

So how can I do binary search on integer key??

Thanks!!

-------------- next part -------------- An HTML attachment was scrubbed... URL: From lambandme at gmail.com Fri Mar 8 06:48:08 2013 From: lambandme at gmail.com (Yi Yuan) Date: Fri, 8 Mar 2013 00:48:08 -0500 Subject: [datatable-help] How to do binary search on integer key?
In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: I tried table[list(645)] table[.(645)] table[J(45)] they're all returning 78 records when in fact there should only be 18 records related to key 645. However if I use table[id==645,], I get the right result. On Fri, Mar 8, 2013 at 12:42 AM, Michael Nelson < michael.nelson at sydney.edu.au> wrote: > you just need to wrap the values in `list()` or `.() > > > eg > > table[list(645)] > > table[.(645)] > ------------------------------ > *From:* datatable-help-bounces at lists.r-forge.r-project.org [ > datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Yi Yuan [ > lambandme at gmail.com] > *Sent:* Friday, 8 March 2013 4:25 PM > *To:* datatable-help at lists.r-forge.r-project.org > *Subject:* [datatable-help] How to do binary search on integer key? > > Hi, all: > I know someone has asked exactly the same question and there's even an > answer. But I think the answer is wrong. Following is the url of that > question > http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html > > so if the key is integer and I would like to select all records where > the key=654, how do I do that? > suppose the data table is named table, key variable's name is id > > I know you can do it by writing: table[id==645,], but R will conduct > vector search this way and is a lot slower than binary search. > > So how can I do binary search on integer key?? > > Thanks!! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Mar 8 09:10:40 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 8 Mar 2013 08:10:40 -0000 Subject: [datatable-help] How to do binary search on integer key? 
In-Reply-To: References: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <0d9770e89cac54ee8504c4fb4f20a0e2.squirrel@webmail.plus.net>

I assume that 45 is a typo and should be 645. All I can think is that you're using an architecture not covered by CRAN. Only 32bit and 64bit on Unix, Mac or Windows is covered, not anything else.

Please provide the output of:

sessionInfo()
test.data.table()

Also try again in a fresh R session. After any slight memory corruption in any package, strange things can happen. Finally, do make sure to use the latest version of data.table (1.8.8) to save us time in supporting you. Only if what you said isn't true, and the key column is in fact double rather than integer, with NAs in it too, can I guess that the bug fixes in 1.8.8 would be in play.

Matthew

> I tried
>
> table[list(645)]
>
> table[.(645)]
>
> table[J(45)]
>
> they're all returning 78 records when in fact there should only be 18
> records related to key 645. However if I use table[id==645,], I get the
> right result.
>
> On Fri, Mar 8, 2013 at 12:42 AM, Michael Nelson <
> michael.nelson at sydney.edu.au> wrote:
>
>> you just need to wrap the values in `list()` or `.()
>>
>> eg
>>
>> table[list(645)]
>>
>> table[.(645)]
>> ------------------------------
>> *From:* datatable-help-bounces at lists.r-forge.r-project.org [
>> datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Yi Yuan [
>> lambandme at gmail.com]
>> *Sent:* Friday, 8 March 2013 4:25 PM
>> *To:* datatable-help at lists.r-forge.r-project.org
>> *Subject:* [datatable-help] How to do binary search on integer key?
>>
>> Hi, all:
>> I know someone has asked exactly the same question and there's even an
>> answer. But I think the answer is wrong.
Following is the url of that
>> question
>> http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html
>>
>> so if the key is integer and I would like to select all records where
>> the key=654, how do I do that?
>> suppose the data table is named table, key variable's name is id
>>
>> I know you can do it by writing: table[id==645,], but R will conduct
>> vector search this way and is a lot slower than binary search.
>>
>> So how can I do binary search on integer key??
>>
>> Thanks!!
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From statquant at outlook.com Mon Mar 11 13:40:26 2013 From: statquant at outlook.com (stat quant) Date: Mon, 11 Mar 2013 13:40:26 +0100 Subject: [datatable-help] fread suggestion Message-ID:

Hello list,

We like fread because it is very fast, yet sometimes files are huge and R cannot handle that much data; some packages work around this limitation, but they do not provide a function similar to fread. Yet sometimes only subsets of a file are really needed, subsets that could fit into RAM.

So what about adding a grep option to fread that would load only the lines that match a regular expression?

I'll add a request if you think the idea is worth implementing.

Cheers

-------------- next part -------------- An HTML attachment was scrubbed... URL: From micheledemeo at gmail.com Mon Mar 11 13:53:19 2013 From: micheledemeo at gmail.com (MICHELE DE MEO) Date: Mon, 11 Mar 2013 13:53:19 +0100 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID:

Very interesting request. I also would be interested in this possibility.
Cheers

2013/3/11 stat quant
> Hello list,
> We like FREAD because it is very fast, yet sometimes files are huge and R
> cannot handle that much data, some packages handle this limitation but they
> do not provide a similar to fread function.
> Yet sometimes only subsets of a file is really needed, subsets that could
> fit into RAM.
>
> So what about adding a grep option to fread that would allow to load only
> lines that matches a regular expression?
>
> I'll add a request if you think the idea is worth implementing.
>
> Cheers
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
***************************************************************
*Michele De Meo, Ph.D*
*Statistical and data mining solutions
http://micheledemeo.blogspot.com/
skype: demeo.michele*

-------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Mar 11 14:09:29 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 11 Mar 2013 13:09:29 +0000 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID:

Good idea statquant, please file it then. How about something more general, e.g.

fread(input, chunk.nrows=10000, chunk.filter = <expression applicable to i of DT[i]>)

That could be grep() or any expression of column names. It wouldn't be efficient to call that for every row one by one, and similarly it couldn't be called for the whole DT, since the point is that DT is greater than RAM. So some batch size needs to be defined, hence chunk.nrows=10000. That filter would then be called for each chunk and any rows passing would make it into the final table.

read.ffdf has something like this I believe, and Jens already suggested that when I ran the timings in example(fread) past him. We should probably follow his lead on that in terms of argument names etc.
Perhaps chunk should be defined in terms of RAM, e.g. chunk=100MB, since that is how it needs to be handled internally, in terms of the number of pages to map. Or maybe both, so either nrows or MB would be acceptable.

Ultimately (maybe in 5 years!) we're heading towards fread reading into on-disk tables rather than RAM. Filtering in chunks will always be a good option to have though, even then, as you might want to filter what makes it to the on-disk table.

Matthew

On 11.03.2013 12:53, MICHELE DE MEO wrote:
> Very interesting request. I also would be interested in this possibility.
> Cheers
>
> 2013/3/11 stat quant
>
>> Hello list,
>> We like FREAD because it is very fast, yet sometimes files are huge and R cannot handle that much data, some packages handle this limitation but they do not provide a similar to fread function.
>> Yet sometimes only subsets of a file is really needed, subsets that could fit into RAM.
>>
>> So what about adding a grep option to fread that would allow to load only lines that matches a regular expression?
>>
>> I'll add a request if you think the idea is worth implementing.
>>
>> Cheers
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org [1]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
>
> --
>
> _*************************************************************_
> _MICHELE DE MEO, PH.D_
> Statistical and data mining solutions
> http://micheledemeo.blogspot.com/ [4]
> skype: demeo.michele

Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:statquant at outlook.com [4] http://micheledemeo.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed...
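(In the meantime, the grep-style filtering discussed above can be approximated in plain R by reading the file in chunks and keeping only matching lines before parsing. A rough sketch added for illustration; `read_filtered_csv` and its `chunk.nrows` argument are made-up names, not part of fread:)

```r
# Hypothetical helper: read a csv keeping only lines that match `pattern`,
# scanning chunk.nrows lines at a time so the whole file never sits in RAM.
read_filtered_csv <- function(file, pattern, chunk.nrows = 10000L) {
  con <- file(file, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)          # assume the first line is a header
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk.nrows)
    if (length(lines) == 0L) break
    kept <- c(kept, grep(pattern, lines, value = TRUE))
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}
```

Only the header plus the matching lines are ever parsed, so peak memory is roughly one chunk plus whatever the filter keeps.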
URL: From statquant at outlook.com Mon Mar 11 15:12:23 2013 From: statquant at outlook.com (stat quant) Date: Mon, 11 Mar 2013 15:12:23 +0100 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID:

Filed as #2605.

About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read files larger than RAM)? Wouldn't RAM always be quicker?

I think data.table::fread is priceless because it is way faster than any other read function. I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well, why don't I do it if it is so easy...)

2013/3/11 stat quant
> On my way to fill it in.
>
> About your ultimate goal... why would you want on-disk tables rather than
> RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be
> quicker ?
>
> I think data.table::fread is priceless because it is way faster than any
> other read function.
> I just benchmarked fread reading a csv file against R loading its own
> .RData binary format, and shockingly fread is much faster!
> I think it is too bad R doesn't provide a very fast way of loading objects
> saved from a previous R session (well why don't I do it if it is so easy...)
>
> 2013/3/11 Matthew Dowle
>
>> Good idea statquant, please file it then. How about something more
>> general e.g.
>>
>> fread(input, chunk.nrows=10000, chunk.filter = <expression applicable to i of DT[i]>)
>>
>> That could be grep() or any expression of column names. It
>> wouldn't be efficient to call that for every row one by one and similarly
>> couldn't be called for the whole DT, since the point is that DT is greater
>> than RAM. So some batch size need be defined hence chunk.nrows=10000.
>> That filter would then be called for each chunk and any rows passing would
>> make it into the final table.
>> >> read.ffdf has something like this I believe, and Jens already suggested >> that when I ran the timings in example(fread) past him. We should probably >> follow his lead on that in terms of argument names etc. >> >> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB. Since >> that is how it needs to be internally, in terms of number of pages to map. >> Or maybe both as nrows or MB would be acceptable. >> >> Ultimately (maybe in 5 years!) we're heading towards fread reading into >> on-disk tables rather than RAM. Filtering in chunks will always be a good >> option to have though, even then, as you might want to filter what makes it >> to the on-disk table. >> >> Matthew >> >> >> >> On 11.03.2013 12:53, MICHELE DE MEO wrote: >> >> Very interesting request. I also would be interested in this possibility. >> Cheers >> >> >> 2013/3/11 stat quant >> >>> Hello list, >>> We like FREAD because it is very fast, yet sometimes files are huge and >>> R cannot handle that much data, some packages handle this limitation but >>> they do not provide a similar to fread function. >>> Yet sometimes only subsets of a file is really needed, subsets that >>> could fit into RAM. >>> >>> So what about adding a grep option to fread that would allow to load >>> only lines that matches a regular expression? >>> >>> I'll add a request if you think the idea is worth implementing. >>> >>> Cheers >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> -- >> *************************************************************** >> *Michele De Meo, Ph.D* >> *Statistical and data mining solutions >> http://micheledemeo.blogspot.com/ >> skype: demeo.michele* >> * >> * >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Mon Mar 11 15:51:01 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 11 Mar 2013 14:51:01 +0000 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID: <711bb8b59cd6c3abfa5fb135e79461b6@imap.plus.net> Exactly RAM would always be quicker. But maybe you want to read data from on-disk data.table using data.table syntax, rather than some other database or flat text file. i.e. on-disk data.table would not need to fit in RAM. Benchmark sounds intriguing. Please share if you can. compress=TRUE by default so maybe the decompress takes time, though. On 11.03.2013 14:12, stat quant wrote: > Filled as #2605 > About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? > I think data.table::fread is priceless because it is way faster than any other read function. > I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! > I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) > > 2013/3/11 stat quant > >> On my way to fill it in. >> >> About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? >> >> I think data.table::fread is priceless because it is way faster than any other read function. >> I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! >> I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) >> >> 2013/3/11 Matthew Dowle >> >>> Good idea statquant, please file it then. How about something more general e.g. 
>>> >>> fread(input, chunk.nrows=10000, chunk.filter =) >>> >>> Thatcould be grep() or any expression of column names. It wouldn't be efficient to call that for every row one by one and similarly couldn't be called for the whole DT, since the point is that DT is greater than RAM. So some batch size need be defined hence chunk.nrows=10000. That filter would then be called for each chunk and any rows passing would make it into the final table. >>> >>> read.ffdf has something like this I believe, and Jens already suggested that when I ran the timings in example(fread) past him. We should probably follow his lead on that in terms of argument names etc. >>> >>> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB. Since that is how it needs to be internally, in terms of number of pages to map. Or maybe both as nrows or MB would be acceptable. >>> >>> Ultimately (maybe in 5 years!) we're heading towards fread reading into on-disk tables rather than RAM. Filtering in chunks will always be a good option to have though, even then, as you might want to filter what makes it to the on-disk table. >>> >>> Matthew >>> >>> On 11.03.2013 12:53, MICHELE DE MEO wrote: >>> >>>> Very interesting request. I also would be interested in this possibility. >>>> Cheers >>>> >>>> 2013/3/11 stat quant >>>> >>>>> Hello list, >>>>> We like FREAD because it is very fast, yet sometimes files are huge and R cannot handle that much data, some packages handle this limitation but they do not provide a similar to fread function. >>>>> Yet sometimes only subsets of a file is really needed, subsets that could fit into RAM. >>>>> >>>>> So what about adding a grep option to fread that would allow to load only lines that matches a regular expression? >>>>> >>>>> I'll add a request if you think the idea is worth implementing. 
>>>>> >>>>> Cheers >>>>> >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>> >>>> -- >>>> >>>> _*************************************************************_ >>>> _MICHELE DE MEO, PH.D_ >>>> Statistical and data mining solutions >>>> http://micheledemeo.blogspot.com/ [4] >>>> skype: demeo.michele Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:statquant at outlook.com [4] http://micheledemeo.blogspot.com/ [5] mailto:mdowle at mdowle.plus.com [6] mailto:mail.statquant at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Mar 11 16:10:32 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 11 Mar 2013 15:10:32 +0000 Subject: [datatable-help] fread suggestion In-Reply-To: <711bb8b59cd6c3abfa5fb135e79461b6@imap.plus.net> References: <711bb8b59cd6c3abfa5fb135e79461b6@imap.plus.net> Message-ID:

Also, fread works by first memory mapping the file. The first time it does this for a particular file is therefore slower (you may have noticed the longer pause the first time before the percentage counter starts). The time to memory map is reported when verbose=TRUE (but you need the formatting fix in v1.8.9 to see that time, as the formatted number is messed up in v1.8.8). If you repeat the same fread call again it won't spend as long memory mapping since it's already mapped, depending on whether you did anything else memory intensive on this computer/server in the meantime. I don't know if base R's load() memory maps, but if it doesn't it'll need to read from disk each time. So to be strictly fair, the time to compare is a "cold" read after a reboot and the first run only of fread.
But in practice we often do tend to read the same file several times, so fread benefits from this. The OS caches the file in RAM for you, basically. It might do this anyway. It's all very OS and usage dependent! It may also depend on how your particular R environment has been compiled. I don't think a fresh R session is enough to reproduce this effect. You need a reboot as it's the OS that caches/maps the file, not R/data.table. So in short - to report the very fast time along with the time to memory map file from cold, would be the fairest and most complete way to compare. Matthew On 11.03.2013 14:51, Matthew Dowle wrote: > Exactly RAM would always be quicker. But maybe you want to read data from on-disk data.table using data.table syntax, rather than some other database or flat text file. i.e. on-disk data.table would not need to fit in RAM. > > Benchmark sounds intriguing. Please share if you can. compress=TRUE by default so maybe the decompress takes time, though. > > On 11.03.2013 14:12, stat quant wrote: > >> Filled as #2605 >> About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? >> I think data.table::fread is priceless because it is way faster than any other read function. >> I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! >> I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) >> >> 2013/3/11 stat quant >> >>> On my way to fill it in. >>> >>> About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? >>> >>> I think data.table::fread is priceless because it is way faster than any other read function. 
>>> I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! >>> I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) >>> >>> 2013/3/11 Matthew Dowle >>> >>>> Good idea statquant, please file it then. How about something more general e.g. >>>> >>>> fread(input, chunk.nrows=10000, chunk.filter =) >>>> >>>> Thatcould be grep() or any expression of column names. It wouldn't be efficient to call that for every row one by one and similarly couldn't be called for the whole DT, since the point is that DT is greater than RAM. So some batch size need be defined hence chunk.nrows=10000. That filter would then be called for each chunk and any rows passing would make it into the final table. >>>> >>>> read.ffdf has something like this I believe, and Jens already suggested that when I ran the timings in example(fread) past him. We should probably follow his lead on that in terms of argument names etc. >>>> >>>> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB. Since that is how it needs to be internally, in terms of number of pages to map. Or maybe both as nrows or MB would be acceptable. >>>> >>>> Ultimately (maybe in 5 years!) we're heading towards fread reading into on-disk tables rather than RAM. Filtering in chunks will always be a good option to have though, even then, as you might want to filter what makes it to the on-disk table. >>>> >>>> Matthew >>>> >>>> On 11.03.2013 12:53, MICHELE DE MEO wrote: >>>> >>>>> Very interesting request. I also would be interested in this possibility. >>>>> Cheers >>>>> >>>>> 2013/3/11 stat quant >>>>> >>>>>> Hello list, >>>>>> We like FREAD because it is very fast, yet sometimes files are huge and R cannot handle that much data, some packages handle this limitation but they do not provide a similar to fread function. 
>>>>>> Yet sometimes only subsets of a file is really needed, subsets that could fit into RAM.
>>>>>>
>>>>>> So what about adding a grep option to fread that would allow to load only lines that matches a regular expression?
>>>>>>
>>>>>> I'll add a request if you think the idea is worth implementing.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org [1]
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
>>>>>
>>>>> --
>>>>>
>>>>> _*************************************************************_
>>>>> _MICHELE DE MEO, PH.D_
>>>>> Statistical and data mining solutions
>>>>> http://micheledemeo.blogspot.com/ [4]
>>>>> skype: demeo.michele

Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:statquant at outlook.com [4] http://micheledemeo.blogspot.com/ [5] mailto:mdowle at mdowle.plus.com [6] mailto:mail.statquant at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Mar 19 17:51:45 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 19 Mar 2013 16:51:45 +0000 Subject: [datatable-help] See you at R/Finance 2013? Message-ID: <0303e9151b7b69d88f742ea4375e5b58@imap.plus.net>

Dear datatablers,

I'll be giving a 1-hour tutorial on the Friday and a lightning talk on the Saturday.

http://www.rinfinance.com/agenda/

Hope to see you there!

Matthew

From statquant at outlook.com Wed Mar 20 16:46:16 2013 From: statquant at outlook.com (stat quant) Date: Wed, 20 Mar 2013 16:46:16 +0100 Subject: [datatable-help] datatable-help Digest, Vol 37, Issue 13 In-Reply-To: References: Message-ID:

Too bad I can't be there, hopefully we'll have a video!
Best of luck for the presentation (but no pressure ;)) 2013/3/20 > Send datatable-help mailing list submissions to > datatable-help at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > or, via email, send a message with subject or body 'help' to > datatable-help-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > datatable-help-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of datatable-help digest..." > > > Today's Topics: > > 1. See you at R/Finance 2013? (Matthew Dowle) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 19 Mar 2013 16:51:45 +0000 > From: Matthew Dowle > To: > Subject: [datatable-help] See you at R/Finance 2013? > Message-ID: <0303e9151b7b69d88f742ea4375e5b58 at imap.plus.net> > Content-Type: text/plain; charset=UTF-8; format=flowed > > > Dear datatablers, > > I'll be giving a 1 hour tutorial on the Friday and a lightning talk on > the Saturday. > > http://www.rinfinance.com/agenda/ > > Hope to see you there! > > Matthew > > > > > ------------------------------ > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > End of datatable-help Digest, Vol 37, Issue 13 > ********************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ekbrown at ksu.edu Fri Mar 22 03:39:43 2013 From: ekbrown at ksu.edu (ekbrown) Date: Thu, 21 Mar 2013 19:39:43 -0700 (PDT) Subject: [datatable-help] Quicker w/o keys set Message-ID: <1363919983128-4662157.post@n4.nabble.com> Hello. I'm new to data.table(). 
I am apparently not setting the keys correctly to get the increase in speed talked about in the vignettes, as I get a (much) quicker time *without* keys set. Take a look at the following benchmarking tests. Any ideas? Thanks. Earl Brown > library("data.table") > library("rbenchmark") > > # generates random data > num.files <- 2000 > num.words <- 1000000 > logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T) > file.names <- rep(1:num.files, length.out=num.words) > > # defines functions > benDTNoKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + dt[,sum(V1), by = bb][,V1] + } > > benDTWithKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + setkey(dt) + dt[,sum(V1), by = bb][,V1] + } > > benTapply <- function(aa, bb) tapply(aa, bb, sum) > > # runs benchmarking > benchmark(benTapply(logical.vector, file.names), > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, > file.names), replications = 10, columns = c("test", "replications", > "elapsed")) test replications elapsed 3 benDTNoKey(logical.vector, file.names) 10 *0.753* 2 benDTWithKey(logical.vector, file.names) 10 *4.776* 1 benTapply(logical.vector, file.names) 10 6.218 > > # tests for sameness among results > one <- benTapply(logical.vector, file.names) > two <- benDTWithKey(logical.vector, file.names) > three <- benDTNoKey(logical.vector, file.names) > identical(as.integer(one), as.integer(two)) [1] TRUE > identical(as.integer(two), as.integer(three)) [1] TRUE -- View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html Sent from the datatable-help mailing list archive at Nabble.com. 
From michael.nelson at sydney.edu.au Fri Mar 22 03:43:11 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Fri, 22 Mar 2013 02:43:11 +0000 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: <1363919983128-4662157.post@n4.nabble.com> References: <1363919983128-4662157.post@n4.nabble.com> Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD62B73A09@EX-MBX-PRO-03.mcs.usyd.edu.au> Don't include the key setting within the benchmark. ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of ekbrown [ekbrown at ksu.edu] Sent: Friday, 22 March 2013 1:39 PM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] Quicker w/o keys set Hello. I'm new to data.table(). I am apparently not setting the keys correctly to get the increase in speed talked about in the vignettes, as I get a (much) quicker time *without* keys set. Take a look at the following benchmarking tests. Any ideas? Thanks. 
Earl Brown > library("data.table") > library("rbenchmark") > > # generates random data > num.files <- 2000 > num.words <- 1000000 > logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T) > file.names <- rep(1:num.files, length.out=num.words) > > # defines functions > benDTNoKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + dt[,sum(V1), by = bb][,V1] + } > > benDTWithKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + setkey(dt) + dt[,sum(V1), by = bb][,V1] + } > > benTapply <- function(aa, bb) tapply(aa, bb, sum) > > # runs benchmarking > benchmark(benTapply(logical.vector, file.names), > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, > file.names), replications = 10, columns = c("test", "replications", > "elapsed")) test replications elapsed 3 benDTNoKey(logical.vector, file.names) 10 *0.753* 2 benDTWithKey(logical.vector, file.names) 10 *4.776* 1 benTapply(logical.vector, file.names) 10 6.218 > > # tests for sameness among results > one <- benTapply(logical.vector, file.names) > two <- benDTWithKey(logical.vector, file.names) > three <- benDTNoKey(logical.vector, file.names) > identical(as.integer(one), as.integer(two)) [1] TRUE > identical(as.integer(two), as.integer(three)) [1] TRUE -- View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html Sent from the datatable-help mailing list archive at Nabble.com. 
_______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From saporta at scarletmail.rutgers.edu Fri Mar 22 05:31:01 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 22 Mar 2013 00:31:01 -0400 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: <1363919983128-4662157.post@n4.nabble.com> References: <1363919983128-4662157.post@n4.nabble.com> Message-ID: When you set the key, it sorts the table -- this is part of what allows for the speed. This initial sorting is what is slowing down your benchmarks. While it makes sense to include the initial sort time if you are trying to get a 'full' comparison, in most practical applications, you will only be setting the key once. Therefore, if you want to see what sort of speed increases you are actually getting, create your DTs first, then benchmark the specific operations of interest. Also, searching stackoverflow for [r] data.table and benchmarks will produce several useful results. Cheers Rick On Thursday, March 21, 2013, ekbrown wrote: > Hello. I'm new to data.table(). I am apparently not setting the keys > correctly to get the increase in speed talked about in the vignettes, as I > get a (much) quicker time *without* keys set. Take a look at the following > benchmarking tests. Any ideas? Thanks. 
Earl Brown > > > library("data.table") > > library("rbenchmark") > > > > # generates random data > > num.files <- 2000 > > num.words <- 1000000 > > logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T) > > file.names <- rep(1:num.files, length.out=num.words) > > > > # defines functions > > benDTNoKey <- function(aa, bb) { > + dt <- data.table(as.numeric(aa), bb) > + dt[,sum(V1), by = bb][,V1] > + } > > > > benDTWithKey <- function(aa, bb) { > + dt <- data.table(as.numeric(aa), bb) > + setkey(dt) > + dt[,sum(V1), by = bb][,V1] > + } > > > > benTapply <- function(aa, bb) tapply(aa, bb, sum) > > > > # runs benchmarking > > benchmark(benTapply(logical.vector, file.names), > > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, > > file.names), replications = 10, columns = c("test", "replications", > > "elapsed")) > test replications elapsed > 3 benDTNoKey(logical.vector, file.names) 10 *0.753* > 2 benDTWithKey(logical.vector, file.names) 10 *4.776* > 1 benTapply(logical.vector, file.names) 10 6.218 > > > > # tests for sameness among results > > one <- benTapply(logical.vector, file.names) > > two <- benDTWithKey(logical.vector, file.names) > > three <- benDTNoKey(logical.vector, file.names) > > identical(as.integer(one), as.integer(two)) > [1] TRUE > > identical(as.integer(two), as.integer(three)) > [1] TRUE > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
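Rick's "sort once, look up many times" point can be sketched in base R, with `findInterval()` standing in for data.table's binary search. This is purely illustrative of the amortization argument, not of data.table's internals:

```r
# Rick's point sketched in base R: pay the sort cost once, then every
# subsequent lookup is a cheap binary search (findInterval() here is a
# stand-in for data.table's keyed lookup, not its actual implementation).
x <- sample.int(1e6)      # unsorted data: a permutation of 1..1e6
key <- sort(x)            # the one-off cost, analogous to setkey()

# Each later lookup is O(log n) via binary search, instead of the O(n)
# scan that which(x == k) would need on the unsorted vector.
idx <- findInterval(123456L, key)
stopifnot(key[idx] == 123456L)
```

Benchmarking only the lookup step, after the sort, is what shows the keyed speedup Rick describes.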
URL: From mdowle at mdowle.plus.com Fri Mar 22 12:05:18 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 22 Mar 2013 11:05:18 +0000 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: References: <1363919983128-4662157.post@n4.nabble.com> Message-ID: <2e6e5ee66ef4b0c66dfb789f33a83e67@imap.plus.net> Whilst what Rick and Michael said is very true, I suspect that you've found that setting a key on a *numeric* type column is much slower than setkey on an *integer* column. There was an awful (but correct) benchmark on S.O. recently and that's what I replied there, but I can't find it now. All I can think is that the OP deleted the question, which would be a shame. If that OP is watching, and that is what happened, please can they undelete it. Also you have a setkey(DT) there, with no columns specified. In that case, it will key all the columns; think of a key-only table. But if you have numeric value columns in there as well, or any non-key columns at all, then that will be wasteful. Anyway, in the code you posted, try changing as.numeric(aa) to as.integer(aa) and you should see setkey run dramatically faster. Then what Rick and Michael said applies from there. Matthew On 22.03.2013 04:31, Ricardo Saporta wrote: > When you set the key, it sorts the table -- this is part of what allows for the speed. > This initial sorting is what is slowing down your benchmarks. > > While it makes sense to include the initial sort time if you are trying to get a 'full' comparison, in most practical applications, you will only be setting the key once. > > Therefore, if you want to see what sort of speed increases you are actually getting, create your DTs first, then benchmark the specific operations of interest. > > Also, searching stackoverflow for [r] data.table and benchmarks will produce several useful results. > > Cheers > Rick > > On Thursday, March 21, 2013, ekbrown wrote: > >> Hello. I'm new to data.table(). 
I am apparently not setting the keys >> correctly to get the increase in speed talked about in the vignettes, as I >> get a (much) quicker time *without* keys set. Take a look at the following >> benchmarking tests. Any ideas? Thanks. Earl Brown >> >> > library("data.table") >> > library("rbenchmark") >> > >> > # generates random data >> > num.files > num.words > logical.vector > file.names > >> > # defines functions >> > benDTNoKey + dt + dt[,sum(V1), by = bb][,V1] >> + } >> > >> > benDTWithKey + dt + setkey(dt) >> + dt[,sum(V1), by = bb][,V1] >> + } >> > >> > benTapply > >> > # runs benchmarking >> > benchmark(benTapply(logical.vector, file.names), >> > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, >> > file.names), replications = 10, columns = c("test", "replications", >> > "elapsed")) >> test replications elapsed >> 3 benDTNoKey(logical.vector, file.names) 10 *0.753* >> 2 benDTWithKey(logical.vector, file.names) 10 *4.776* >> 1 benTapply(logical.vector, file.names) 10 6.218 >> > >> > # tests for sameness among results >> > one > two > three > identical(as.integer(one), as.integer(two)) >> [1] TRUE >> > identical(as.integer(two), as.integer(three)) >> [1] TRUE >> >> -- >> View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [1] >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] > > -- > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [3] Links: ------ [1] http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
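Matthew's integer-versus-numeric point can be seen outside data.table too: the values are identical either way, but the integer vector is half the size per element, and (at the time of this thread) data.table's radix-style key sort applied to integer columns only. A base-R sketch, with no timings asserted since they vary by machine:

```r
# Integer vs numeric key columns: same values, different storage cost.
# At the time of this thread, data.table's setkey used a fast radix/counting
# sort for integer columns only, which is why as.integer(aa) keys so much
# faster than as.numeric(aa). Actual timings depend on the machine.
xi <- sample.int(1e6)     # integer key column (4 bytes per element)
xd <- as.numeric(xi)      # same values stored as doubles (8 bytes per element)

object.size(xi)           # roughly half of...
object.size(xd)
stopifnot(identical(order(xi), order(xd)))  # the ordering is the same either way
```

So the change Matthew suggests costs nothing in correctness; only the sort strategy (and memory traffic) differs.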
URL: From mdowle at mdowle.plus.com Fri Mar 22 13:01:06 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 22 Mar 2013 12:01:06 +0000 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: <2e6e5ee66ef4b0c66dfb789f33a83e67@imap.plus.net> References: <1363919983128-4662157.post@n4.nabble.com> <2e6e5ee66ef4b0c66dfb789f33a83e67@imap.plus.net> Message-ID: <90d30047cccb59d3a651e70402cebf65@imap.plus.net> And this nice answer by Michael might be of interest too: http://stackoverflow.com/a/13694673/403310 On 22.03.2013 11:05, Matthew Dowle wrote: > Whilst what Rick and Michael said is very true, I suspect that you've found that setting a key on a *numeric* type column is much slower than setkey on an *integer* column. There was an awful (but correct) benchmark on S.O. recently and that's what I replied there, but I can't find it now. All I can think is that the OP deleted the question, which would be a shame. If that OP is watching, and that is what happened, please can they undelete it. > > Also you have a setkey(DT) there, with no columns specified. In that case, it will key all the columns; think of a key-only table. But if you have numeric value columns in there as well, or any non-key columns at all, then that will be wasteful. > > Anyway, in the code you posted, try changing > > as.numeric(aa) > > to > > as.integer(aa) > > and you should see setkey run dramatically faster. Then what Rick and Michael said applies from there. > > Matthew > > On 22.03.2013 04:31, Ricardo Saporta wrote: > >> When you set the key, it sorts the table -- this is part of what allows for the speed. >> This initial sorting is what is slowing down your benchmarks. >> >> While it makes sense to include the initial sort time if you are trying to get a 'full' comparison, in most practical applications, you will only be setting the key once. 
>> >> Therefore, if you want to see what sort of speed increases you are actually getting, create your DT's first, then benchmark the specific operations of interest. >> >> Also, searching stackoverflow for [r] data.table and benchmarks will produce several useful results >> >> Cheers >> Rick >> >> On Thursday, March 21, 2013, ekbrown wrote: >> >>> Hello. I'm new to data.table(). I am apparently not setting the keys >>> correctly to get the increase in speed talked about in the vignettes, as I >>> get a (much) quicker time *without* keys set. Take a look at the following >>> benchmarking tests. Any ideas? Thanks. Earl Brown >>> >>> > library("data.table") >>> > library("rbenchmark") >>> > >>> > # generates random data >>> > num.files > num.words > logical.vector > file.names > >>> > # defines functions >>> > benDTNoKey + dt + dt[,sum(V1), by = bb][,V1] >>> + } >>> > >>> > benDTWithKey + dt + setkey(dt) >>> + dt[,sum(V1), by = bb][,V1] >>> + } >>> > >>> > benTapply > >>> > # runs benchmarking >>> > benchmark(benTapply(logical.vector, file.names), >>> > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, >>> > file.names), replications = 10, columns = c("test", "replications", >>> > "elapsed")) >>> test replications elapsed >>> 3 benDTNoKey(logical.vector, file.names) 10 *0.753* >>> 2 benDTWithKey(logical.vector, file.names) 10 *4.776* >>> 1 benTapply(logical.vector, file.names) 10 6.218 >>> > >>> > # tests for sameness among results >>> > one > two > three > identical(as.integer(one), as.integer(two)) >>> [1] TRUE >>> > identical(as.integer(two), as.integer(three)) >>> [1] TRUE >>> >>> -- >>> View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [1] >>> Sent from the datatable-help mailing list archive at Nabble.com. 
>>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >> >> -- >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu [3] Links: ------ [1] http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From s_milberg at hotmail.com Fri Mar 22 23:23:06 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Fri, 22 Mar 2013 18:23:06 -0400 Subject: [datatable-help] data.table and cbind() Message-ID: I've recently discovered the dramatic performance improvements data.table provides over ddply() and merge(), and I'm looking forward to integrating it into my work. While messing around with benchmarks, I ran into an unexpected outcome with cbind(), where operations are actually much faster with data frames than data tables. Don't ask me why I'd ever do the following, but I am curious as to why it is so much slower: library(data.table); library(microbenchmark) USArrests.dt <- data.table(USArrests) lst.USArrests <- replicate(1000, USArrests, simplify=FALSE) lst.USArrests.dt <- replicate(1000, USArrests.dt, simplify=FALSE) microbenchmark(do.call(cbind, lst.USArrests), do.call(cbind, lst.USArrests.dt), times=10) Unit: milliseconds expr min lq median uq max neval do.call(cbind, lst.USArrests) 42.26891 47.70086 48.71271 49.88542 51.25453 10 do.call(cbind, lst.USArrests.dt) 750.70469 761.70511 773.91232 816.85707 880.45896 10 This is run on an Ubuntu system. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Sat Mar 23 02:39:28 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 23 Mar 2013 01:39:28 +0000 Subject: [datatable-help] data.table and cbind() In-Reply-To: References: Message-ID: Interesting. Well asked. On my netbook : > Rprof() > system.time(do.call(cbind, lst.USArrests.dt)) user system elapsed 4.008 0.000 4.012 > Rprof(NULL) > summaryRprof() $by.self self.time self.pct total.time total.pct "make.names" 1.82 44.39 1.82 44.39 "data.table" 1.74 42.44 4.00 97.56 "[[.data.frame" 0.12 2.93 0.26 6.34 "gc" 0.10 2.44 0.10 2.44 "match" 0.08 1.95 0.10 2.44 "length" 0.06 1.46 0.06 1.46 "[[" 0.04 0.98 0.30 7.32 "%in%" 0.04 0.98 0.14 3.41 "NROW" 0.02 0.49 0.12 2.93 "is.data.frame" 0.02 0.49 0.02 0.49 "names" 0.02 0.49 0.02 0.49 "paste" 0.02 0.49 0.02 0.49 "sys.call" 0.02 0.49 0.02 0.49 So almost half of it is in make.names() [notice that cbind.data.frame calls data.frame with check.names=FALSE] and the other half in data.table() but not sure exactly where. So we can do better, or maybe we need a cbindlist (analogous to the existing rbindlist). But as you allude, we've spent most effort on := and set() to add columns by reference rather than copying using cbind(). I've added a feature request to tackle this anyway. Thanks for highlighting, great test. https://r-forge.r-project.org/tracker/?group_id=240&atid=978&func=detail&aid=2636 Matthew On 22.03.2013 22:23, Sadao Milberg wrote: > I've recently discovered the dramatic performance improvements data.table provides over ddply() and merge(), and I'm looking forward to integrating it into my work. While messing around with benchmarks, I ran into an unexpected outcome with cbind(), where operations are actually much faster with data frames than data tables. 
Don't ask me why I'd ever do the following, but I am curious as to why it is so much slower: > > USArrests.dt <- data.table(USArrests) > lst.USArrests <- replicate(1000, USArrests, simplify=FALSE) > lst.USArrests.dt <- replicate(1000, USArrests.dt, simplify=FALSE) > > microbenchmark(do.call(cbind, lst.USArrests), > do.call(cbind, lst.USArrests.dt), > times=10) > > Unit: milliseconds > expr min lq median uq max neval > do.call(cbind, lst.USArrests) 42.26891 47.70086 48.71271 49.88542 51.25453 10 > do.call(cbind, lst.USArrests.dt) 750.70469 761.70511 773.91232 816.85707 880.45896 10 > > This is run on an Ubuntu system. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gaizoule at gmail.com Sat Mar 23 08:06:33 2013 From: gaizoule at gmail.com (gaizoule) Date: Sat, 23 Mar 2013 00:06:33 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. Message-ID: <1364022393122-4662281.post@n4.nabble.com> Hi, everyone, data.table is really a fantastic package; I have become accustomed to using it and it has saved me a lot of time. In my daily work, I need to analyze lots of tick data, and IDateTime is very useful for me. However, the ITime class cannot handle milliseconds. I suggest using the number of milliseconds to represent the intraday time; for example, for the time "11:00:00.000", use the integer 11 * 60 * 60 * 1000 to represent it. I have used kdb+/q, and kdb+/q handles time exactly that way. best regards, gaizoule -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281.html Sent from the datatable-help mailing list archive at Nabble.com. From statquant at outlook.com Sun Mar 24 10:38:39 2013 From: statquant at outlook.com (statquant3) Date: Sun, 24 Mar 2013 02:38:39 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. 
In-Reply-To: <1364022393122-4662281.post@n4.nabble.com> References: <1364022393122-4662281.post@n4.nabble.com> Message-ID: <1364117919491-4662320.post@n4.nabble.com> I wrote almost the same message a few months ago (so Matthew knows that I am not duplicating ids to trick him into implementing this :)) More seriously, I recently discovered that R itself handles datetimes very poorly. It has the nice POSIXlt, which stores date and time as a list (on 40 bytes, which is why data.table does not handle it): R> lt = as.POSIXlt("2011-01-01 12:32.234354") R> attributes(lt) $names [1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday" "isdst" $class [1] "POSIXlt" "POSIXt" It has POSIXct, which stores the datetime as a double, but very often displays the datetime wrongly. See my SO post http://stackoverflow.com/questions/15383057/accurately-converting-from-character-posixct-character-with-sub-millisecond-da and the one it links to http://stackoverflow.com/questions/7726034/how-r-formats-posixct-with-fractional-seconds Dirk Edd. wrote something (then deleted it) stating that Windows could not handle more than milli-second datetimes and Linux "almost" micros. I never understood this... He is developing RcppBDT, which exposes the Boost Date_Time classes to R, which I would like to try, but it is not mature enough (according to him). So at the moment datetimes are really a nasty thing that is not handled as accurately as it should be -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662320.html Sent from the datatable-help mailing list archive at Nabble.com. From gaizoule at gmail.com Sun Mar 24 12:31:26 2013 From: gaizoule at gmail.com (gaizoule) Date: Sun, 24 Mar 2013 04:31:26 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. 
In-Reply-To: <1364117919491-4662320.post@n4.nabble.com> References: <1364022393122-4662281.post@n4.nabble.com> <1364117919491-4662320.post@n4.nabble.com> Message-ID: <1364124686513-4662322.post@n4.nabble.com> I've met the same problem, which is caused by POSIXct. I think POSIXlt's storage wastes a lot of space, and R should support intraday time handling. Thank you for your useful comments; I am reading the StackOverflow posts. As for "Windows could not handle more than milli-second datetimes and Linux "almost" micros", this is decided by the OS, not by anything else. And so, I think milliseconds are enough for me. -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662322.html Sent from the datatable-help mailing list archive at Nabble.com. From jholtman at gmail.com Sun Mar 24 15:12:30 2013 From: jholtman at gmail.com (Jim Holtman) Date: Sun, 24 Mar 2013 10:12:30 -0400 Subject: [datatable-help] Suggestion on ITime class implementing. Message-ID: One thing to remember about POSIXct is that with floating point you only have about 15 digits of accuracy. With 1970 as the base, roughly 10 of those digits are used for the whole seconds, leaving only 5 or so for the subseconds, so milliseconds are safe but much finer resolution is not. Sent from my Verizon Wireless 4G LTE Smartphone -------- Original message -------- From: gaizoule Date: 03/24/2013 07:31 (GMT-05:00) To: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] Suggestion on ITime class implementing. I've met the same problem, which is caused by POSIXct. I think POSIXlt's storage wastes a lot of space, and R should support intraday time handling. Thank you for your useful comments; I am reading the StackOverflow posts. As for "Windows could not handle more than milli-second datetimes and Linux "almost" micros", this is decided by the OS, not by anything else. And so, I think milliseconds are enough for me. 
-- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662322.html Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Mon Mar 25 09:52:15 2013 From: statquant at outlook.com (stat quant) Date: Mon, 25 Mar 2013 09:52:15 +0100 Subject: [datatable-help] Fwd: Is this worth a feature request? In-Reply-To: References: Message-ID: Hello data.tablers, I am aware of binary search using J in data.table selects; this works for "AND" if your table is keyed by 2 columns, like setkey(DT,x,y) DT[J('A',23),] <=> DT[x=='A' & y==23] #but binary search is much faster for big/large tables But does it work with "OR"? There is a post on SO along those lines http://stackoverflow.com/questions/15597971/can-we-do-binary-search-in-data-table-with-or-select-queries What about a feature request? Cheers Colin -------------- next part -------------- An HTML attachment was scrubbed... URL: From gaizoule at gmail.com Tue Mar 26 02:54:34 2013 From: gaizoule at gmail.com (gaizoule) Date: Mon, 25 Mar 2013 18:54:34 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. In-Reply-To: References: <1364022393122-4662281.post@n4.nabble.com> Message-ID: <1364262874029-4662450.post@n4.nabble.com> Thank you for your insightful comments; they got me to the essence of the problem. -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662450.html Sent from the datatable-help mailing list archive at Nabble.com. 
From mdowle at mdowle.plus.com Tue Mar 26 11:58:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 26 Mar 2013 10:58:09 +0000 Subject: [datatable-help] Suggestion on ITime class implementing. In-Reply-To: <1364022393122-4662281.post@n4.nabble.com> References: <1364022393122-4662281.post@n4.nabble.com> Message-ID: <2ddc85c35b817f1aaeeca5a9dbb0f0f3@imap.plus.net> Hi, An alternative to POSIXct is integer time: 12:34:56.789 => 123456789L, which I do quite a bit. And integer dates: 26 Mar 2013 => 20130326L. You can get quite far with two integer columns: date and time. Quite often I don't use any DateTime class at all. Each column is 4 bytes, and `roll=TRUE` then only rolls within the same day, which is what I usually want. But, yes, ITime should be in milliseconds. I couldn't find this on the tracker so have now filed it here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2644&group_id=240&atid=978 If any links to posts or S.O. questions are not reachable from there, please add. For micro (and nanosecond, why not) then perhaps we could use integer64 to avoid any floating point issues. 24*60*60*1e9 * 365*100 == 3.15e18 which fits in 2^63 (9.2e18), if I've got the arithmetic right. The nano timestamp could be +/- 292 years of precise nanoseconds around the epoch. And/or, for time only with no date, it could go to picoseconds: 24*60*60*1e12 = 8.6e16 < 2^63 All that would be required is availability of integer64, which is pretty standard (even on 32bit machines). Matthew On 23.03.2013 07:06, gaizoule wrote: > Hi, everyone, > data.table is really a fantastic package; I have become accustomed > to using > it and it has saved me a lot of time. > In my daily work, I need to analyze lots of tick data, and the > IDateTime is > very useful for me. However, the ITime class cannot handle milliseconds. 
> I > suggest using the number of milliseconds to represent the intraday > time; > for example, for the time "11:00:00.000", use the integer 11 * 60 * 60 * > 1000 > to represent it. I have used kdb+/q, and kdb+/q handles time exactly by > that > way. > > best regards, > > gaizoule > > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Tue Mar 26 12:11:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 26 Mar 2013 11:11:52 +0000 Subject: [datatable-help] Fwd: Is this worth a feature request? In-Reply-To: References: Message-ID: Hi, Yes, please file it. Auto converting x=='A' & y==23 to the relevant join syntax internally might be possible in the distant future as well (declarative i rather than imperative i). And might be needed sooner rather than later depending on how we implement the syntax for joining using secondary keys (creating set2key is the easy part). Matthew On 25.03.2013 08:52, stat quant wrote: > Hello data.tablers, > I am aware of binary search using J in data.table selects; this works for "AND" if your table is keyed by 2 columns, like > > setkey(DT,x,y) > DT[J('A',23),] <=> DT[x=='A' & y==23] #but binary search is much faster for big/large tables > > But does it work with "OR"? > There is a post on SO along those lines http://stackoverflow.com/questions/15597971/can-we-do-binary-search-in-data-table-with-or-select-queries [1] > What about a feature request? 
> > Cheers > Colin Links: ------ [1] http://stackoverflow.com/questions/15597971/can-we-do-binary-search-in-data-table-with-or-select-queries -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Wed Mar 27 20:45:05 2013 From: timothee.carayol at gmail.com (Timothée Carayol) Date: Wed, 27 Mar 2013 19:45:05 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? Message-ID: Hi, I have an example of a string of 4097 characters which can't be parsed by fread; however, if I remove any character, it can be parsed just fine. Is that a known limitation? (If I write the string to a file and then fread the file name, it works too.) Let me know if you need the string and/or a bug report. Thanks Timothée -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Mar 27 22:23:41 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 27 Mar 2013 21:23:41 -0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: References: Message-ID: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Hi, Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that the R limit for a character string length? What happens at 4097? Matthew > Hi, > > I have an example of a string of 4097 characters which can't be parsed by > fread; however, if I remove any character, it can be parsed just fine. Is > that a known limitation? > > (If I write the string to a file and then fread the file name, it works > too.) > > Let me know if you need the string and/or a bug report. 
> > Thanks > Timothée > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mhwaliji at google.com Wed Mar 27 22:51:46 2013 From: mhwaliji at google.com (Muhammad Waliji) Date: Wed, 27 Mar 2013 14:51:46 -0700 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Message-ID: R is happy with strings of length 4097: > paste(rep("a", 4097), collapse="") On Wed, Mar 27, 2013 at 2:23 PM, Matthew Dowle wrote: > Hi, > Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that > the R limit for a character string length? What happens at 4097? > Matthew > > > Hi, > > > > I have an example of a string of 4097 characters which can't be parsed by > > fread; however, if I remove any character, it can be parsed just fine. Is > > that a known limitation? > > > > (If I write the string to a file and then fread the file name, it works > > too.) > > > > Let me know if you need the string and/or a bug report. > > > > Thanks > > Timothée > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From timothee.carayol at gmail.com Wed Mar 27 23:49:32 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Wed, 27 Mar 2013 22:49:32 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Message-ID: Agree with Muhammad, longer character strings are definitely permitted in R. A minimal example that shows something strange happening with fread: for (n in c(1023:1025, 10000)) { A <- fread( paste( rep('a\tb\n', n), collapse='' ), sep='\t' ) print(nrow(A)) } On my computer, I obtain: [1] 1022 [1] 1023 [1] 1023 [1] 1023 Hope this helps Timothée On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: > Hi, > Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that > the R limit for a character string length? What happens at 4097? > Matthew > > > Hi, > > > > I have an example of a string of 4097 characters which can't be parsed by > > fread; however, if I remove any character, it can be parsed just fine. Is > > that a known limitation? > > > > (If I write the string to a file and then fread the file name, it works > > too.) > > > > Let me know if you need the string and/or a bug report. > > > > Thanks > > Timothée > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed...
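[Editor's aside: the original report notes that routing the same string through a file works. The sketch below is an untested illustration of that workaround, not part of the original thread; the variable names (`txt`, `tmp`) are hypothetical, and only documented base R functions plus `fread(filename)` are assumed.]

```r
library(data.table)

## Build a string longer than the reported ~4096-character threshold.
txt <- paste(rep("a\tb\n", 1025), collapse = "")

## Instead of fread(txt), write the string to a temporary file
## verbatim and fread the file name, which the report says parses fine.
tmp <- tempfile(fileext = ".tsv")
cat(txt, file = tmp)
A <- fread(tmp, sep = "\t")
unlink(tmp)
```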
URL: From mdowle at mdowle.plus.com Thu Mar 28 15:31:56 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 28 Mar 2013 14:31:56 +0000 Subject: [datatable-help] =?utf-8?q?fread=28character_string=29_limited_to?= =?utf-8?q?_strings_less_than_4096_long=3F?= In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Message-ID: <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Interesting, what's your sessionInfo() please? For me it seems to work ok : [1] 1022 [1] 1023 [1] 1024 [1] 9999 > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) On 27.03.2013 22:49, Timothée Carayol wrote: > Agree with Muhammad, longer character strings are definitely permitted in R. > A minimal example that shows something strange happening with fread: > > for (n in c(1023:1025, 10000)) { > A <- fread( > paste( > rep('a\tb\n', n), > collapse='' > ), > sep='\t' > ) > print(nrow(A)) > } > On my computer, I obtain: > > [1] 1022 > [1] 1023 > [1] 1023 > [1] 1023 > Hope this helps > Timothée > > On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: > >> Hi, >> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >> the R limit for a character string length? What happens at 4097?
>> > >> > Thanks >> > Timothée > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org [1] >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Thu Mar 28 15:38:37 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 28 Mar 2013 14:38:37 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <2c2af8789733127541fe78c1ccde5412@imap.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: Curiouser and curiouser.. I can reproduce on two computers with different versions of R and of data.table.
Computer 1 (it says unknown-linux but is actually ubuntu): R version 2.15.3 (2013-03-01) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 Computer 2: R version 2.15.2 (2012-10-26) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.8 loaded via a namespace (and not attached): [1] tools_2.15.2 On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: > Interesting, what's your sessionInfo() please? > > For me it seems to work ok : > > [1] 1022 > [1] 1023 > [1] 1024 > [1] 9999 > > > sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > > > On 27.03.2013 22:49, Timothée Carayol wrote: > > Agree with Muhammad, longer character strings are definitely permitted > in R. > A minimal example that shows something strange happening with fread: > for (n in c(1023:1025, 10000)) { > A <- fread( > paste( > rep('a\tb\n', n), > collapse='' > ), > sep='\t' > ) > print(nrow(A)) > } > On my computer, I obtain: > [1] 1022 > [1] 1023 > [1] 1023 > [1] 1023 > Hope this helps > Timothée > > > On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: > >> Hi, >> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >> the R limit for a character string length? What happens at 4097?
>> Matthew >> >> > Hi, >> > >> > I have an example of a string of 4097 characters which can't be parsed >> by >> > fread; however, if I remove any character, it can be parsed just fine. >> Is >> > that a known limitation? >> > >> > (If I write the string to a file and then fread the file name, it works >> > too.) >> > >> > Let me know if you need the string and/or a bug report. >> > >> > Thanks >> > Timothée >> > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 28 15:55:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 28 Mar 2013 14:55:17 +0000 Subject: [datatable-help] =?utf-8?q?fread=28character_string=29_limited_to?= =?utf-8?q?_strings_less_than_4096_long=3F?= In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: Hm this is odd. Could you run the following and paste back the (verbose) results please. for (n in c(1023:1025, 10000)) { input = paste( rep('a\tb\n', n), collapse='') A = fread(input,verbose=TRUE) cat(nchar(input), nrow(A), "\n") } On 28.03.2013 14:38, Timothée Carayol wrote: > Curiouser and curiouser.. > > I can reproduce on two computers with different versions of R and of data.table.
> > Computer 1 (it says unknown-linux but is actually ubuntu): > > R version 2.15.3 (2013-03-01) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 > LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 > Computer 2: > > R version 2.15.2 (2012-10-26) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.15.2 > > On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: > >> Interesting, what's your sessionInfo() please? >> >> For me it seems to work ok : >> >> [1] 1022 >> [1] 1023 >> [1] 1024 >> [1] 9999 >> >>> sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> On 27.03.2013 22:49, Timoth?e Carayol wrote: >> >>> Agree with Muhammad, longer character strings are definitely permitted in R. 
>>> A minimal example that shows something strange happening with fread: >>> >>> for (n in c(1023:1025, 10000)) { >>> A <- fread( >>> paste( >>> rep('a\tb\n', n), >>> collapse='' >>> ), >>> sep='\t' >>> ) >>> print(nrow(A)) >>> } >>> On my computer, I obtain: >>> >>> [1] 1022 >>> [1] 1023 >>> [1] 1023 >>> [1] 1023 >>> Hope this helps >>> Timothée >>> >>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>> >>>> Hi, >>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >>>> the R limit for a character string length? What happens at 4097? >>>> Matthew >>>> >>>> > Hi, >>>> > >>>> > I have an example of a string of 4097 characters which can't be parsed by >>>> > fread; however, if I remove any character, it can be parsed just fine. Is >>>> > that a known limitation? >>>> > >>>> > (If I write the string to a file and then fread the file name, it works >>>> > too.) >>>> > >>>> > Let me know if you need the string and/or a bug report. >>>> > >>>> > Thanks >>>> > Timothée > _______________________________________________ >>>> > datatable-help mailing list >>>> > datatable-help at lists.r-forge.r-project.org [1] >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Thu Mar 28 15:58:37 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 28 Mar 2013 14:58:37 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long?
In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 1023 Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 4092 1022 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 1023 Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 4096 1023 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 1023 Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 4100 1023 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. 
Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 1023 Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 40000 1023 On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: > ** > > > > Hm this is odd. > > Could you run the following and paste back the (verbose) results please. > > for (n in c(1023:1025, 10000)) { > input = paste( rep('a\tb\n', n), collapse='') > A = fread(input,verbose=TRUE) > cat(nchar(input), nrow(A), "\n") > } > > > > > > On 28.03.2013 14:38, Timoth?e Carayol wrote: > > Curiouser and curiouser.. > > I can reproduce on two computers with different versions of R and of > data.table. 
> > > > Computer 1 (it says unknown-linux but is actually ubuntu): > > R version 2.15.3 (2013-03-01) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > LC_MONETARY=en_GB.UTF-8 > LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C > LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 > LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 > > Computer 2: > R version 2.15.2 (2012-10-26) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.15.2 > > > On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: > >> >> >> Interesting, what's your sessionInfo() please? >> >> For me it seems to work ok : >> >> [1] 1022 >> [1] 1023 >> [1] 1024 >> [1] 9999 >> >> > sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> >> >> On 27.03.2013 22:49, Timoth?e Carayol wrote: >> >> Agree with Muhammad, longer character strings are definitely permitted >> in R. 
>> A minimal example that shows something strange happening with fread: >> for (n in c(1023:1025, 10000)) { >> A <- fread( >> paste( >> rep('a\tb\n', n), >> collapse='' >> ), >> sep='\t' >> ) >> print(nrow(A)) >> } >> On my computer, I obtain: >> [1] 1022 >> [1] 1023 >> [1] 1023 >> [1] 1023 >> Hope this helps >> Timothée >> >> >> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >> >>> Hi, >>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>> that >>> the R limit for a character string length? What happens at 4097? >>> Matthew >>> >>> > Hi, >>> > >>> > I have an example of a string of 4097 characters which can't be parsed >>> by >>> > fread; however, if I remove any character, it can be parsed just fine. >>> Is >>> > that a known limitation? >>> > >>> > (If I write the string to a file and then fread the file name, it works >>> > too.) >>> > >>> > Let me know if you need the string and/or a bug report. >>> > >>> > Thanks >>> > Timothée >>> > _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 28 16:19:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 28 Mar 2013 15:19:52 +0000 Subject: [datatable-help] =?utf-8?q?fread=28character_string=29_limited_to?= =?utf-8?q?_strings_less_than_4096_long=3F?= In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: <230b0040889556349b21822824a5fb7e@imap.plus.net> Hi, Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with v1.8.9 should have the -nan% problem fixed. I'm a bit stumped for the moment. I've filed a bug report.
Probably, if I still can't reproduce at my end, I'll add some more detailed tracing to verbose output and ask you to try again next week if that's ok. Thanks for reporting! Matthew On 28.03.2013 14:58, Timothée Carayol wrote: > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 1023 > Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4092 1022 > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4096 1023 > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4100 1023 > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 40000 1023 > > On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: > >> Hm this is odd.
>> >> Could you run the following and paste back the (verbose) results please. >> for (n in c(1023:1025, 10000)) { >> >> input = paste( rep('a\tb\n', n), collapse='') >> A = fread(input,verbose=TRUE) >> cat(nchar(input), nrow(A), "\n") >> } >> >> On 28.03.2013 14:38, Timothée Carayol wrote: >> >>> Curiouser and curiouser.. >>> >>> I can reproduce on two computers with different versions of R and of data.table. >>> >>> Computer 1 (it says unknown-linux but is actually ubuntu): >>> >>> R version 2.15.3 (2013-03-01) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 >>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >>> Computer 2: >>> >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-redhat-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] data.table_1.8.8 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_2.15.2 >>> >>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >>> >>>> Interesting, what's your sessionInfo() please?
>>>> >>>> For me it seems to work ok : >>>> >>>> [1] 1022 >>>> [1] 1023 >>>> [1] 1024 >>>> [1] 9999 >>>> >>>>> sessionInfo() >>>> R version 2.15.2 (2012-10-26) >>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>> >>>> On 27.03.2013 22:49, Timothée Carayol wrote: >>>> >>>>> Agree with Muhammad, longer character strings are definitely permitted in R. >>>>> A minimal example that shows something strange happening with fread: >>>>> >>>>> for (n in c(1023:1025, 10000)) { >>>>> A <- fread( >>>>> paste( >>>>> rep('a\tb\n', n), >>>>> collapse='' >>>>> ), >>>>> sep='\t' >>>>> ) >>>>> print(nrow(A)) >>>>> } >>>>> On my computer, I obtain: >>>>> >>>>> [1] 1022 >>>>> [1] 1023 >>>>> [1] 1023 >>>>> [1] 1023 >>>>> Hope this helps >>>>> Timothée >>>>> >>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>>>> >>>>>> Hi, >>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >>>>>> the R limit for a character string length? What happens at 4097? >>>>>> Matthew >>>>>> >>>>>> > Hi, >>>>>> > >>>>>> > I have an example of a string of 4097 characters which can't be parsed by >>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is >>>>>> > that a known limitation? >>>>>> > >>>>>> > (If I write the string to a file and then fread the file name, it works >>>>>> > too.) >>>>>> > >>>>>> > Let me know if you need the string and/or a bug report.
>>>>>> > >>>>>> > Thanks >>>>>> > Timoth?e > _______________________________________________ >>>>>> > datatable-help mailing list >>>>>> > datatable-help at lists.r-forge.r-project.org [1] >>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com [5] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From gsee000 at gmail.com Thu Mar 28 16:23:34 2013 From: gsee000 at gmail.com (G See) Date: Thu, 28 Mar 2013 10:23:34 -0500 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <230b0040889556349b21822824a5fb7e@imap.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> <230b0040889556349b21822824a5fb7e@imap.plus.net> Message-ID: FWIW, on mac: > for (n in c(1023:1025, 10000)) { + A <- fread( + paste( + rep('a\tb\n', n), + collapse='' + ), + sep='\t' + ) + print(nrow(A)) + } [1] 255 [1] 255 [1] 255 [1] 255 > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.9 ####### and with verbose > for (n in c(1023:1025, 10000)) { + input = paste( rep('a\tb\n', n), collapse='') + A = fread(input,verbose=TRUE) + cat(nchar(input), nrow(A), "\n") + } Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... 
'\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 14%) Memory map (rerun may be quicker) 0.000s ( 25%) sep and header detection 0.000s ( 8%) Count rows (wc -l) 0.000s ( 24%) Column type detection (first, middle and last 5 rows) 0.000s ( 6%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 22%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 1%) Changing na.strings to NA 0.000s Total 4092 255 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 10%) Memory map (rerun may be quicker) 0.000s ( 21%) sep and header detection 0.000s ( 10%) Count rows (wc -l) 0.000s ( 28%) Column type detection (first, middle and last 5 rows) 0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 26%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 2%) Changing na.strings to NA 0.000s Total 4096 255 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 10%) Memory map (rerun may be quicker) 0.000s ( 21%) sep and header detection 0.000s ( 10%) Count rows (wc -l) 0.000s ( 27%) Column type detection (first, middle and last 5 rows) 0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 27%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 1%) Changing na.strings to NA 0.000s Total 4100 255 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... 
'\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 10%) Memory map (rerun may be quicker) 0.000s ( 23%) sep and header detection 0.000s ( 10%) Count rows (wc -l) 0.000s ( 25%) Column type detection (first, middle and last 5 rows) 0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 26%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 3%) Changing na.strings to NA 0.000s Total 40000 255 Best, Garrett On Thu, Mar 28, 2013 at 10:19 AM, Matthew Dowle wrote: > > > Hi, > > Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with v1.8.9 > should have the -nan% problem fixed. > > I'm a bit stumped for the moment. I've filed a bug report. Probably, if I > still can't reproduce my end, I'll add some more detailed tracing to verbose > output and ask you to try again next week if that's ok. > > Thanks for reporting! > > Matthew > > > > On 28.03.2013 14:58, Timoth?e Carayol wrote: > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 1023 > Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4092 1022 > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4096 1023 > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4100 1023 > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 40000 1023 > > > On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle > wrote: >> >> >> >> Hm this is odd. 
>> >> Could you run the following and paste back the (verbose) results please. >> >> for (n in c(1023:1025, 10000)) { >> >> input = paste( rep('a\tb\n', n), collapse='') >> A = fread(input,verbose=TRUE) >> cat(nchar(input), nrow(A), "\n") >> } >> >> >> >> >> >> On 28.03.2013 14:38, Timoth?e Carayol wrote: >> >> Curiouser and curiouser.. >> >> I can reproduce on two computers with different versions of R and of >> data.table. >> >> >> >> Computer 1 (it says unknown-linux but is actually ubuntu): >> >> R version 2.15.3 (2013-03-01) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> LC_MONETARY=en_GB.UTF-8 >> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C >> LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >> Computer 2: >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] data.table_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.15.2 >> >> >> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle >> wrote: >>> >>> >>> >>> Interesting, what's your sessionInfo() please? 
>>> >>> For me it seems to work ok : >>> >>> [1] 1022 >>> [1] 1023 >>> [1] 1024 >>> [1] 9999 >>> >>> > sessionInfo() >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> >>> >>> On 27.03.2013 22:49, Timothée Carayol wrote: >>> >>> Agree with Muhammad, longer character strings are definitely permitted in >>> R. >>> A minimal example that shows something strange happening with fread: >>> for (n in c(1023:1025, 10000)) { >>> A <- fread( >>> paste( >>> rep('a\tb\n', n), >>> collapse='' >>> ), >>> sep='\t' >>> ) >>> print(nrow(A)) >>> } >>> On my computer, I obtain: >>> [1] 1022 >>> [1] 1023 >>> [1] 1023 >>> [1] 1023 >>> Hope this helps >>> Timothée >>> >>> >>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle >>> wrote: >>>> >>>> Hi, >>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>>> that >>>> the R limit for a character string length? What happens at 4097? >>>> Matthew >>>> >>>> > Hi, >>>> > >>>> > I have an example of a string of 4097 characters which can't be parsed >>>> > by >>>> > fread; however, if I remove any character, it can be parsed just fine. >>>> > Is >>>> > that a known limitation? >>>> > >>>> > (If I write the string to a file and then fread the file name, it >>>> > works >>>> > too.) >>>> > >>>> > Let me know if you need the string and/or a bug report. 
>>>> > >>>> > Thanks >>>> > Timoth?e >>>> > _______________________________________________ >>>> > datatable-help mailing list >>>> > datatable-help at lists.r-forge.r-project.org >>>> > >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>> >>> >>> >> >> >> >> > > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From timothee.carayol at gmail.com Thu Mar 28 16:26:38 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 28 Mar 2013 15:26:38 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <230b0040889556349b21822824a5fb7e@imap.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> <230b0040889556349b21822824a5fb7e@imap.plus.net> Message-ID: Of course, I'll be happy to help! By the way the verbose output was actually from computer 1 (with 1.8.9) so it seems like the -nan% problem is maybe still there? Cheers Timoth?e On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle wrote: > ** > > > > Hi, > > Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with > v1.8.9 should have the -nan% problem fixed. > > I'm a bit stumped for the moment. I've filed a bug report. Probably, if > I still can't reproduce my end, I'll add some more detailed tracing to > verbose output and ask you to try again next week if that's ok. > > Thanks for reporting! > > Matthew > > > > On 28.03.2013 14:58, Timoth?e Carayol wrote: > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... 
> '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. > Count of eol after first data row: 1023 > > Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 4092 1022 > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 1023 > > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 4096 1023 > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 1023 > > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 4100 1023 > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 1023 > > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 40000 1023 > > > > On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: > >> >> >> Hm this is odd. >> >> Could you run the following and paste back the (verbose) results please. >> for (n in c(1023:1025, 10000)) { >> >> input = paste( rep('a\tb\n', n), collapse='') >> A = fread(input,verbose=TRUE) >> cat(nchar(input), nrow(A), "\n") >> } >> >> >> >> >> >> On 28.03.2013 14:38, Timoth?e Carayol wrote: >> >> Curiouser and curiouser.. >> >> I can reproduce on two computers with different versions of R and of >> data.table. 
>> >> >> >> Computer 1 (it says unknown-linux but is actually ubuntu): >> >> R version 2.15.3 (2013-03-01) >> >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> LC_MONETARY=en_GB.UTF-8 >> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C >> LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 >> LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >> >> Computer 2: >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] data.table_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.15.2 >> >> >> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >> >>> >>> >>> Interesting, what's your sessionInfo() please? >>> >>> For me it seems to work ok : >>> >>> [1] 1022 >>> [1] 1023 >>> [1] 1024 >>> [1] 9999 >>> >>> > sessionInfo() >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> >>> >>> On 27.03.2013 22:49, Timoth?e Carayol wrote: >>> >>> Agree with Muhammad, longer character strings are definitely permitted >>> in R. 
>>> A minimal example that shows something strange happening with fread: >>> for (n in c(1023:1025, 10000)) { >>> A <- fread( >>> paste( >>> rep('a\tb\n', n), >>> collapse='' >>> ), >>> sep='\t' >>> ) >>> print(nrow(A)) >>> } >>> On my computer, I obtain: >>> [1] 1022 >>> [1] 1023 >>> [1] 1023 >>> [1] 1023 >>> Hope this helps >>> Timothée >>> >>> >>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>> >>>> Hi, >>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>>> that >>>> the R limit for a character string length? What happens at 4097? >>>> Matthew >>>> >>>> > Hi, >>>> > >>>> > I have an example of a string of 4097 characters which can't be >>>> parsed by >>>> > fread; however, if I remove any character, it can be parsed just >>>> fine. Is >>>> > that a known limitation? >>>> > >>>> > (If I write the string to a file and then fread the file name, it >>>> works >>>> > too.) >>>> > >>>> > Let me know if you need the string and/or a bug report. >>>> > >>>> > Thanks >>>> > Timothée >>>> > _______________________________________________ >>>> > datatable-help mailing list >>>> > datatable-help at lists.r-forge.r-project.org >>>> > >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>> >>> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Mar 28 17:52:57 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 28 Mar 2013 12:52:57 -0400 Subject: [datatable-help] rbindlist on list of data.frames with factor column Message-ID: Hello, I found that when using `rbindlist` on a list of data.frames with factor columns, the factor column is getting concat'd as its numeric equivalent. This of course, does not happen when using a list of data.tables. 
# sample data, using data.frame sampleList.DF <- lapply(LETTERS[1:5], function(L) data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) sampleList.DF <- lapply(sampleList.DF, function(x) {x$StringCol <- as.character(x$FactorCol); x}) # sample data, using data.table sampleList.DT <- lapply(LETTERS[1:5], function(L) data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) sampleList.DT <- lapply(sampleList.DT, function(x) x[, StringCol := as.character(FactorCol)]) # Compare the column `FactorCol`: rbindlist(sampleList.DT) rbindlist(sampleList.DF) do.call(rbind, sampleList.DF) Interestingly, I originally thought it was levels dependent: (I would have expected, for example, the following to allow for the levels of the third list element, but it does not). sampleList.DF[[1]][, "FactorCol"] <- factor(c("A", "C", "A")) # all the levels in third element are present in the first all(levels(sampleList.DF[[3]][, "FactorCol"]) %in% levels(sampleList.DF[[1]][, "FactorCol"])) # [1] TRUE But... rbindlist(sampleList.DF) However: sampleList.DF[[1]][, "FactorCol"] <- factor(c("C", "A", "A"), levels=c("C", "A")) rbindlist(sampleList.DF) Is the above behavior intended? Cheers, Rick -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Mar 28 18:34:29 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 28 Mar 2013 13:34:29 -0400 Subject: [datatable-help] rbindlist on list of data.frames with factor column In-Reply-To: References: Message-ID: My apologies, I had a mistake in my previous email. 
(I forgot that data.table does not coerce strings to factor) It looks like the `rbindlist` behavior observed occurs for *both*, a list of data.tables and a list of data.frames (assuming, of course, that there is a factor column present) # sample data, using data.frame set.seed(1) sampleList.DF <- lapply(LETTERS[1:5], function(L) data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) sampleList.DF <- lapply(sampleList.DF, function(x) {x$StringCol <- as.character(x$FactorCol); x}) # sample data, using data.table set.seed(1) sampleList.DT <- lapply(LETTERS[1:5], function(L) data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) sampleList.DT <- lapply(sampleList.DT, function(x) x[, StringCol := as.character(FactorCol)]) # rbindlist results: rbindlist(sampleList.DT) rbindlist(sampleList.DF) # expected behavior similar to do.call(rbind, LIST) do.call(rbind, sampleList.DF) do.call(rbind, sampleList.DT) On Thu, Mar 28, 2013 at 12:52 PM, Ricardo Saporta < saporta at scarletmail.rutgers.edu> wrote: > Hello, > > I found that when using `rbindlist` on a list of data.frames with factor > columns, the factor column is getting concat'd as its numeric equivalent. > > This of course, does not happen when using a list of data.tables. 
> > # sample data, using data.frame > sampleList.DF <- lapply(LETTERS[1:5], function(L) > data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) > > sampleList.DF <- lapply(sampleList.DF, function(x) > {x$StringCol <- as.character(x$FactorCol); x}) > > # sample data, using data.table > sampleList.DT <- lapply(LETTERS[1:5], function(L) > data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) > sampleList.DT <- lapply(sampleList.DT, function(x) > x[, StringCol := as.character(FactorCol)]) > > > # Compare the column `FactorCol`: > > rbindlist(sampleList.DT) > rbindlist(sampleList.DF) > do.call(rbind, sampleList.DF) > > Interestingly, I originally thought it was levels dependent: > (I would have expected, for example, the following to allow for the levels > of the third list element, but it does not). > > sampleList.DF[[1]][, "FactorCol"] <- factor(c("A", "C", "A")) > > # all the levels in third element are present in the first > all(levels(sampleList.DF[[3]][, "FactorCol"]) %in% > levels(sampleList.DF[[1]][, "FactorCol"])) > # [1] TRUE > > But... > > rbindlist(sampleList.DF) > > However: > > sampleList.DF[[1]][, "FactorCol"] <- factor(c("C", "A", "A"), > levels=c("C", "A")) > rbindlist(sampleList.DF) > > Is the above behavior intended? > > Cheers, > Rick > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Mar 29 02:04:32 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 29 Mar 2013 01:04:32 +0000 Subject: [datatable-help] rbindlist on list of data.frames with factor column In-Reply-To: References: Message-ID: Well spotted. Looking at the C source just now it looks like I never considered factor columns in rbindlist(). At the time I needed rbindlist, I needed it quickly for something I was doing, which didn't use factor columns. Please file as a bug report. Should be fairly easy to implement, and quick in C. 
It would populate the column as if it were character (without actually converting to a new character vector for each item of l) and then call factor() at R level afterwards to refactor it. Matthew On 28.03.2013 17:34, Ricardo Saporta wrote: > My apologies, I had a mistake in my previous email. (I forgot that data.table does not coerce strings to factor) > It looks like the `rbindlist` behavior observed occurs for _BOTH_, a list of data.tables and a list of data.frames (assuming, of course, that there is a factor column present) > # sample data, using data.frame > set.seed(1) > sampleList.DF <- lapply(LETTERS[1:5], function(L) > data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) > sampleList.DF <- lapply(sampleList.DF, function(x) > {x$StringCol <- as.character(x$FactorCol); x}) > # sample data, using data.table > set.seed(1) > sampleList.DT <- lapply(LETTERS[1:5], function(L) > data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) > sampleList.DT <- lapply(sampleList.DT, function(x) > x[, StringCol := as.character(FactorCol)]) > # rbindlist results: > rbindlist(sampleList.DT) > rbindlist(sampleList.DF) > # expected behavior similar to do.call(rbind, LIST) > do.call(rbind, sampleList.DF) > do.call(rbind, sampleList.DT) > > On Thu, Mar 28, 2013 at 12:52 PM, Ricardo Saporta wrote: > >> Hello, >> I found that when using `rbindlist` on a list of data.frames with factor columns, the factor column is getting concat'd as its numeric equivalent. >> This of course, does not happen when using a list of data.tables. 
>> # sample data, using data.frame >> sampleList.DF <- lapply(LETTERS[1:5], function(L) >> data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) >> sampleList.DF <- lapply(sampleList.DF, function(x) >> {x$StringCol <- as.character(x$FactorCol); x}) >> # sample data, using data.table >> sampleList.DT <- lapply(LETTERS[1:5], function(L) >> data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) >> sampleList.DT <- lapply(sampleList.DT, function(x) >> x[, StringCol := as.character(FactorCol)]) >> # Compare the column `FactorCol`: >> rbindlist(sampleList.DT) >> rbindlist(sampleList.DF) >> do.call(rbind, sampleList.DF) >> Interestingly, I originally thought it was levels dependent: >> (I would have expected, for example, the following to allow for the levels of the third list element, but it does not). >> sampleList.DF[[1]][, "FactorCol"] <- factor(c("A", "C", "A")) >> >> # all the levels in third element are present in the first >> all(levels(sampleList.DF[[3]][, "FactorCol"]) %in% levels(sampleList.DF[[1]][, "FactorCol"])) >> # [1] TRUE >> But... >> rbindlist(sampleList.DF) >> However: >> sampleList.DF[[1]][, "FactorCol"] <- factor(c("C", "A", "A"), levels=c("C", "A")) >> rbindlist(sampleList.DF) >> >> Is the above behavior intended? >> Cheers, >> Rick Links: ------ [1] mailto:saporta at scarletmail.rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... URL:
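[Editor's note] Until rbindlist() handles factor columns natively, the approach Matthew describes (fill the column as character, then call factor() at R level afterwards) can also be applied as a user-level workaround. The sketch below is an assumption-laden illustration, not code from the thread: the helper name rbindlist_keep_factors is hypothetical, and it assumes a data.table version where as.data.table() and set() behave as documented.

```r
library(data.table)

# Hypothetical helper: bind a list of data.frames/data.tables while
# preserving factor columns, by round-tripping them through character.
rbindlist_keep_factors <- function(l) {
  # which columns are factors, judged from the first list element
  factor_cols <- names(l[[1]])[sapply(l[[1]], is.factor)]
  # convert factor columns to character in every list element
  l <- lapply(l, function(x) {
    x <- as.data.table(x)
    for (col in factor_cols)
      set(x, j = col, value = as.character(x[[col]]))
    x
  })
  res <- rbindlist(l)
  # refactor at R level, so the levels are the union of all values seen
  for (col in factor_cols)
    set(res, j = col, value = factor(res[[col]]))
  res
}

# usage with the sample data from the thread:
sampleList.DF <- lapply(LETTERS[1:5], function(L)
  data.frame(Val1 = rnorm(3), Val2 = runif(3), FactorCol = factor(L)))
rbindlist_keep_factors(sampleList.DF)$FactorCol  # factor with levels A B C D E
```

This avoids the numeric-code concatenation shown above because the bind itself only ever sees character data; factor() is applied once, to the full combined column.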