From saporta at scarletmail.rutgers.edu Sun Mar 3 19:34:05 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Sun, 3 Mar 2013 13:34:05 -0500
Subject: [datatable-help] Benchmarks for reshaping data
Message-ID:

Hello All,

There were some questions on SO today regarding reshaping data which provided good opportunities to run benchmarks. I'm sending the links here in case others are interested:

http://bit.ly/ZZXA6X
http://bit.ly/YkdapY

Cheers,
Rick

--
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From victor.kryukov at gmail.com Sun Mar 3 23:25:52 2013
From: victor.kryukov at gmail.com (Victor Kryukov)
Date: Sun, 3 Mar 2013 14:25:52 -0800
Subject: [datatable-help] Error in a package that imports data.table
Message-ID: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com>

Hello,

I'm developing an R package which will be used internally at my company, and I'm having trouble using data.table. I'm very new to package development and I'm not really sure whether the errors I see are related to data.table or not, but here it is anyway.

I have a function that imports data from .csv files and cleans the data (subsets, converting fields to numeric etc.). At the end of the function, I convert the resulting data.frame to data.table and return the result:

ProcessData <- function(...) {
    ...
    df <- data.table(df)
    df
}

When I use this function standalone, after

library(data.table)

everything works as expected. However, when I define this function as part of a package and later call it, I get the following error:

Error in rbind(deparse.level, ...)
: could not find function ".rbind.data.table" Please note that in the package .R files, I'm not importing data.table directly with library(data.package) but rather have `import(data.table)` statement in my NAMESPACE, as recommended here https://github.com/hadley/devtools/wiki/Namespaces. When I import data.table directly with library(data.table) after importing my package, everything works as expected. I suspect there may be something going wrong with namespaces in data.table. My environment: I'm using R 2.15.3 on Mac and have tested the above on both data.table 1.8.6 and 1.8.7. Please let me know if I need to provide more info. Any help will be much appreciated! Regards, Victor From mdowle at mdowle.plus.com Mon Mar 4 00:03:01 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 03 Mar 2013 23:03:01 +0000 Subject: [datatable-help] Benchmarks for reshaping data In-Reply-To: References: Message-ID: <71673794ac7b88c1caeba6eef69edc32@imap.plus.net> Hi, Many thanks. I commented/answered there. Matthew On 03.03.2013 18:34, Ricardo Saporta wrote: > Hello All, > There were some questions on SO today regarding reshaping data which provided good opportunities to run benchmarks. > I'm sending the links here in case others are interested: > http://bit.ly/ZZXA6X [1] > http://bit.ly/YkdapY [2] > Cheers, > Rick -- > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [3] Links: ------ [1] http://bit.ly/ZZXA6X [2] http://bit.ly/YkdapY [3] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Mon Mar 4 00:26:30 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 03 Mar 2013 23:26:30 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> Message-ID: Hi, Did you include data.table in either the Imports or Depends field of your package's DESCRIPTION file? I've just improved data.table FAQ 6.9 to make that clearer. If it still doesn't work, does your package fully pass "R CMD check"? Matthew On 03.03.2013 22:25, Victor Kryukov wrote: > Hello, > > I'm developing an R package which will be used internally at my > company, and I have troubles using data.table. I'm very new to > package > development and I'm not really sure whether the errors I see are > related to data.table or not, but here it is anyway. > > I have a function that imports data from .csv files and cleans the > data (subsets, converting fields to numeric etc.). As the end of the > function definition, I convert the resulting data.frame to data.table > and return the result: > > ProcessData <- function(?) { > ... > df <- data.table(df) > df > } > > When I use this function standalone, after > > library(data.package) > > everything works as expected. However, when I'm defining this > function as a part of a package and later call it, I'm getting the > following error: > > Error in rbind(deparse.level, ...) : > could not find function ".rbind.data.table" > > Please note that in the package .R files, I'm not importing > data.table directly with library(data.package) but rather have > `import(data.table)` statement in my NAMESPACE, as recommended here > https://github.com/hadley/devtools/wiki/Namespaces. > > When I import data.table directly with library(data.table) after > importing my package, everything works as expected. > > I suspect there may be something going wrong with namespaces in > data.table. 
>
> My environment: I'm using R 2.15.3 on Mac and have tested the above
> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to
> provide more info. Any help will be much appreciated!
>
> Regards,
> Victor
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From michael.nelson at sydney.edu.au Mon Mar 4 00:47:29 2013
From: michael.nelson at sydney.edu.au (Michael Nelson)
Date: Sun, 3 Mar 2013 23:47:29 +0000
Subject: [datatable-help] which parent.frame is more correct
Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD5827DC71@EX-MBX-PRO-04.mcs.usyd.edu.au>

In answering http://stackoverflow.com/a/15102156/1385941 I confidently stated that parent.frame(3) was the correct frame to use, and I stand by that, but am slightly confused over how parent.frame(1) and parent.frame(3) differ in how they are evaluated.

More specifically, I don't understand why `parent.frame(1)` works as it does.
Take for example:

x <- 3:4

dt <- data.table(x = 1:5, y = 5:1, key = 'x')

foo <- function(){
  x <- 1:2
  for(n in 1:5) {
    print(dt[list(get('x', parent.frame(n)))])
  }
}

foo()

# n = 1
# uses parent.frame of foo
#
#    x y
# 1: 3 3
# 2: 4 2

# n = 2
# some kind of self join of data.table
# output equivalent of (dt[dt[list(x)]])
#    x y y.1
# 1: 1 5   5
# 2: 2 4   4
# 3: 3 3   3
# 4: 4 2   2
# 5: 5 1   1

# n = 3
# uses parent.frame of call to `[.data.table`
#    x y
# 1: 1 5
# 2: 2 4

# n >= 4
# uses parent.frame of foo again (makes sense I think)
#    x y
# 1: 5 1
# 2: 4 2

________________________________________
From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Matthew Dowle [mdowle at mdowle.plus.com]
Sent: Monday, 4 March 2013 10:26 AM
To: victor.kryukov at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Error in a package that imports data.table

Hi,

Did you include data.table in either the Imports or Depends field of your package's DESCRIPTION file?

I've just improved data.table FAQ 6.9 to make that clearer.

If it still doesn't work, does your package fully pass "R CMD check"?

Matthew

On 03.03.2013 22:25, Victor Kryukov wrote:
> Hello,
>
> I'm developing an R package which will be used internally at my
> company, and I have trouble using data.table. I'm very new to
> package development and I'm not really sure whether the errors I see
> are related to data.table or not, but here it is anyway.
>
> I have a function that imports data from .csv files and cleans the
> data (subsets, converting fields to numeric etc.). At the end of the
> function, I convert the resulting data.frame to data.table
> and return the result:
>
> ProcessData <- function(...) {
>     ...
>     df <- data.table(df)
>     df
> }
>
> When I use this function standalone, after
>
> library(data.table)
>
> everything works as expected.
However, when I'm defining this > function as a part of a package and later call it, I'm getting the > following error: > > Error in rbind(deparse.level, ...) : > could not find function ".rbind.data.table" > > Please note that in the package .R files, I'm not importing > data.table directly with library(data.package) but rather have > `import(data.table)` statement in my NAMESPACE, as recommended here > https://github.com/hadley/devtools/wiki/Namespaces. > > When I import data.table directly with library(data.table) after > importing my package, everything works as expected. > > I suspect there may be something going wrong with namespaces in > data.table. > > My environment: I'm using R 2.15.3 on Mac and have tested the above > on both data.table 1.8.6 and 1.8.7. Please let me know if I need to > provide more info. Any help will be much appreciated! > > Regards, > Victor > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Mon Mar 4 01:56:33 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 04 Mar 2013 00:56:33 +0000 Subject: [datatable-help] which parent.frame is more correct In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD5827DC71@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <6FB5193A6CDCDF499486A833B7AFBDCD5827DC71@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <20aad0cfa80570d56aacdf745fb14461@imap.plus.net> Hi, In general it's probably best to use a unique name like "..tmpvar" and let `i` or `j` find that via scope. Passing a specific n to parent.frame(n) might work now, but may be dependent on data.table internals not changing in the future. 
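[As a side note on the mechanics under discussion: parent.frame(n) walks n calls up the calling stack. That behaviour can be seen without data.table at all; a minimal base-R sketch (function names here are illustrative only):]

```r
outer <- function() {
  x <- "outer's x"
  inner()
}

inner <- function() {
  # parent.frame(1) is the frame of inner's caller (i.e. outer), so get()
  # restricted with inherits=FALSE finds only outer's local x there,
  # never the global x
  get("x", envir = parent.frame(1), inherits = FALSE)
}

x <- "global x"
outer()   # returns "outer's x", not the global x
```

With data.table, the extra wrinkle is that `[.data.table` itself calls eval() internally, which inserts additional frames between the user's function and the point where get() runs.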
Also, note the Notes in ?parent.frame (i.e. user beware). With those caveats out of the way, some answers inline ...

On 03.03.2013 23:47, Michael Nelson wrote:
> In answering http://stackoverflow.com/a/15102156/1385941
>
> I confidently stated that parent.frame(3) was the correct frame to use,
> and I stand by that, but am slightly confused over how parent.frame(1)
> and parent.frame(3) differ in how they are evaluated.
>
> More specifically, I don't understand why `parent.frame(1)` works as
> it does.
>
> Take for example
>
> x <- 3:4
>
> dt <- data.table(x = 1:5, y = 5:1, key = 'x')
>
> foo <- function(){
>   x <- 1:2
>   for(n in 1:5) {
>     print(dt[list(get('x', parent.frame(n)))])
>   }
> }
>
> foo()
>
> # n = 1
> # uses parent.frame of foo
> #
> #    x y
> # 1: 3 3
> # 2: 4 2

Setting get(..., inherits=FALSE) reveals x isn't in parent.frame(1). That's base::eval itself ([.data.table internals call eval directly). get(..., inherits=TRUE) then finds "x" in .GlobalEnv via search().

>
> # n = 2
> # some kind of self join of data.table
> # output equivalent of (dt[dt[list(x)]])
>
> #    x y y.1
> # 1: 1 5   5
> # 2: 2 4   4
> # 3: 3 3   3
> # 4: 4 2   2
> # 5: 5 1   1

That's picking up the "x" column name in dt, because the eval inside [.data.table passes x to it. In other words, that's the normal place variables in i are looked for.

> # n = 3
> # uses parent.frame of call to `[.data.table`

Yes, I think so. I'm not always certain myself. I don't use parent.frame(3) but it seems to work.
>
> #    x y
> # 1: 1 5
> # 2: 2 4
>
> # n >= 4
> # uses parent.frame of foo again (makes sense I think)
> #    x y
> # 1: 5 1
> # 2: 4 2

If you need to refer to a specific scope without using parent.frame(n), this might be safer :

foo <- function(){
    x <- 1:2
    ..localenv = environment()
    print(dt[list(get('x', ..localenv, inherits=FALSE))])
}

which is what ..() is intended to do in future, built in :

foo <- function(){
    x <- 1:2
    print(dt[..(x)])
}

But currently I prefer doing :

foo <- function(){
    ..x <- 1:2
    print(dt[list(..x)])
}

which is what I meant at the top: in general it's probably best to use a unique name like ..tmpvar and let `i` or `j` find that via scope.

Matthew

From victor.kryukov at gmail.com Mon Mar 4 07:32:36 2013
From: victor.kryukov at gmail.com (Victor Kryukov)
Date: Sun, 3 Mar 2013 22:32:36 -0800
Subject: [datatable-help] Error in a package that imports data.table
In-Reply-To:
References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com>
Message-ID: <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com>

Hi Matthew,

my DESCRIPTION file has the following section:

Imports:
    data.table,
    lubridate

and my (generated) NAMESPACE contains

export(ProcessTransactionSurvey)
import(data.table)
import(lubridate)

My R CMD check (run with check() from devtools) mostly runs OK but fails at the end with the following error, which is expected since I haven't created any documentation yet. I'm not sure yet how to fix this LaTeX warning (I do have LaTeX installed on my machine).

* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
* checking PDF version of manual without hyperrefs or index ... ERROR
Error: Command failed (1)

Anything else I should check?

Victor


On Mar 3, 2013, at 3:26 PM, Matthew Dowle wrote:

>
> Hi,
>
> Did you include data.table in either the Imports or Depends field of your package's DESCRIPTION file?
>
> I've just improved data.table FAQ 6.9 to make that clearer.
> > If it still doesn't work, does your package fully pass "R CMD check"? > > Matthew > > > On 03.03.2013 22:25, Victor Kryukov wrote: >> Hello, >> >> I'm developing an R package which will be used internally at my >> company, and I have troubles using data.table. I'm very new to package >> development and I'm not really sure whether the errors I see are >> related to data.table or not, but here it is anyway. >> >> I have a function that imports data from .csv files and cleans the >> data (subsets, converting fields to numeric etc.). As the end of the >> function definition, I convert the resulting data.frame to data.table >> and return the result: >> >> ProcessData <- function(?) { >> ... >> df <- data.table(df) >> df >> } >> >> When I use this function standalone, after >> >> library(data.package) >> >> everything works as expected. However, when I'm defining this >> function as a part of a package and later call it, I'm getting the >> following error: >> >> Error in rbind(deparse.level, ...) : >> could not find function ".rbind.data.table" >> >> Please note that in the package .R files, I'm not importing >> data.table directly with library(data.package) but rather have >> `import(data.table)` statement in my NAMESPACE, as recommended here >> https://github.com/hadley/devtools/wiki/Namespaces. >> >> When I import data.table directly with library(data.table) after >> importing my package, everything works as expected. >> >> I suspect there may be something going wrong with namespaces in data.table. >> >> My environment: I'm using R 2.15.3 on Mac and have tested the above >> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to >> provide more info. Any help will be much appreciated! 
>> >> Regards, >> Victor >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Mon Mar 4 08:35:15 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 04 Mar 2013 07:35:15 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> Message-ID: <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Hi, I don't see what's wrong then. Can you whittle the package down to the essential code such that you can attach it and we can reproduce? Thanks, Matthew On 04.03.2013 06:32, Victor Kryukov wrote: > Hi Matthew, > > my DESCRIPTION file has the following section: > > Imports: > data.table, > lubridate > > and my (generated) NAMESPACE contains > > export(ProcessTransactionSurvey) > import(data.table) > import(lubridate) > > My R CMD CHECK (run with check() from devtools) mostly runs OK but > fails at the end with the following error, which is expected since I > haven't created any documentation yet. I'm not sure yet have to fix > this LaTeX warning (I do have latex installed on my machine). > > * checking PDF version of manual ... WARNING > LaTeX errors when creating PDF version. > This typically indicates Rd problems. > LaTeX errors found: > * checking PDF version of manual without hyperrefs or index ... ERROR > Error: Command failed (1) > > Anything else I should check? > > Victor > > > On Mar 3, 2013, at 3:26 PM, Matthew Dowle > wrote: > >> >> Hi, >> >> Did you include data.table in either the Imports or Depends field of >> your package's DESCRIPTION file? >> >> I've just improved data.table FAQ 6.9 to make that clearer. 
>> >> If it still doesn't work, does your package fully pass "R CMD >> check"? >> >> Matthew >> >> >> On 03.03.2013 22:25, Victor Kryukov wrote: >>> Hello, >>> >>> I'm developing an R package which will be used internally at my >>> company, and I have troubles using data.table. I'm very new to >>> package >>> development and I'm not really sure whether the errors I see are >>> related to data.table or not, but here it is anyway. >>> >>> I have a function that imports data from .csv files and cleans the >>> data (subsets, converting fields to numeric etc.). As the end of >>> the >>> function definition, I convert the resulting data.frame to >>> data.table >>> and return the result: >>> >>> ProcessData <- function(?) { >>> ... >>> df <- data.table(df) >>> df >>> } >>> >>> When I use this function standalone, after >>> >>> library(data.package) >>> >>> everything works as expected. However, when I'm defining this >>> function as a part of a package and later call it, I'm getting the >>> following error: >>> >>> Error in rbind(deparse.level, ...) : >>> could not find function ".rbind.data.table" >>> >>> Please note that in the package .R files, I'm not importing >>> data.table directly with library(data.package) but rather have >>> `import(data.table)` statement in my NAMESPACE, as recommended here >>> https://github.com/hadley/devtools/wiki/Namespaces. >>> >>> When I import data.table directly with library(data.table) after >>> importing my package, everything works as expected. >>> >>> I suspect there may be something going wrong with namespaces in >>> data.table. >>> >>> My environment: I'm using R 2.15.3 on Mac and have tested the above >>> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to >>> provide more info. Any help will be much appreciated! 
>>> >>> Regards, >>> Victor >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mailinglist.honeypot at gmail.com Mon Mar 4 22:39:22 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Mon, 4 Mar 2013 16:39:22 -0500 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: I'm not sure if order matters in the NAMESPACE, but maybe you could try to write it manually and put the import statements up top? I haven't come across this problem, and I've got several packages that use data.table via importing it as you show here ... On Mon, Mar 4, 2013 at 2:35 AM, Matthew Dowle wrote: > > Hi, > > I don't see what's wrong then. > > Can you whittle the package down to the essential code such that you can > attach it and we can reproduce? > > Thanks, > Matthew > > > > On 04.03.2013 06:32, Victor Kryukov wrote: >> >> Hi Matthew, >> >> my DESCRIPTION file has the following section: >> >> Imports: >> data.table, >> lubridate >> >> and my (generated) NAMESPACE contains >> >> export(ProcessTransactionSurvey) >> import(data.table) >> import(lubridate) >> >> My R CMD CHECK (run with check() from devtools) mostly runs OK but >> fails at the end with the following error, which is expected since I >> haven't created any documentation yet. I'm not sure yet have to fix >> this LaTeX warning (I do have latex installed on my machine). 
>> >> * checking PDF version of manual ... WARNING >> LaTeX errors when creating PDF version. >> This typically indicates Rd problems. >> LaTeX errors found: >> * checking PDF version of manual without hyperrefs or index ... ERROR >> Error: Command failed (1) >> >> Anything else I should check? >> >> Victor >> >> >> On Mar 3, 2013, at 3:26 PM, Matthew Dowle wrote: >> >>> >>> Hi, >>> >>> Did you include data.table in either the Imports or Depends field of your >>> package's DESCRIPTION file? >>> >>> I've just improved data.table FAQ 6.9 to make that clearer. >>> >>> If it still doesn't work, does your package fully pass "R CMD check"? >>> >>> Matthew >>> >>> >>> On 03.03.2013 22:25, Victor Kryukov wrote: >>>> >>>> Hello, >>>> >>>> I'm developing an R package which will be used internally at my >>>> company, and I have troubles using data.table. I'm very new to package >>>> development and I'm not really sure whether the errors I see are >>>> related to data.table or not, but here it is anyway. >>>> >>>> I have a function that imports data from .csv files and cleans the >>>> data (subsets, converting fields to numeric etc.). As the end of the >>>> function definition, I convert the resulting data.frame to data.table >>>> and return the result: >>>> >>>> ProcessData <- function(?) { >>>> ... >>>> df <- data.table(df) >>>> df >>>> } >>>> >>>> When I use this function standalone, after >>>> >>>> library(data.package) >>>> >>>> everything works as expected. However, when I'm defining this >>>> function as a part of a package and later call it, I'm getting the >>>> following error: >>>> >>>> Error in rbind(deparse.level, ...) : >>>> could not find function ".rbind.data.table" >>>> >>>> Please note that in the package .R files, I'm not importing >>>> data.table directly with library(data.package) but rather have >>>> `import(data.table)` statement in my NAMESPACE, as recommended here >>>> https://github.com/hadley/devtools/wiki/Namespaces. 
>>>> >>>> When I import data.table directly with library(data.table) after >>>> importing my package, everything works as expected. >>>> >>>> I suspect there may be something going wrong with namespaces in >>>> data.table. >>>> >>>> My environment: I'm using R 2.15.3 on Mac and have tested the above >>>> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to >>>> provide more info. Any help will be much appreciated! >>>> >>>> Regards, >>>> Victor >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From mdowle at mdowle.plus.com Wed Mar 6 09:47:08 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 06 Mar 2013 08:47:08 +0000 Subject: [datatable-help] v1.8.8 is now on CRAN Message-ID: <303fa2c18913ee4f367c1521e97117f0@imap.plus.net> Please see NEWS : https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable and the new paragraphs at the top of ?fread : https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable As normal it will take a few days to reach all mirrors. R-Forge has now bumped to 1.8.9. 
The idea of even numbers on CRAN is to make it impossible for anyone to be running a slightly different version of 1.8.8 other than the one on CRAN (1.8.8 has never been available from R-Forge, even fleetingly). Matthew From statquant at outlook.com Wed Mar 6 13:37:51 2013 From: statquant at outlook.com (stat quant) Date: Wed, 6 Mar 2013 13:37:51 +0100 Subject: [datatable-help] datatable-help Digest, Vol 37, Issue 4 In-Reply-To: References: Message-ID: Hello Matthew, many thanks for all the work and all the improvements on data.table. Just a practical question : looking on http://cran.r-project.org/web/packages/data.table/index.html I see that mac/win versions are still 1.8.6 unlike the sources to be built (tar.gz), is it an error or is it expected (I am not aware of what is requested by cran to package devs) Again many thanks 2013/3/6 > Send datatable-help mailing list submissions to > datatable-help at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > or, via email, send a message with subject or body 'help' to > datatable-help-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > datatable-help-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of datatable-help digest..." > > > Today's Topics: > > 1. 
v1.8.8 is now on CRAN (Matthew Dowle) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 06 Mar 2013 08:47:08 +0000 > From: Matthew Dowle > To: > Subject: [datatable-help] v1.8.8 is now on CRAN > Message-ID: <303fa2c18913ee4f367c1521e97117f0 at imap.plus.net> > Content-Type: text/plain; charset=UTF-8; format=flowed > > > Please see NEWS : > > > > https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable > > and the new paragraphs at the top of ?fread : > > > > https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable > > As normal it will take a few days to reach all mirrors. > > R-Forge has now bumped to 1.8.9. The idea of even numbers on CRAN is to > make it > impossible for anyone to be running a slightly different version of > 1.8.8 other than > the one on CRAN (1.8.8 has never been available from R-Forge, even > fleetingly). > > Matthew > > > > > ------------------------------ > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > End of datatable-help Digest, Vol 37, Issue 4 > ********************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Mar 6 14:02:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 06 Mar 2013 13:02:44 +0000 Subject: [datatable-help] datatable-help Digest, Vol 37, Issue 4 In-Reply-To: References: Message-ID: No problem. Yes that's normal. It just takes a few days for all CRAN web links to update. Some parts will update faster than others, too. Sometimes it might be our browsers as well, so a Ctrl+F5 to flush the browser's cache sometimes helps (or sometimes that's not even enough and full cache purge is needed). 
On the "CRAN checks:" page you'll see tests now passing OK for r-release (and r-oldrel) for Windows. That's the page I watch. Once those update to 1.8.8 (as they have) and say "OK" (as they have) it's usually not too long (within 24 hours) to update the .zip link. Those red ERRORs turning to black OK is (sadly) the exciting bit for me! On 06.03.2013 12:37, stat quant wrote: > Hello Matthew, > many thanks for all the work and all the improvements on data.table. > Just a practical question : > looking on http://cran.r-project.org/web/packages/data.table/index.html [12] I see that mac/win versions are still 1.8.6 unlike the sources to be built (tar.gz), is it an error or is it expected (I am not aware of what is requested by cran to package devs) > > Again many thanks > > 2013/3/6 > >> Send datatable-help mailing list submissions to >> datatable-help at lists.r-forge.r-project.org [1] >> >> To subscribe or unsubscribe via the World Wide Web, visit >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >> >> or, via email, send a message with subject or body 'help' to >> datatable-help-request at lists.r-forge.r-project.org [3] >> >> You can reach the person managing the list at >> datatable-help-owner at lists.r-forge.r-project.org [4] >> >> When replying, please edit your Subject line so it is more specific >> than "Re: Contents of datatable-help digest..." >> >> Today's Topics: >> >> 1. 
v1.8.8 is now on CRAN (Matthew Dowle) >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Wed, 06 Mar 2013 08:47:08 +0000 >> From: Matthew Dowle >> To: >> Subject: [datatable-help] v1.8.8 is now on CRAN >> Message-ID: <303fa2c18913ee4f367c1521e97117f0 at imap.plus.net [7]> >> Content-Type: text/plain; charset=UTF-8; format=flowed >> >> Please see NEWS : >> >> https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable [8] >> >> and the new paragraphs at the top of ?fread : >> >> https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable [9] >> >> As normal it will take a few days to reach all mirrors. >> >> R-Forge has now bumped to 1.8.9. The idea of even numbers on CRAN is to >> make it >> impossible for anyone to be running a slightly different version of >> 1.8.8 other than >> the one on CRAN (1.8.8 has never been available from R-Forge, even >> fleetingly). >> >> Matthew >> >> ------------------------------ >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [10] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [11] >> >> End of datatable-help Digest, Vol 37, Issue 4 >> ********************************************* Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:datatable-help-request at lists.r-forge.r-project.org [4] mailto:datatable-help-owner at lists.r-forge.r-project.org [5] mailto:mdowle at mdowle.plus.com [6] mailto:datatable-help at lists.r-forge.r-project.org [7] mailto:303fa2c18913ee4f367c1521e97117f0 at imap.plus.net [8] https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable [9] https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable [10] mailto:datatable-help at 
lists.r-forge.r-project.org
[11] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[12] http://cran.r-project.org/web/packages/data.table/index.html
[13] mailto:datatable-help-request at lists.r-forge.r-project.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From victor.kryukov at gmail.com Thu Mar 7 04:47:18 2013
From: victor.kryukov at gmail.com (Victor Kryukov)
Date: Wed, 6 Mar 2013 19:47:18 -0800
Subject: [datatable-help] Error in a package that imports data.table
In-Reply-To:
References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net>
Message-ID:

Hello everyone, and thanks for your replies.

It looks like the problem was in my use of lubridate with data.table. Removing lubridate from imports fixes it. Every time I load *both* of these packages in an R session, I get the following:

> library(lubridate)
> library(data.table)
data.table 1.8.8  For help type: help("data.table")

Attaching package: ‘data.table’

The following object(s) are masked from ‘package:lubridate’:

    hour, mday, month, quarter, wday, week, yday, year

I hadn't really paid attention to it, but now that I've started investigating, I noticed that data.table also defines all these functions as helpers to work with IDateTime. So there should be a name conflict somewhere. I'm puzzled about why data.table would include these functions/classes (isn't it better to leave date handling to specialized classes?), but I understand that there may be a good reason for that.

Unfortunately, my code uses lubridate heavily (it's just too good...), which leaves me in a tough spot - I would like to use both. If I had to choose, I would be forced to replace all lubridate code with standard R, which is not fun, but I guess I have to bite the bullet.
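[The masking reported above happens because both packages export functions named hour, month, year, etc., and whichever package is attached last wins on the search path. One workaround, sketched below under the assumption that both packages are installed: the masked versions stay reachable through explicit `::` qualification, so lubridate-heavy code can keep working even with data.table attached last.]

```r
library(lubridate)
library(data.table)   # attached last, so its month(), year(), ... mask lubridate's

d <- as.Date("2013-03-06")

# Explicit namespace qualification side-steps the search-path masking:
lubridate::month(d)    # lubridate's month(), regardless of attach order
data.table::month(d)   # data.table's IDateTime helper
```

Inside a package the same idea applies: calls written as lubridate::month() resolve in lubridate's namespace no matter what the user has attached.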
Regards, Victor Yours Sincerely, Victor Kryukov US cell: +1-650-733-6510 On Mon, Mar 4, 2013 at 1:39 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > I'm not sure if order matters in the NAMESPACE, but maybe you could > try to write it manually and put the import statements up top? > > I haven't come across this problem, and I've got several packages that > use data.table via importing it as you show here ... > > On Mon, Mar 4, 2013 at 2:35 AM, Matthew Dowle > wrote: > > > > Hi, > > > > I don't see what's wrong then. > > > > Can you whittle the package down to the essential code such that you can > > attach it and we can reproduce? > > > > Thanks, > > Matthew > > > > > > > > On 04.03.2013 06:32, Victor Kryukov wrote: > >> > >> Hi Matthew, > >> > >> my DESCRIPTION file has the following section: > >> > >> Imports: > >> data.table, > >> lubridate > >> > >> and my (generated) NAMESPACE contains > >> > >> export(ProcessTransactionSurvey) > >> import(data.table) > >> import(lubridate) > >> > >> My R CMD CHECK (run with check() from devtools) mostly runs OK but > >> fails at the end with the following error, which is expected since I > >> haven't created any documentation yet. I'm not sure yet have to fix > >> this LaTeX warning (I do have latex installed on my machine). > >> > >> * checking PDF version of manual ... WARNING > >> LaTeX errors when creating PDF version. > >> This typically indicates Rd problems. > >> LaTeX errors found: > >> * checking PDF version of manual without hyperrefs or index ... ERROR > >> Error: Command failed (1) > >> > >> Anything else I should check? > >> > >> Victor > >> > >> > >> On Mar 3, 2013, at 3:26 PM, Matthew Dowle > wrote: > >> > >>> > >>> Hi, > >>> > >>> Did you include data.table in either the Imports or Depends field of > your > >>> package's DESCRIPTION file? > >>> > >>> I've just improved data.table FAQ 6.9 to make that clearer. > >>> > >>> If it still doesn't work, does your package fully pass "R CMD check"? 
> >>> > >>> Matthew > >>> > >>> > >>> On 03.03.2013 22:25, Victor Kryukov wrote: > >>>> > >>>> Hello, > >>>> > >>>> I'm developing an R package which will be used internally at my > >>>> company, and I have troubles using data.table. I'm very new to package > >>>> development and I'm not really sure whether the errors I see are > >>>> related to data.table or not, but here it is anyway. > >>>> > >>>> I have a function that imports data from .csv files and cleans the > >>>> data (subsets, converting fields to numeric etc.). As the end of the > >>>> function definition, I convert the resulting data.frame to data.table > >>>> and return the result: > >>>> > >>>> ProcessData <- function(?) { > >>>> ... > >>>> df <- data.table(df) > >>>> df > >>>> } > >>>> > >>>> When I use this function standalone, after > >>>> > >>>> library(data.package) > >>>> > >>>> everything works as expected. However, when I'm defining this > >>>> function as a part of a package and later call it, I'm getting the > >>>> following error: > >>>> > >>>> Error in rbind(deparse.level, ...) : > >>>> could not find function ".rbind.data.table" > >>>> > >>>> Please note that in the package .R files, I'm not importing > >>>> data.table directly with library(data.package) but rather have > >>>> `import(data.table)` statement in my NAMESPACE, as recommended here > >>>> https://github.com/hadley/devtools/wiki/Namespaces. > >>>> > >>>> When I import data.table directly with library(data.table) after > >>>> importing my package, everything works as expected. > >>>> > >>>> I suspect there may be something going wrong with namespaces in > >>>> data.table. > >>>> > >>>> My environment: I'm using R 2.15.3 on Mac and have tested the above > >>>> on both data.table 1.8.6 and 1.8.7. Please let me know if I need to > >>>> provide more info. Any help will be much appreciated! 
> >>>> > >>>> Regards, > >>>> Victor > >>>> > >>>> _______________________________________________ > >>>> datatable-help mailing list > >>>> datatable-help at lists.r-forge.r-project.org > >>>> > >>>> > >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >>> > >>> > >> > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu Mar 7 05:09:29 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 6 Mar 2013 23:09:29 -0500 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Hi, On Wed, Mar 6, 2013 at 10:47 PM, Victor Kryukov wrote: [snip] > I'm puzzled about why data table would include this function/classes (isn't > it better to leave data handling to specialized classes?), but I understand > that there may be a good reason for that. I became a data.table user after IDateTime was in there (and I don't ever use it, actually), but my *guess* would be that it's there to use dates as keys for data.table ... 
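[Editor's note: Steve's guess can be illustrated directly. The sketch below (column names invented) shows why integer-backed IDate columns make convenient, fast-sorting keys, which is the point of the "I" in IDate.]

```r
library(data.table)

# IDate stores a date as an integer day count, unlike base R's double-backed Date.
dt <- data.table(day   = as.IDate(c("2013-03-07", "2013-03-01", "2013-03-04")),
                 value = c(10, 20, 30))
setkey(dt, day)                 # integer storage keeps the sort/key cheap

dt[J(as.IDate("2013-03-04"))]   # keyed (binary search) lookup by date

stopifnot(is.integer(unclass(dt$day)))  # the underlying vector really is integer
```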
> Unfortunately, my code is using > lubridate heavily (it's just too good...), which leaves me in a tough spot - > I would like to use both. If I had to choose, I would be forced to replace > all lubridate code with standard R, which is not fun, but I guess I have to > bite the bullet. You don't have to choose one over the other. I suspect import order could do the trick. Perhaps import()-ing data.table first, then lubridate might be all you have to do. If not, I *think* if you define hour, mday, month, etc. in your package code as: mday <- lubridate::mday hour <- lubridate::hour And ensure that those functions are loaded first (either by using Collate: and specifying that file first, or putting them in a file called aaa.R or something), perhaps your code will recover "just like that" If that doesn't work either, another option is that you just prefix every lubridate call in your package code with the lubridate package name, e.g. instead of `year(whenever)` you do `lubridate::year(whenever)`. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From victor.kryukov at gmail.com Thu Mar 7 05:16:33 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 20:16:33 -0800 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Thanks Steve. That's a good suggestion regarding the order, or fully specifying lubridate names. I actually downloaded the data.table code and looked through it, and unless I'm missing something, IDateTime is totally separate from everything else.
At least if you search for 'IDate' or 'hour' or 'minute', you don't find them mentioned in other .R or .c files besides IDateTime.R And yes - lubridate IS mentioned in IDateTime.R code :) ################################################################### # Date - time extraction functions # Adapted from Hadley Wickham's routines cited below to ensure # integer results. # http://gist.github.com/10238 # See also Hadley's more advanced and complex lubridate package: # http://github.com/hadley/lubridate # lubridate routines do not return integer values. ################################################################### On Wed, Mar 6, 2013 at 8:09 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi, > > On Wed, Mar 6, 2013 at 10:47 PM, Victor Kryukov > wrote: > [snip] > > I'm puzzled about why data table would include this function/classes > (isn't > > it better to leave data handling to specialized classes?), but I > understand > > that there may be a good reason for that. > > I became a data.table user after IDateTime was in there (and I don't > ever use it, actually), but my *guess* would be that it's there to use > dates as keys for data.table ... > > > Unfortunately, my code is using > > lubridate heavily (it's just too good...), which leaves me in a tough > spot - > > I would like to use both. If I had to choose, I would be forced to > replace > > all lubridate code with standard R, which is not fun, but I guess I have > to > > bite the bullet. > > You don't have to choose one over the other. > > I suspect import order could do the trick. Perhaps import()-ing > data.table first, then lubridate might be all you have to do. > > If not, I *think* if you define hour, mday, mont, etc. 
in your package > code as: > > mday <- lubridate::mday > hour <- lubridate::hour > > And ensure that those functions are loaded first (either by using > Collate: and specifying that file first, or putting that in a function > called aaa.R or something), perhaps your code will recover "just like > that" > > If that doesn't work either, another option is that you just prefix > every lubridate call in your package code with the lubridate package > name, eg. instead of `year(whenever)` you do > `lubridate::year(whenever)`. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.kryukov at gmail.com Thu Mar 7 06:22:08 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 21:22:08 -0800 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Update: it looks like the order in NAMESPACE doesn't matter for that particular problem. I can confirm that when I change it, the order of package loading changes, as it's either data.table or lubridate that warns about overwriting the other's functions, but the problem exists in either case. I think my next step will be to perform surgery on data.table by removing all IDateTime from my local copy - I'll see if it helps :). On Wed, Mar 6, 2013 at 8:16 PM, Victor Kryukov wrote: > Thanks Steve. That's a good suggestion regarding the order or fully > specifying lubridate names. > > I actually downloaded data.table code and looked through it, and unless > I'm missing something, IDateTime is totally separate from everything else.
> At least if you search for 'IDate' or 'hour' or 'minute', you don't find > them mentioned in other .R or .c files besides IDateTime.R > > And yes - lubridate IS mentioned in IDateTime.R code :) > > ################################################################### > # Date - time extraction functions > # Adapted from Hadley Wickham's routines cited below to ensure > # integer results. > # http://gist.github.com/10238 > # See also Hadley's more advanced and complex lubridate package: > # http://github.com/hadley/lubridate > # lubridate routines do not return integer values. > ################################################################### > > > On Wed, Mar 6, 2013 at 8:09 PM, Steve Lianoglou < > mailinglist.honeypot at gmail.com> wrote: > >> Hi, >> >> On Wed, Mar 6, 2013 at 10:47 PM, Victor Kryukov >> wrote: >> [snip] >> > I'm puzzled about why data table would include this function/classes >> (isn't >> > it better to leave data handling to specialized classes?), but I >> understand >> > that there may be a good reason for that. >> >> I became a data.table user after IDateTime was in there (and I don't >> ever use it, actually), but my *guess* would be that it's there to use >> dates as keys for data.table ... >> >> > Unfortunately, my code is using >> > lubridate heavily (it's just too good...), which leaves me in a tough >> spot - >> > I would like to use both. If I had to choose, I would be forced to >> replace >> > all lubridate code with standard R, which is not fun, but I guess I >> have to >> > bite the bullet. >> >> You don't have to choose one over the other. >> >> I suspect import order could do the trick. Perhaps import()-ing >> data.table first, then lubridate might be all you have to do. >> >> If not, I *think* if you define hour, mday, mont, etc. 
in your package >> code as: >> >> mday <- lubridate::mday >> hour <- lubridate::hour >> >> And ensure that those functions are loaded first (either by using >> Collate: and specifying that file first, or putting that in a function >> called aaa.R or something), perhaps your code will recover "just like >> that" >> >> If that doesn't work either, another option is that you just prefix >> every lubridate call in your package code with the lubridate package >> name, eg. instead of `year(whenever)` you do >> `lubridate::year(whenever)`. >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu Mar 7 06:40:11 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Thu, 7 Mar 2013 00:40:11 -0500 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: Hi, On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov wrote: > Update: it looks the order in NAMESPACE doesn't matter for that particular > problem. I can confirm that when I change it the order of package loading > changes, as it's either data.table or lubridate that warns about > overwritting each other's functions, but the problem exists in either case. > > I think my next steps will be to perform a surgery on data.table by removing > all IDateTime from my local copy - will see if it helps :). It's your prerogative to do what you like, but I feel like the other two alternatives I gave are a bit less intense than what you are proposing, no? 
It also has the bonus feature of not requiring a non-standard data.table install, which is good if you expect anybody else to use your package. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From victor.kryukov at gmail.com Thu Mar 7 06:49:56 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 21:49:56 -0800 Subject: [datatable-help] Building data.table 1.8.9 from sources fails Message-ID: Hello, when I'm trying to build freshly cloned data.table 1.8.9 from source via either 'R CMD build .' in pkg/ directory or calling devtools build(), it fails with the following error. I don't really understand what's going on, as in lines 16-17 in datatable-faq.Rnw we have if (!exists("data.table",.GlobalEnv)) library(data.table) # see Intro.Rnw for comments on these two lines rm(list=as.character(tables()$NAME),envir=.GlobalEnv) and first line should load data.table if not loaded. Even when I load it explicitly in line 16 via library(data.table), i.e. removing if(), it fails with the same error. Any ideas why? 'R CMD check .' finishes without any problems. My systemInfo() is below just in case. Regards, Victor ==== > build() '/usr/local/Cellar/r/2.15.3/R.framework/Resources/bin/R' --vanilla CMD build '/Users/victor/Documents/R/datatable/pkg' --no-manual --no-resave-data * checking for file '/Users/victor/Documents/R/datatable/pkg/DESCRIPTION' ... OK * preparing 'data.table': * checking DESCRIPTION meta-information ... OK * cleaning src * installing the package to re-build vignettes * creating vignettes ... 
ERROR Error: processing vignette 'datatable-faq.Rnw' failed with diagnostics: chunk 1 Error in rm(list = as.character(tables()$NAME), envir = .GlobalEnv) : could not find function "tables" Execution halted Error: Command failed (1) ==== > sessionInfo() R version 2.15.3 (2013-03-01) Platform: x86_64-apple-darwin12.2.1 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.kryukov at gmail.com Thu Mar 7 06:55:26 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Wed, 6 Mar 2013 21:55:26 -0800 Subject: [datatable-help] Building data.table 1.8.9 from sources fails In-Reply-To: References: Message-ID: Please ignore this. I've accidentally replaced NAMESPACE by running document(). Everything builds fine from fresh source. I should really go home now... Yours Sincerely, Victor Kryukov On Wed, Mar 6, 2013 at 9:49 PM, Victor Kryukov wrote: > Hello, > > when I'm trying to build freshly cloned data.table 1.8.9 from source via > either 'R CMD build .' in pkg/ directory or calling devtools build(), it > fails with the following error. I don't really understand what's going on, > as in lines 16-17 in datatable-faq.Rnw we have > > if (!exists("data.table",.GlobalEnv)) library(data.table) # see Intro.Rnw > for comments on these two lines > rm(list=as.character(tables()$NAME),envir=.GlobalEnv) > > and first line should load data.table if not loaded. Even when I load it > explicitly in line 16 via library(data.table), i.e. removing if(), it fails > with the same error. > > Any ideas why? > > 'R CMD check .' finishes without any problems. My systemInfo() is below > just in case. 
> > Regards, > Victor > > ==== > > > build() > '/usr/local/Cellar/r/2.15.3/R.framework/Resources/bin/R' --vanilla CMD > build '/Users/victor/Documents/R/datatable/pkg' --no-manual > --no-resave-data > > * checking for file '/Users/victor/Documents/R/datatable/pkg/DESCRIPTION' > ... OK > * preparing 'data.table': > * checking DESCRIPTION meta-information ... OK > * cleaning src > * installing the package to re-build vignettes > * creating vignettes ... ERROR > > Error: processing vignette 'datatable-faq.Rnw' failed with diagnostics: > chunk 1 > Error in rm(list = as.character(tables()$NAME), envir = .GlobalEnv) : > could not find function "tables" > Execution halted > Error: Command failed (1) > > ==== > > > sessionInfo() > R version 2.15.3 (2013-03-01) > Platform: x86_64-apple-darwin12.2.1 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 7 09:55:25 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 08:55:25 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> Message-ID: <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Victor, As Steve says you shouldn't need to do that. If it's just the mask warnings you're trying to suppress have you tried : suppressPackageStartupMessages({ library(...) library(...) }) I haven't used lubdridate before. I tried : > install.packages("lubdridate") Warning message: package ?lubdridate? is not available (for R version 2.15.3) > Seems odd. Anyway: is lubridate fast? As the code comment you pasted said, it stores Date as numeric (type double) doesn't it, as base R does? 
Won't that mean sorting won't be as fast on it? That's the reason IDate exists and what the I stands for. Matthew On 07.03.2013 05:40, Steve Lianoglou wrote: > Hi, > > On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov > wrote: >> Update: it looks the order in NAMESPACE doesn't matter for that >> particular >> problem. I can confirm that when I change it the order of package >> loading >> changes, as it's either data.table or lubridate that warns about >> overwritting each other's functions, but the problem exists in >> either case. >> >> I think my next steps will be to perform a surgery on data.table by >> removing >> all IDateTime from my local copy - will see if it helps :). > > It's your prerogative to do what you like, but I feel like the other > two alternatives I gave are a bit less intense than what you are > proposing, no? > > It also has the bonus feature of not requiring a non-standard > data.table install, which is good if you expect anybody else to use > your package. > > -steve From statquant at outlook.com Thu Mar 7 13:49:51 2013 From: statquant at outlook.com (statquant3) Date: Thu, 7 Mar 2013 04:49:51 -0800 (PST) Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> References: <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Message-ID: <1362660591788-4660598.post@n4.nabble.com> Hello, I do not think lubridate is fast, It just acts as a syntax sweetener. Here is the description of the package by its author : Quote: Lubridate makes it easier to work with dates and times by providing functions to identify and parse date-time data,extract and modify components of a datetime (years, months,days, hours, minutes, and seconds), perform accurate math on date-times, handle time zones and Daylight Savings Time. 
Lubridate has a consistent, memorable syntax, that makes working with dates fun instead of frustrating. As far as I know, no package provides faster datetime handling, even if data.table proposes IDate, ITime... which store integers for fast sorting. (The problem for me is that it does not support sub-seconds, but we spoke about that already.) If any package provides a faster implementation for datetimes I'll be glad to hear about it. -- View this message in context: http://r.789695.n4.nabble.com/Error-in-a-package-that-imports-data-table-tp4660173p4660598.html Sent from the datatable-help mailing list archive at Nabble.com. From saporta at scarletmail.rutgers.edu Thu Mar 7 17:45:02 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 7 Mar 2013 11:45:02 -0500 Subject: [datatable-help] unique, by full row (not just key) Message-ID: I have a keyed data.table, DT, with 800k rows, of which about 0.5% are duplicates that need to be removed. Using unique(DT) of course whittles down the whole table to one row per key. I would like to get results similar to unique.data.frame(DT). Two problems with using unique.data.frame: (1) speed, and (2) loss of key(DT). So instead I'm using a wrapper that (1) caches key(DT), (2) removes the key, (3) calls unique on DT, and (4) then reapplies the key. However, this is convoluted (and also requires modifying setkey(.) and getdots(.)). It occurs to me that I might be overlooking a simpler alternative. Any thoughts? Thanks, Rick _Here is what I am using_: uniqueRows <- function(DT) { # If not keyed (or not a data.table), regular unique(DT) already compares full rows if (!haskey(DT) || !is.data.table(DT) ) return(unique(DT)) .key <- key(DT) setkey(DT, NULL) setkeyE(unique(DT), eval(.key)) } getdotsWithEval <- function () { dots <- as.character(match.call(sys.function(-1), call = sys.call(-1), expand.dots = FALSE)$...)
if (grepl("^eval\\(", dots) && grepl("\\)$", dots)) return(eval(parse(text=dots))) return(dots) } setkeyE <- function (x, ..., verbose = getOption("datatable.verbose")) { # SAME AS setkey(.) WITH ADDITION THAT # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED if (is.character(x)) stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.") #** THIS IS THE MODIFIED LINE **# # OLD**: cols = getdots() cols <- getdotsWithEval() if (!length(cols)) cols = colnames(x) else if (identical(cols, "NULL")) cols = NULL setkeyv(x, cols, verbose = verbose) } -- Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 7 18:03:00 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 17:03:00 +0000 Subject: [datatable-help] unique, by full row (not just key) In-Reply-To: References: Message-ID: <0e76e44156d43f0cef683fd5ec170873@imap.plus.net> Hi, Are the duplicates next to each other in the table? Or could duplicates be within each key, separated by other rows? If duplicates are together, calling data.table:::duplist directly should do it. (see source of data.table:::unique.data.table). It loops through the rows by column and works like diff(x)==0 would i.e. looking at the previous row only, but does compare all columns. If a subset of columns are needed, then maybe a data.table:::shallow followed by column removal of the ones you don't need on that shallow copy (the shallow copy and column removal being instant). Just because duplist doesn't accept a subset of the list of columns it is passed. shallow() is on the agenda to be exported for user use (so suggesting it is an excuse to get you to test it!). Hadn't thought about duplist but could do, too. They are both relied on internally, so should be reliable. 
But as soon as they're exported we can't make non-backwards compatible changes to them. Matthew On 07.03.2013 16:45, Ricardo Saporta wrote: > I have a keyed data.table, DT, with 800k rows, of which about 0.5% are duplicates that need to removed. > Using unique(DT) of course widdles down the whole table to one row per key. > I would like to get results similar to unique.data.frame(DT) > Two problems with using unique.data.frame: (1) Speed (2) loss of key(DT) > So instead Im using a wrapper that > (1) caches key(DT) (2) removes the key (3) calls unique on DT (4) then repplies the key > However, this is convoluted (and also requires modifying setkey(.) and getdots(.)). > It occurs to me that I might be overlooking a simpler alternative. > anythoughts? > Thanks, > Rick > _Here is what I am using_: > uniqueRows > # If already keyed (or not a DT), use regular unique(DT) > if (!haskey(DT) || !is.data.table(x) ) > return(unique(DT)) > .key > setkey(DT, NULL) > setkeyE(unique(DT), eval(.key)) > } > getdotsWithEval > dots > as.character(match.call(sys.function(-1), call = sys.call(-1), > expand.dots = FALSE)$...) > if (grepl("^eval\(", dots) && grepl("\)$", dots)) > return(eval(parse(text=dots))) > return(dots) > } > setkeyE > # SAME AS setkey(.) WITH ADDITION THAT > # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED > if (is.character(x)) > stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.") > #** THIS IS THE MODIFIED LINE **# > # OLD**: cols = getdots() > cols > if (!length(cols)) > cols = colnames(x) > else if (identical(cols, "NULL")) > cols = NULL > setkeyv(x, cols, verbose = verbose) > } -- > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [1] Links: ------ [1] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Thu Mar 7 18:07:28 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 17:07:28 +0000 Subject: [datatable-help] unique, by full row (not just key) In-Reply-To: <0e76e44156d43f0cef683fd5ec170873@imap.plus.net> References: <0e76e44156d43f0cef683fd5ec170873@imap.plus.net> Message-ID: <62cd21531cb6e8953cd952a2283e8693@imap.plus.net> Which means that unique.data.table itself can be improved internally, in the way I just suggested using shallow() ... Most of the time the key will be small so that copy of the key columns to pass to duplist won't be huge, but, still a copy. And could slow down key only tables most, relatively. On 07.03.2013 17:03, Matthew Dowle wrote: > Hi, > > Are the duplicates next to each other in the table? Or could duplicates be within each key, separated by other rows? > > If duplicates are together, calling data.table:::duplist directly should do it. (see source of data.table:::unique.data.table). It loops through the rows by column and works like diff(x)==0 would i.e. looking at the previous row only, but does compare all columns. If a subset of columns are needed, then maybe a data.table:::shallow followed by column removal of the ones you don't need on that shallow copy (the shallow copy and column removal being instant). Just because duplist doesn't accept a subset of the list of columns it is passed. > > shallow() is on the agenda to be exported for user use (so suggesting it is an excuse to get you to test it!). Hadn't thought about duplist but could do, too. They are both relied on internally, so should be reliable. But as soon as they're exported we can't make non-backwards compatible changes to them. > > Matthew > > On 07.03.2013 16:45, Ricardo Saporta wrote: > >> I have a keyed data.table, DT, with 800k rows, of which about 0.5% are duplicates that need to removed. >> Using unique(DT) of course widdles down the whole table to one row per key. 
>> I would like to get results similar to unique.data.frame(DT) >> Two problems with using unique.data.frame: (1) Speed (2) loss of key(DT) >> So instead Im using a wrapper that >> (1) caches key(DT) (2) removes the key (3) calls unique on DT (4) then repplies the key >> However, this is convoluted (and also requires modifying setkey(.) and getdots(.)). >> It occurs to me that I might be overlooking a simpler alternative. >> anythoughts? >> Thanks, >> Rick >> _Here is what I am using_: >> uniqueRows >> # If already keyed (or not a DT), use regular unique(DT) >> if (!haskey(DT) || !is.data.table(x) ) >> return(unique(DT)) >> .key >> setkey(DT, NULL) >> setkeyE(unique(DT), eval(.key)) >> } >> getdotsWithEval >> dots >> as.character(match.call(sys.function(-1), call = sys.call(-1), >> expand.dots = FALSE)$...) >> if (grepl("^eval\(", dots) && grepl("\)$", dots)) >> return(eval(parse(text=dots))) >> return(dots) >> } >> setkeyE >> # SAME AS setkey(.) WITH ADDITION THAT >> # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED >> if (is.character(x)) >> stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.") >> #** THIS IS THE MODIFIED LINE **# >> # OLD**: cols = getdots() >> cols >> if (!length(cols)) >> cols = colnames(x) >> else if (identical(cols, "NULL")) >> cols = NULL >> setkeyv(x, cols, verbose = verbose) >> } -- >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu [1] Links: ------ [1] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From victor.kryukov at gmail.com Thu Mar 7 18:25:31 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Thu, 7 Mar 2013 09:25:31 -0800 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Message-ID: OK, I think I have solved it. The problem seemed to be related to FAQ 2.23. When I was *importing* data.table with 'Imports:', I think what was going on is that R was making functions from data.table's namespace available to my package, but the data.table package itself was not loaded. As a consequence, .onLoad was never called and hence FAQ 2.23's magic never happened. Now my depends section in DESCRIPTION looks like this: Depends: data.table, lubridate and everything seems to work - no error messages about .rbind.data.table not being available, and lubridate's hour, minute etc. mask data.table's, which is what's expected. The order does matter in that case. Thanks to Matthew and Steve for providing support. At least I had a reason to download data.table and poke around its sources; I wish it were available on github... Regards, Victor On Thu, Mar 7, 2013 at 12:55 AM, Matthew Dowle wrote: > > Victor, > > As Steve says you shouldn't need to do that. > > If it's just the mask warnings you're trying to suppress have you tried : > > suppressPackageStartupMessages({ > library(...) > library(...) > }) > > I haven't used lubdridate before. I tried : > > install.packages("lubdridate") >> > Warning message: > package 'lubdridate' is not available (for R version 2.15.3) > >> >> > Seems odd. Anyway: is lubridate fast? As the code comment you pasted > said, it stores Date as numeric (type double) doesn't it, as base R does? > Won't that mean sorting won't be as fast on it?
That's the reason IDate > exists and what the I stands for. > > Matthew > > > > On 07.03.2013 05:40, Steve Lianoglou wrote: > >> Hi, >> >> On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov >> wrote: >> >>> Update: it looks the order in NAMESPACE doesn't matter for that >>> particular >>> problem. I can confirm that when I change it the order of package loading >>> changes, as it's either data.table or lubridate that warns about >>> overwritting each other's functions, but the problem exists in either >>> case. >>> >>> I think my next steps will be to perform a surgery on data.table by >>> removing >>> all IDateTime from my local copy - will see if it helps :). >>> >> >> It's your prerogative to do what you like, but I feel like the other >> two alternatives I gave are a bit less intense than what you are >> proposing, no? >> >> It also has the bonus feature of not requiring a non-standard >> data.table install, which is good if you expect anybody else to use >> your package. >> >> -steve >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 7 18:57:55 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 07 Mar 2013 17:57:55 +0000 Subject: [datatable-help] Error in a package that imports data.table In-Reply-To: References: <1527B53B-4FB5-4937-8FEA-867EDC588482@gmail.com> <2DD70661-7319-4D42-A308-9F3A6EC50AD0@gmail.com> <6f4c9b413e4fd40bb994370d8993be4e@imap.plus.net> <0d3f4e3e8834acc589df3d422d71b13c@imap.plus.net> Message-ID: Interesting, thanks for update. That's news to me. But then how do the datatable options get set if it's just Imported ? .onLoad sets those options too. Does any other function get run when a package is imported. Is there a .onImport ? That can't be right, otherwise how do datatable options get set for the 3 packages on CRAN that Import data.table? Hm... 
Just to check, you know you can poke around the source (updated in real time) online too : https://r-forge.r-project.org/scm/viewvc.php/?root=datatable [3] Not as pretty as github but just checking you know you can browse there. Matthew On 07.03.2013 17:25, Victor Kryukov wrote: > OK, I think I have solved it. The problem seemed to be related to FAQ 2.23. > > When I was *importing* data.table with 'Imports:', I think what was going on is that R was making functions from data.table's namespace available to my package, but the data.table package itself was not loaded. As a consequence, .onLoad was never called and hense FAQ 2.23's magic never happened. > > Now my depends section in DESCRIPTION looks like this: > > Depends: > data.table, > lubridate > > and everything seems to work - no error messages about .rbind.data.table not available, and lubridate's hour, minute etc. mask data.table's, which is what expected. The order does matter in that case. > > Thanks for Matthew and Steve for providing support. At least I had a reason to downloaded data.table and poke around its sources; wish it was available on github... > > Regards, > Victor > > On Thu, Mar 7, 2013 at 12:55 AM, Matthew Dowle wrote: > >> Victor, >> >> As Steve says you shouldn't need to do that. >> >> If it's just the mask warnings you're trying to suppress have you tried : >> >> suppressPackageStartupMessages({ >> library(...) >> library(...) >> }) >> >> I haven't used lubdridate before. I tried : >> >>> install.packages("lubdridate") >> Warning message: >> package 'lubdridate' is not available (for R version 2.15.3) >> >> Seems odd. Anyway: is lubridate fast? As the code comment you pasted said, it stores Date as numeric (type double) doesn't it, as base R does? Won't that mean sorting won't be as fast on it? That's the reason IDate exists and what the I stands for. 
>> >> Matthew >> >> On 07.03.2013 05:40, Steve Lianoglou wrote: >> >>> Hi, >>> >>> On Thu, Mar 7, 2013 at 12:22 AM, Victor Kryukov >>> wrote: >>> >>>> Update: it looks the order in NAMESPACE doesn't matter for that particular >>>> problem. I can confirm that when I change it the order of package loading >>>> changes, as it's either data.table or lubridate that warns about >>>> overwritting each other's functions, but the problem exists in either case. >>>> >>>> I think my next steps will be to perform a surgery on data.table by removing >>>> all IDateTime from my local copy - will see if it helps :). >>> >>> It's your prerogative to do what you like, but I feel like the other >>> two alternatives I gave are a bit less intense than what you are >>> proposing, no? >>> >>> It also has the bonus feature of not requiring a non-standard >>> data.table install, which is good if you expect anybody else to use >>> your package. >>> >>> -steve Links: ------ [1] mailto:victor.kryukov at gmail.com [2] mailto:mdowle at mdowle.plus.com [3] https://r-forge.r-project.org/scm/viewvc.php/?root=datatable -------------- next part -------------- An HTML attachment was scrubbed... URL: From lambandme at gmail.com Fri Mar 8 06:25:12 2013 From: lambandme at gmail.com (Yi Yuan) Date: Fri, 8 Mar 2013 00:25:12 -0500 Subject: [datatable-help] How to do binary search on integer key? Message-ID: Hi, all: I know someone has asked exactly the same question and there's even an answer. But I think the answer is wrong. Following is the url of that question http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html so if the key is integer and I would like to select all records where the key=654, how do I do that? suppose the data table is named table, key variable's name is id I know you can do it by writing: table[id==645,], but R will conduct vector search this way and is a lot slower than binary search. So how can I do binary search on integer key?? Thanks!! 
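(An illustrative sketch of the keyed lookup being asked about, added here for reference and not part of the original post; the toy table and sizes below are made up, and the data.table package is assumed to be installed:)

```r
library(data.table)

# Toy table: each integer id appears 18 times; key the table on id
DT <- data.table(id = rep(1:1000, each = 18), val = 1L)
setkey(DT, id)

scan <- DT[id == 645L]   # vector scan: tests id == 645 on every row
bins <- DT[J(645L)]      # binary search: value passed as a one-row join table

nrow(scan)  # 18
nrow(bins)  # 18
```

J(), list() and .() are interchangeable here; passing the value as a join table is what makes data.table use a binary search on the key instead of scanning the whole column.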
-------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.nelson at sydney.edu.au Fri Mar 8 06:42:00 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Fri, 8 Mar 2013 05:42:00 +0000 Subject: [datatable-help] How to do binary search on integer key? In-Reply-To: References: Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au>

You just need to wrap the values in `list()` or `.()`, e.g.

table[list(645)]
table[.(645)]

________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Yi Yuan [lambandme at gmail.com] Sent: Friday, 8 March 2013 4:25 PM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] How to do binary search on integer key?

Hi, all:

I know someone has asked exactly the same question and there's even an answer. But I think the answer is wrong. Following is the url of that question http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html

so if the key is integer and I would like to select all records where the key=654, how do I do that? suppose the data table is named table, key variable's name is id

I know you can do it by writing: table[id==645,], but R will conduct vector search this way and is a lot slower than binary search.

So how can I do binary search on integer key??

Thanks!!

-------------- next part -------------- An HTML attachment was scrubbed... URL: From lambandme at gmail.com Fri Mar 8 06:48:08 2013 From: lambandme at gmail.com (Yi Yuan) Date: Fri, 8 Mar 2013 00:48:08 -0500 Subject: [datatable-help] How to do binary search on integer key?
In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: I tried table[list(645)] table[.(645)] table[J(45)] they're all returning 78 records when in fact there should only be 18 records related to key 645. However if I use table[id==645,], I get the right result. On Fri, Mar 8, 2013 at 12:42 AM, Michael Nelson < michael.nelson at sydney.edu.au> wrote: > you just need to wrap the values in `list()` or `.() > > > eg > > table[list(645)] > > table[.(645)] > ------------------------------ > *From:* datatable-help-bounces at lists.r-forge.r-project.org [ > datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Yi Yuan [ > lambandme at gmail.com] > *Sent:* Friday, 8 March 2013 4:25 PM > *To:* datatable-help at lists.r-forge.r-project.org > *Subject:* [datatable-help] How to do binary search on integer key? > > Hi, all: > I know someone has asked exactly the same question and there's even an > answer. But I think the answer is wrong. Following is the url of that > question > http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html > > so if the key is integer and I would like to select all records where > the key=654, how do I do that? > suppose the data table is named table, key variable's name is id > > I know you can do it by writing: table[id==645,], but R will conduct > vector search this way and is a lot slower than binary search. > > So how can I do binary search on integer key?? > > Thanks!! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Mar 8 09:10:40 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 8 Mar 2013 08:10:40 -0000 Subject: [datatable-help] How to do binary search on integer key? 
In-Reply-To: References: <6FB5193A6CDCDF499486A833B7AFBDCD5827E395@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <0d9770e89cac54ee8504c4fb4f20a0e2.squirrel@webmail.plus.net>

I assume that 45 is a typo and should be 645. All I can think is that you're using an architecture not covered by CRAN. Only 32bit and 64bit on Unix, Mac or Windows is covered, not anything else.

Please provide the output of:

sessionInfo()
test.data.table()

Also try again in a fresh R session. After any slight memory corruption in any package, strange things can happen. Finally, do make sure to use the latest version of data.table (1.8.8) to save us time in supporting you. Only if what you said isn't true, and the key column is in fact double rather than integer, with NAs in it too, can I guess that the bug fixes in 1.8.8 would be in play.

Matthew

> I tried
>
> table[list(645)]
>
> table[.(645)]
>
> table[J(45)]
>
> they're all returning 78 records when in fact there should only be 18
> records related to key 645. However if I use table[id==645,], I get the
> right result.
>
> On Fri, Mar 8, 2013 at 12:42 AM, Michael Nelson <
> michael.nelson at sydney.edu.au> wrote:
>
>> you just need to wrap the values in `list()` or `.()
>>
>> eg
>>
>> table[list(645)]
>>
>> table[.(645)]
>> ------------------------------
>> *From:* datatable-help-bounces at lists.r-forge.r-project.org [
>> datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Yi Yuan [
>> lambandme at gmail.com]
>> *Sent:* Friday, 8 March 2013 4:25 PM
>> *To:* datatable-help at lists.r-forge.r-project.org
>> *Subject:* [datatable-help] How to do binary search on integer key?
>>
>> Hi, all:
>> I know someone has asked exactly the same question and there's even an
>> answer. But I think the answer is wrong.
Following is the url of that
>> question
>> http://r.789695.n4.nabble.com/Binary-search-with-integer-key-td3705686.html
>>
>> so if the key is integer and I would like to select all records where
>> the key=654, how do I do that?
>> suppose the data table is named table, key variable's name is id
>>
>> I know you can do it by writing: table[id==645,], but R will conduct
>> vector search this way and is a lot slower than binary search.
>>
>> So how can I do binary search on integer key??
>>
>> Thanks!!
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From statquant at outlook.com Mon Mar 11 13:40:26 2013 From: statquant at outlook.com (stat quant) Date: Mon, 11 Mar 2013 13:40:26 +0100 Subject: [datatable-help] fread suggestion Message-ID:

Hello list,

We like fread because it is very fast, yet sometimes files are huge and R cannot handle that much data; some packages work around this limitation, but they do not provide a function similar to fread. Yet sometimes only subsets of a file are really needed, subsets that could fit into RAM.

So what about adding a grep option to fread that would load only the lines that match a regular expression?

I'll add a request if you think the idea is worth implementing.

Cheers

-------------- next part -------------- An HTML attachment was scrubbed... URL: From micheledemeo at gmail.com Mon Mar 11 13:53:19 2013 From: micheledemeo at gmail.com (MICHELE DE MEO) Date: Mon, 11 Mar 2013 13:53:19 +0100 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID:

Very interesting request. I also would be interested in this possibility.
Cheers

2013/3/11 stat quant
> Hello list,
> We like FREAD because it is very fast, yet sometimes files are huge and R
> cannot handle that much data, some packages handle this limitation but they
> do not provide a similar to fread function.
> Yet sometimes only subsets of a file is really needed, subsets that could
> fit into RAM.
>
> So what about adding a grep option to fread that would allow to load only
> lines that matches a regular expression?
>
> I'll add a request if you think the idea is worth implementing.
>
> Cheers
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
***************************************************************
*Michele De Meo, Ph.D*
*Statistical and data mining solutions
http://micheledemeo.blogspot.com/
skype: demeo.michele*

-------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Mar 11 14:09:29 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 11 Mar 2013 13:09:29 +0000 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID:

Good idea statquant, please file it then. How about something more general, e.g.

fread(input, chunk.nrows=10000, chunk.filter = <expression applicable to i of DT[i]>)

That could be grep() or any expression of column names. It wouldn't be efficient to call that for every row one by one, and similarly it couldn't be called for the whole DT, since the point is that DT is greater than RAM. So some batch size needs to be defined, hence chunk.nrows=10000. That filter would then be called for each chunk and any rows passing would make it into the final table.

read.ffdf has something like this I believe, and Jens already suggested that when I ran the timings in example(fread) past him. We should probably follow his lead on that in terms of argument names etc.
Perhaps chunk should be defined in terms of RAM, e.g. chunk=100MB, since that is how it needs to be handled internally, in terms of the number of pages to map. Or maybe both, so either nrows or MB would be acceptable.

Ultimately (maybe in 5 years!) we're heading towards fread reading into on-disk tables rather than RAM. Filtering in chunks will always be a good option to have though, even then, as you might want to filter what makes it to the on-disk table.

Matthew

On 11.03.2013 12:53, MICHELE DE MEO wrote:
> Very interesting request. I also would be interested in this possibility.
> Cheers
>
> 2013/3/11 stat quant
>
>> Hello list,
>> We like FREAD because it is very fast, yet sometimes files are huge and R cannot handle that much data, some packages handle this limitation but they do not provide a similar to fread function.
>> Yet sometimes only subsets of a file is really needed, subsets that could fit into RAM.
>>
>> So what about adding a grep option to fread that would allow to load only lines that matches a regular expression?
>>
>> I'll add a request if you think the idea is worth implementing.
>>
>> Cheers
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org [1]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
>
> --
>
> _*************************************************************_
> _MICHELE DE MEO, PH.D_
> Statistical and data mining solutions
> http://micheledemeo.blogspot.com/ [4]
> skype: demeo.michele

Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:statquant at outlook.com [4] http://micheledemeo.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed...
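(In the meantime, the grep-style filtering discussed above can be approximated in plain R by reading the file in chunks and keeping only matching lines before parsing. A rough sketch added for illustration; `read_filtered_csv` and its `chunk.nrows` argument are made-up names, not part of fread:)

```r
# Hypothetical helper: read a csv keeping only lines that match `pattern`,
# scanning chunk.nrows lines at a time so the whole file never sits in RAM.
read_filtered_csv <- function(file, pattern, chunk.nrows = 10000L) {
  con <- file(file, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)          # assume the first line is a header
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk.nrows)
    if (length(lines) == 0L) break
    kept <- c(kept, grep(pattern, lines, value = TRUE))
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}
```

Only the header plus the matching lines are ever parsed, so peak memory is roughly one chunk plus whatever the filter keeps.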
URL: From statquant at outlook.com Mon Mar 11 15:12:23 2013 From: statquant at outlook.com (stat quant) Date: Mon, 11 Mar 2013 15:12:23 +0100 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID:

Filed as #2605.

About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read files larger than RAM)? Wouldn't RAM always be quicker?

I think data.table::fread is priceless because it is way faster than any other read function. I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well, why don't I do it if it is so easy...)

2013/3/11 stat quant
> On my way to fill it in.
>
> About your ultimate goal... why would you want on-disk tables rather than
> RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be
> quicker ?
>
> I think data.table::fread is priceless because it is way faster than any
> other read function.
> I just benchmarked fread reading a csv file against R loading its own
> .RData binary format, and shockingly fread is much faster!
> I think it is too bad R doesn't provide a very fast way of loading objects
> saved from a previous R session (well why don't I do it if it is so easy...)
>
> 2013/3/11 Matthew Dowle
>
>> Good idea statquant, please file it then. How about something more
>> general e.g.
>>
>> fread(input, chunk.nrows=10000, chunk.filter = <expression applicable to i of DT[i]>)
>>
>> That could be grep() or any expression of column names. It
>> wouldn't be efficient to call that for every row one by one and similarly
>> couldn't be called for the whole DT, since the point is that DT is greater
>> than RAM. So some batch size need be defined hence chunk.nrows=10000.
>> That filter would then be called for each chunk and any rows passing would
>> make it into the final table.
>> >> read.ffdf has something like this I believe, and Jens already suggested >> that when I ran the timings in example(fread) past him. We should probably >> follow his lead on that in terms of argument names etc. >> >> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB. Since >> that is how it needs to be internally, in terms of number of pages to map. >> Or maybe both as nrows or MB would be acceptable. >> >> Ultimately (maybe in 5 years!) we're heading towards fread reading into >> on-disk tables rather than RAM. Filtering in chunks will always be a good >> option to have though, even then, as you might want to filter what makes it >> to the on-disk table. >> >> Matthew >> >> >> >> On 11.03.2013 12:53, MICHELE DE MEO wrote: >> >> Very interesting request. I also would be interested in this possibility. >> Cheers >> >> >> 2013/3/11 stat quant >> >>> Hello list, >>> We like FREAD because it is very fast, yet sometimes files are huge and >>> R cannot handle that much data, some packages handle this limitation but >>> they do not provide a similar to fread function. >>> Yet sometimes only subsets of a file is really needed, subsets that >>> could fit into RAM. >>> >>> So what about adding a grep option to fread that would allow to load >>> only lines that matches a regular expression? >>> >>> I'll add a request if you think the idea is worth implementing. >>> >>> Cheers >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> -- >> *************************************************************** >> *Michele De Meo, Ph.D* >> *Statistical and data mining solutions >> http://micheledemeo.blogspot.com/ >> skype: demeo.michele* >> * >> * >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Mon Mar 11 15:51:01 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 11 Mar 2013 14:51:01 +0000 Subject: [datatable-help] fread suggestion In-Reply-To: References: Message-ID: <711bb8b59cd6c3abfa5fb135e79461b6@imap.plus.net> Exactly RAM would always be quicker. But maybe you want to read data from on-disk data.table using data.table syntax, rather than some other database or flat text file. i.e. on-disk data.table would not need to fit in RAM. Benchmark sounds intriguing. Please share if you can. compress=TRUE by default so maybe the decompress takes time, though. On 11.03.2013 14:12, stat quant wrote: > Filled as #2605 > About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? > I think data.table::fread is priceless because it is way faster than any other read function. > I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! > I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) > > 2013/3/11 stat quant > >> On my way to fill it in. >> >> About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? >> >> I think data.table::fread is priceless because it is way faster than any other read function. >> I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! >> I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) >> >> 2013/3/11 Matthew Dowle >> >>> Good idea statquant, please file it then. How about something more general e.g. 
>>> >>> fread(input, chunk.nrows=10000, chunk.filter =) >>> >>> Thatcould be grep() or any expression of column names. It wouldn't be efficient to call that for every row one by one and similarly couldn't be called for the whole DT, since the point is that DT is greater than RAM. So some batch size need be defined hence chunk.nrows=10000. That filter would then be called for each chunk and any rows passing would make it into the final table. >>> >>> read.ffdf has something like this I believe, and Jens already suggested that when I ran the timings in example(fread) past him. We should probably follow his lead on that in terms of argument names etc. >>> >>> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB. Since that is how it needs to be internally, in terms of number of pages to map. Or maybe both as nrows or MB would be acceptable. >>> >>> Ultimately (maybe in 5 years!) we're heading towards fread reading into on-disk tables rather than RAM. Filtering in chunks will always be a good option to have though, even then, as you might want to filter what makes it to the on-disk table. >>> >>> Matthew >>> >>> On 11.03.2013 12:53, MICHELE DE MEO wrote: >>> >>>> Very interesting request. I also would be interested in this possibility. >>>> Cheers >>>> >>>> 2013/3/11 stat quant >>>> >>>>> Hello list, >>>>> We like FREAD because it is very fast, yet sometimes files are huge and R cannot handle that much data, some packages handle this limitation but they do not provide a similar to fread function. >>>>> Yet sometimes only subsets of a file is really needed, subsets that could fit into RAM. >>>>> >>>>> So what about adding a grep option to fread that would allow to load only lines that matches a regular expression? >>>>> >>>>> I'll add a request if you think the idea is worth implementing. 
>>>>> >>>>> Cheers >>>>> >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>> >>>> -- >>>> >>>> _*************************************************************_ >>>> _MICHELE DE MEO, PH.D_ >>>> Statistical and data mining solutions >>>> http://micheledemeo.blogspot.com/ [4] >>>> skype: demeo.michele Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:statquant at outlook.com [4] http://micheledemeo.blogspot.com/ [5] mailto:mdowle at mdowle.plus.com [6] mailto:mail.statquant at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Mar 11 16:10:32 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 11 Mar 2013 15:10:32 +0000 Subject: [datatable-help] fread suggestion In-Reply-To: <711bb8b59cd6c3abfa5fb135e79461b6@imap.plus.net> References: <711bb8b59cd6c3abfa5fb135e79461b6@imap.plus.net> Message-ID:

Also, fread works by first memory mapping the file. The first time it does this for a particular file is therefore slower (you may have noticed the longer pause the first time before the percentage counter starts). The time to memory map is reported when verbose=TRUE (but you need the formatting fix in v1.8.9 to see that time, as the formatted number is messed up in v1.8.8). If you repeat the same fread call again it won't spend as long memory mapping since it's already mapped, depending on whether you did anything else memory intensive on this computer/server in the meantime. I don't know if base R's load() memory maps, but if it doesn't it'll need to read from disk each time. So to be strictly fair, the time to compare is a "cold" read after a reboot and the first run only of fread.
But in practice we often do tend to read the same file several times, so fread benefits from this. The OS caches the file in RAM for you, basically. It might do this anyway. It's all very OS and usage dependent! It may also depend on how your particular R environment has been compiled. I don't think a fresh R session is enough to reproduce this effect. You need a reboot as it's the OS that caches/maps the file, not R/data.table. So in short - to report the very fast time along with the time to memory map file from cold, would be the fairest and most complete way to compare. Matthew On 11.03.2013 14:51, Matthew Dowle wrote: > Exactly RAM would always be quicker. But maybe you want to read data from on-disk data.table using data.table syntax, rather than some other database or flat text file. i.e. on-disk data.table would not need to fit in RAM. > > Benchmark sounds intriguing. Please share if you can. compress=TRUE by default so maybe the decompress takes time, though. > > On 11.03.2013 14:12, stat quant wrote: > >> Filled as #2605 >> About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? >> I think data.table::fread is priceless because it is way faster than any other read function. >> I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! >> I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) >> >> 2013/3/11 stat quant >> >>> On my way to fill it in. >>> >>> About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read >RAM limit file) ? Wouldnt RAM always be quicker ? >>> >>> I think data.table::fread is priceless because it is way faster than any other read function. 
>>> I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! >>> I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well why don't I do it if it is so easy...) >>> >>> 2013/3/11 Matthew Dowle >>> >>>> Good idea statquant, please file it then. How about something more general e.g. >>>> >>>> fread(input, chunk.nrows=10000, chunk.filter =) >>>> >>>> Thatcould be grep() or any expression of column names. It wouldn't be efficient to call that for every row one by one and similarly couldn't be called for the whole DT, since the point is that DT is greater than RAM. So some batch size need be defined hence chunk.nrows=10000. That filter would then be called for each chunk and any rows passing would make it into the final table. >>>> >>>> read.ffdf has something like this I believe, and Jens already suggested that when I ran the timings in example(fread) past him. We should probably follow his lead on that in terms of argument names etc. >>>> >>>> Perhaps chunk should be defined in terms of RAM e.g. chunk=100MB. Since that is how it needs to be internally, in terms of number of pages to map. Or maybe both as nrows or MB would be acceptable. >>>> >>>> Ultimately (maybe in 5 years!) we're heading towards fread reading into on-disk tables rather than RAM. Filtering in chunks will always be a good option to have though, even then, as you might want to filter what makes it to the on-disk table. >>>> >>>> Matthew >>>> >>>> On 11.03.2013 12:53, MICHELE DE MEO wrote: >>>> >>>>> Very interesting request. I also would be interested in this possibility. >>>>> Cheers >>>>> >>>>> 2013/3/11 stat quant >>>>> >>>>>> Hello list, >>>>>> We like FREAD because it is very fast, yet sometimes files are huge and R cannot handle that much data, some packages handle this limitation but they do not provide a similar to fread function. 
>>>>>> Yet sometimes only subsets of a file is really needed, subsets that could fit into RAM.
>>>>>>
>>>>>> So what about adding a grep option to fread that would allow to load only lines that matches a regular expression?
>>>>>>
>>>>>> I'll add a request if you think the idea is worth implementing.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org [1]
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
>>>>>
>>>>> --
>>>>>
>>>>> _*************************************************************_
>>>>> _MICHELE DE MEO, PH.D_
>>>>> Statistical and data mining solutions
>>>>> http://micheledemeo.blogspot.com/ [4]
>>>>> skype: demeo.michele

Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:statquant at outlook.com [4] http://micheledemeo.blogspot.com/ [5] mailto:mdowle at mdowle.plus.com [6] mailto:mail.statquant at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Mar 19 17:51:45 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 19 Mar 2013 16:51:45 +0000 Subject: [datatable-help] See you at R/Finance 2013? Message-ID: <0303e9151b7b69d88f742ea4375e5b58@imap.plus.net>

Dear datatablers,

I'll be giving a 1-hour tutorial on the Friday and a lightning talk on the Saturday.

http://www.rinfinance.com/agenda/

Hope to see you there!

Matthew

From statquant at outlook.com Wed Mar 20 16:46:16 2013 From: statquant at outlook.com (stat quant) Date: Wed, 20 Mar 2013 16:46:16 +0100 Subject: [datatable-help] datatable-help Digest, Vol 37, Issue 13 In-Reply-To: References: Message-ID:

Too bad I can't be there, hopefully we'll have a video!
Best of luck for the presentation (but no pressure ;)) 2013/3/20 > Send datatable-help mailing list submissions to > datatable-help at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > or, via email, send a message with subject or body 'help' to > datatable-help-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > datatable-help-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of datatable-help digest..." > > > Today's Topics: > > 1. See you at R/Finance 2013? (Matthew Dowle) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 19 Mar 2013 16:51:45 +0000 > From: Matthew Dowle > To: > Subject: [datatable-help] See you at R/Finance 2013? > Message-ID: <0303e9151b7b69d88f742ea4375e5b58 at imap.plus.net> > Content-Type: text/plain; charset=UTF-8; format=flowed > > > Dear datatablers, > > I'll be giving a 1 hour tutorial on the Friday and a lightning talk on > the Saturday. > > http://www.rinfinance.com/agenda/ > > Hope to see you there! > > Matthew > > > > > ------------------------------ > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > End of datatable-help Digest, Vol 37, Issue 13 > ********************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ekbrown at ksu.edu Fri Mar 22 03:39:43 2013 From: ekbrown at ksu.edu (ekbrown) Date: Thu, 21 Mar 2013 19:39:43 -0700 (PDT) Subject: [datatable-help] Quicker w/o keys set Message-ID: <1363919983128-4662157.post@n4.nabble.com> Hello. I'm new to data.table(). 
I am apparently not setting the keys correctly to get the increase in speed talked about in the vignettes, as I get a (much) quicker time *without* keys set. Take a look at the following benchmarking tests. Any ideas? Thanks. Earl Brown > library("data.table") > library("rbenchmark") > > # generates random data > num.files <- 2000 > num.words <- 1000000 > logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T) > file.names <- rep(1:num.files, length.out=num.words) > > # defines functions > benDTNoKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + dt[,sum(V1), by = bb][,V1] + } > > benDTWithKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + setkey(dt) + dt[,sum(V1), by = bb][,V1] + } > > benTapply <- function(aa, bb) tapply(aa, bb, sum) > > # runs benchmarking > benchmark(benTapply(logical.vector, file.names), > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, > file.names), replications = 10, columns = c("test", "replications", > "elapsed")) test replications elapsed 3 benDTNoKey(logical.vector, file.names) 10 *0.753* 2 benDTWithKey(logical.vector, file.names) 10 *4.776* 1 benTapply(logical.vector, file.names) 10 6.218 > > # tests for sameness among results > one <- benTapply(logical.vector, file.names) > two <- benDTWithKey(logical.vector, file.names) > three <- benDTNoKey(logical.vector, file.names) > identical(as.integer(one), as.integer(two)) [1] TRUE > identical(as.integer(two), as.integer(three)) [1] TRUE -- View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html Sent from the datatable-help mailing list archive at Nabble.com. 
From michael.nelson at sydney.edu.au Fri Mar 22 03:43:11 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Fri, 22 Mar 2013 02:43:11 +0000 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: <1363919983128-4662157.post@n4.nabble.com> References: <1363919983128-4662157.post@n4.nabble.com> Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD62B73A09@EX-MBX-PRO-03.mcs.usyd.edu.au> Don't include the key setting within the benchmark. ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of ekbrown [ekbrown at ksu.edu] Sent: Friday, 22 March 2013 1:39 PM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] Quicker w/o keys set Hello. I'm new to data.table(). I am apparently not setting the keys correctly to get the increase in speed talked about in the vignettes, as I get a (much) quicker time *without* keys set. Take a look at the following benchmarking tests. Any ideas? Thanks. 
Earl Brown > library("data.table") > library("rbenchmark") > > # generates random data > num.files <- 2000 > num.words <- 1000000 > logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T) > file.names <- rep(1:num.files, length.out=num.words) > > # defines functions > benDTNoKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + dt[,sum(V1), by = bb][,V1] + } > > benDTWithKey <- function(aa, bb) { + dt <- data.table(as.numeric(aa), bb) + setkey(dt) + dt[,sum(V1), by = bb][,V1] + } > > benTapply <- function(aa, bb) tapply(aa, bb, sum) > > # runs benchmarking > benchmark(benTapply(logical.vector, file.names), > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, > file.names), replications = 10, columns = c("test", "replications", > "elapsed")) test replications elapsed 3 benDTNoKey(logical.vector, file.names) 10 *0.753* 2 benDTWithKey(logical.vector, file.names) 10 *4.776* 1 benTapply(logical.vector, file.names) 10 6.218 > > # tests for sameness among results > one <- benTapply(logical.vector, file.names) > two <- benDTWithKey(logical.vector, file.names) > three <- benDTNoKey(logical.vector, file.names) > identical(as.integer(one), as.integer(two)) [1] TRUE > identical(as.integer(two), as.integer(three)) [1] TRUE -- View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html Sent from the datatable-help mailing list archive at Nabble.com. 
_______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From saporta at scarletmail.rutgers.edu Fri Mar 22 05:31:01 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 22 Mar 2013 00:31:01 -0400 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: <1363919983128-4662157.post@n4.nabble.com> References: <1363919983128-4662157.post@n4.nabble.com> Message-ID: When you set the key, it sorts the table -- this is part of what allows for the speed. This initial sorting is what is slowing down your benchmarks. While it makes sense to include the initial sort time if you are trying to get a 'full' comparison, in most practical applications, you will only be setting the key once. Therefore, if you want to see what sort of speed increases you are actually getting, create your DTs first, then benchmark the specific operations of interest. Also, searching stackoverflow for [r] data.table and benchmarks will produce several useful results. Cheers Rick On Thursday, March 21, 2013, ekbrown wrote: > Hello. I'm new to data.table(). I am apparently not setting the keys > correctly to get the increase in speed talked about in the vignettes, as I > get a (much) quicker time *without* keys set. Take a look at the following > benchmarking tests. Any ideas? Thanks. 
Earl Brown > > > library("data.table") > > library("rbenchmark") > > > > # generates random data > > num.files <- 2000 > > num.words <- 1000000 > > logical.vector <- sample(c(TRUE, FALSE), num.words, replace=T) > > file.names <- rep(1:num.files, length.out=num.words) > > > > # defines functions > > benDTNoKey <- function(aa, bb) { > + dt <- data.table(as.numeric(aa), bb) > + dt[,sum(V1), by = bb][,V1] > + } > > > > benDTWithKey <- function(aa, bb) { > + dt <- data.table(as.numeric(aa), bb) > + setkey(dt) > + dt[,sum(V1), by = bb][,V1] > + } > > > > benTapply <- function(aa, bb) tapply(aa, bb, sum) > > > > # runs benchmarking > > benchmark(benTapply(logical.vector, file.names), > > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, > > file.names), replications = 10, columns = c("test", "replications", > > "elapsed")) > test replications elapsed > 3 benDTNoKey(logical.vector, file.names) 10 *0.753* > 2 benDTWithKey(logical.vector, file.names) 10 *4.776* > 1 benTapply(logical.vector, file.names) 10 6.218 > > > > # tests for sameness among results > > one <- benTapply(logical.vector, file.names) > > two <- benDTWithKey(logical.vector, file.names) > > three <- benDTNoKey(logical.vector, file.names) > > identical(as.integer(one), as.integer(two)) > [1] TRUE > > identical(as.integer(two), as.integer(three)) > [1] TRUE > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
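Rick's "sort once, look up many times" point can be sketched in base R, with `findInterval()` standing in for data.table's binary search. This is purely illustrative of the amortization argument, not of data.table's internals:

```r
# Rick's point sketched in base R: pay the sort cost once, then every
# subsequent lookup is a cheap binary search (findInterval() here is a
# stand-in for data.table's keyed lookup, not its actual implementation).
x <- sample.int(1e6)      # unsorted data: a permutation of 1..1e6
key <- sort(x)            # the one-off cost, analogous to setkey()

# Each later lookup is O(log n) via binary search, instead of the O(n)
# scan that which(x == k) would need on the unsorted vector.
idx <- findInterval(123456L, key)
stopifnot(key[idx] == 123456L)
```

Benchmarking only the lookup step, after the sort, is what shows the keyed speedup Rick describes.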
URL: From mdowle at mdowle.plus.com Fri Mar 22 12:05:18 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 22 Mar 2013 11:05:18 +0000 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: References: <1363919983128-4662157.post@n4.nabble.com> Message-ID: <2e6e5ee66ef4b0c66dfb789f33a83e67@imap.plus.net> Whilst what Rick and Michael said is very true, I suspect that you've found that setting a key on a *numeric* type column is much slower than setkey on an *integer* column. There was an awful (but correct) benchmark on S.O. recently and that's what I replied there, but I can't find it now. All I can think is that the OP deleted the question, which would be a shame. If that OP is watching, and that is what happened, please can they undelete it. Also you have a setkey(DT) there, with no columns specified. In that case, it will key all the columns; think of a key-only table. But if you have numeric value columns in there as well, or any non-key columns at all, then that will be wasteful. Anyway, in the code you posted, try changing as.numeric(aa) to as.integer(aa) and you should see setkey run dramatically faster. Then what Rick and Michael said applies from there. Matthew On 22.03.2013 04:31, Ricardo Saporta wrote: > When you set the key, it sorts the table -- this is part of what allows for the speed. > This initial sorting is what is slowing down your benchmarks. > > While it makes sense to include the initial sort time if you are trying to get a 'full' comparison, in most practical applications, you will only be setting the key once. > > Therefore, if you want to see what sort of speed increases you are actually getting, create your DTs first, then benchmark the specific operations of interest. > > Also, searching stackoverflow for [r] data.table and benchmarks will produce several useful results. > > Cheers > Rick > > On Thursday, March 21, 2013, ekbrown wrote: > >> Hello. I'm new to data.table(). 
I am apparently not setting the keys >> correctly to get the increase in speed talked about in the vignettes, as I >> get a (much) quicker time *without* keys set. Take a look at the following >> benchmarking tests. Any ideas? Thanks. Earl Brown >> >> > library("data.table") >> > library("rbenchmark") >> > >> > # generates random data >> > num.files > num.words > logical.vector > file.names > >> > # defines functions >> > benDTNoKey + dt + dt[,sum(V1), by = bb][,V1] >> + } >> > >> > benDTWithKey + dt + setkey(dt) >> + dt[,sum(V1), by = bb][,V1] >> + } >> > >> > benTapply > >> > # runs benchmarking >> > benchmark(benTapply(logical.vector, file.names), >> > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, >> > file.names), replications = 10, columns = c("test", "replications", >> > "elapsed")) >> test replications elapsed >> 3 benDTNoKey(logical.vector, file.names) 10 *0.753* >> 2 benDTWithKey(logical.vector, file.names) 10 *4.776* >> 1 benTapply(logical.vector, file.names) 10 6.218 >> > >> > # tests for sameness among results >> > one > two > three > identical(as.integer(one), as.integer(two)) >> [1] TRUE >> > identical(as.integer(two), as.integer(three)) >> [1] TRUE >> >> -- >> View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [1] >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] > > -- > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [3] Links: ------ [1] http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
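Matthew's integer-versus-numeric point can be seen outside data.table too: the values are identical either way, but the integer vector is half the size per element, and (at the time of this thread) data.table's radix-style key sort applied to integer columns only. A base-R sketch, with no timings asserted since they vary by machine:

```r
# Integer vs numeric key columns: same values, different storage cost.
# At the time of this thread, data.table's setkey used a fast radix/counting
# sort for integer columns only, which is why as.integer(aa) keys so much
# faster than as.numeric(aa). Actual timings depend on the machine.
xi <- sample.int(1e6)     # integer key column (4 bytes per element)
xd <- as.numeric(xi)      # same values stored as doubles (8 bytes per element)

object.size(xi)           # roughly half of...
object.size(xd)
stopifnot(identical(order(xi), order(xd)))  # the ordering is the same either way
```

So the change Matthew suggests costs nothing in correctness; only the sort strategy (and memory traffic) differs.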
URL: From mdowle at mdowle.plus.com Fri Mar 22 13:01:06 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 22 Mar 2013 12:01:06 +0000 Subject: [datatable-help] Quicker w/o keys set In-Reply-To: <2e6e5ee66ef4b0c66dfb789f33a83e67@imap.plus.net> References: <1363919983128-4662157.post@n4.nabble.com> <2e6e5ee66ef4b0c66dfb789f33a83e67@imap.plus.net> Message-ID: <90d30047cccb59d3a651e70402cebf65@imap.plus.net> And this nice answer by Michael might be of interest too: http://stackoverflow.com/a/13694673/403310 On 22.03.2013 11:05, Matthew Dowle wrote: > Whilst what Rick and Michael said is very true, I suspect that you've found that setting a key on a *numeric* type column is much slower than setkey on an *integer* column. There was an awful (but correct) benchmark on S.O. recently and that's what I replied there, but I can't find it now. All I can think is that the OP deleted the question, which would be a shame. If that OP is watching, and that is what happened, please can they undelete it. > > Also you have a setkey(DT) there, with no columns specified. In that case, it will key all the columns; think of a key-only table. But if you have numeric value columns in there as well, or any non-key columns at all, then that will be wasteful. > > Anyway, in the code you posted, try changing > > as.numeric(aa) > > to > > as.integer(aa) > > and you should see setkey run dramatically faster. Then what Rick and Michael said applies from there. > > Matthew > > On 22.03.2013 04:31, Ricardo Saporta wrote: > >> When you set the key, it sorts the table -- this is part of what allows for the speed. >> This initial sorting is what is slowing down your benchmarks. >> >> While it makes sense to include the initial sort time if you are trying to get a 'full' comparison, in most practical applications, you will only be setting the key once. 
>> >> Therefore, if you want to see what sort of speed increases you are actually getting, create your DT's first, then benchmark the specific operations of interest. >> >> Also, searching stackoverflow for [r] data.table and benchmarks will produce several useful results >> >> Cheers >> Rick >> >> On Thursday, March 21, 2013, ekbrown wrote: >> >>> Hello. I'm new to data.table(). I am apparently not setting the keys >>> correctly to get the increase in speed talked about in the vignettes, as I >>> get a (much) quicker time *without* keys set. Take a look at the following >>> benchmarking tests. Any ideas? Thanks. Earl Brown >>> >>> > library("data.table") >>> > library("rbenchmark") >>> > >>> > # generates random data >>> > num.files > num.words > logical.vector > file.names > >>> > # defines functions >>> > benDTNoKey + dt + dt[,sum(V1), by = bb][,V1] >>> + } >>> > >>> > benDTWithKey + dt + setkey(dt) >>> + dt[,sum(V1), by = bb][,V1] >>> + } >>> > >>> > benTapply > >>> > # runs benchmarking >>> > benchmark(benTapply(logical.vector, file.names), >>> > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector, >>> > file.names), replications = 10, columns = c("test", "replications", >>> > "elapsed")) >>> test replications elapsed >>> 3 benDTNoKey(logical.vector, file.names) 10 *0.753* >>> 2 benDTWithKey(logical.vector, file.names) 10 *4.776* >>> 1 benTapply(logical.vector, file.names) 10 6.218 >>> > >>> > # tests for sameness among results >>> > one > two > three > identical(as.integer(one), as.integer(two)) >>> [1] TRUE >>> > identical(as.integer(two), as.integer(three)) >>> [1] TRUE >>> >>> -- >>> View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [1] >>> Sent from the datatable-help mailing list archive at Nabble.com. 
>>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >> >> -- >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu [3] Links: ------ [1] http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From s_milberg at hotmail.com Fri Mar 22 23:23:06 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Fri, 22 Mar 2013 18:23:06 -0400 Subject: [datatable-help] data.table and cbind() Message-ID: I've recently discovered the dramatic performance improvements data.table provides over ddply() and merge(), and I'm looking forward to integrating it into my work. While messing around with benchmarks, I ran into an unexpected outcome with cbind(), where operations are actually much faster with data frames than data tables. Don't ask me why I'd ever do the following, but I am curious as to why it is so much slower: library(data.table); library(microbenchmark) USArrests.dt <- data.table(USArrests) lst.USArrests <- replicate(1000, USArrests, simplify=FALSE) lst.USArrests.dt <- replicate(1000, USArrests.dt, simplify=FALSE) microbenchmark(do.call(cbind, lst.USArrests), do.call(cbind, lst.USArrests.dt), times=10) Unit: milliseconds expr min lq median uq max neval do.call(cbind, lst.USArrests) 42.26891 47.70086 48.71271 49.88542 51.25453 10 do.call(cbind, lst.USArrests.dt) 750.70469 761.70511 773.91232 816.85707 880.45896 10 This is run on an Ubuntu system. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Sat Mar 23 02:39:28 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 23 Mar 2013 01:39:28 +0000 Subject: [datatable-help] data.table and cbind() In-Reply-To: References: Message-ID: Interesting. Well asked. On my netbook : > Rprof() > system.time(do.call(cbind, lst.USArrests.dt)) user system elapsed 4.008 0.000 4.012 > Rprof(NULL) > summaryRprof() $by.self self.time self.pct total.time total.pct "make.names" 1.82 44.39 1.82 44.39 "data.table" 1.74 42.44 4.00 97.56 "[[.data.frame" 0.12 2.93 0.26 6.34 "gc" 0.10 2.44 0.10 2.44 "match" 0.08 1.95 0.10 2.44 "length" 0.06 1.46 0.06 1.46 "[[" 0.04 0.98 0.30 7.32 "%in%" 0.04 0.98 0.14 3.41 "NROW" 0.02 0.49 0.12 2.93 "is.data.frame" 0.02 0.49 0.02 0.49 "names" 0.02 0.49 0.02 0.49 "paste" 0.02 0.49 0.02 0.49 "sys.call" 0.02 0.49 0.02 0.49 So almost half of it is in make.names() [notice that cbind.data.frame calls data.frame with check.names=FALSE] and the other half in data.table() but not sure exactly where. So we can do better, or maybe we need a cbindlist (analogous to the existing rbindlist). But as you allude, we've spent most effort on := and set() to add columns by reference rather than copying using cbind(). I've added a feature request to tackle this anyway. Thanks for highlighting, great test. https://r-forge.r-project.org/tracker/?group_id=240&atid=978&func=detail&aid=2636 Matthew On 22.03.2013 22:23, Sadao Milberg wrote: > I've recently discovered the dramatic performance improvements data.table provides over ddply() and merge(), and I'm looking forward to integrating it into my work. While messing around with benchmarks, I ran into an unexpected outcome with cbind(), where operations are actually much faster with data frames than data tables. 
Don't ask me why I'd ever do the following, but I am curious as to why it is so much slower: > > USArrests.dt <- data.table(USArrests) > lst.USArrests <- replicate(1000, USArrests, simplify=FALSE) > lst.USArrests.dt <- replicate(1000, USArrests.dt, simplify=FALSE) > > microbenchmark(do.call(cbind, lst.USArrests), > do.call(cbind, lst.USArrests.dt), > times=10) > > Unit: milliseconds > expr min lq median uq max neval > do.call(cbind, lst.USArrests) 42.26891 47.70086 48.71271 49.88542 51.25453 10 > do.call(cbind, lst.USArrests.dt) 750.70469 761.70511 773.91232 816.85707 880.45896 10 > > This is run on an Ubuntu system. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gaizoule at gmail.com Sat Mar 23 08:06:33 2013 From: gaizoule at gmail.com (gaizoule) Date: Sat, 23 Mar 2013 00:06:33 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. Message-ID: <1364022393122-4662281.post@n4.nabble.com> Hi, everyone, data.table is really a fantastic package; I have become accustomed to using it and it has saved me a lot of time. In my daily work, I need to analyze lots of tick data, and IDateTime is very useful for me. However, the ITime class cannot handle milliseconds. I suggest using the number of milliseconds to represent the intraday time; for example, for the time "11:00:00.000", use the integer 11 * 60 * 60 * 1000 to represent it. I have used kdb+/q, and kdb+/q handles time exactly that way. best regards, gaizoule -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281.html Sent from the datatable-help mailing list archive at Nabble.com. From statquant at outlook.com Sun Mar 24 10:38:39 2013 From: statquant at outlook.com (statquant3) Date: Sun, 24 Mar 2013 02:38:39 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. 
In-Reply-To: <1364022393122-4662281.post@n4.nabble.com> References: <1364022393122-4662281.post@n4.nabble.com> Message-ID: <1364117919491-4662320.post@n4.nabble.com> I wrote almost the same message a few months ago (so Matthew knows that I am not duplicating ids to trick him into implementing this :)) More seriously, I recently discovered that R itself handles datetimes very poorly. It has the nice POSIXlt, which stores date and time as a list (on 40 bytes, which is why data.table does not handle it): R> lt = as.POSIXlt("2011-01-01 12:32.234354") R> attributes(lt) $names [1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday" "isdst" $class [1] "POSIXlt" "POSIXt" It has POSIXct, which stores the datetime as a double, but very often displays the datetime wrongly. See my SO post http://stackoverflow.com/questions/15383057/accurately-converting-from-character-posixct-character-with-sub-millisecond-da and the one it links to http://stackoverflow.com/questions/7726034/how-r-formats-posixct-with-fractional-seconds Dirk Edd. wrote something (then deleted it) stating that Windows could not handle more than milli-second datetimes and Linux "almost" micros. I never understood this... He is developing RcppBDT, which exposes the Boost Date_Time classes to R, which I would like to try, but it is not mature enough (according to him). So at the moment datetimes are really a nasty thing that is not handled as accurately as it should be -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662320.html Sent from the datatable-help mailing list archive at Nabble.com. From gaizoule at gmail.com Sun Mar 24 12:31:26 2013 From: gaizoule at gmail.com (gaizoule) Date: Sun, 24 Mar 2013 04:31:26 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. 
In-Reply-To: <1364117919491-4662320.post@n4.nabble.com> References: <1364022393122-4662281.post@n4.nabble.com> <1364117919491-4662320.post@n4.nabble.com> Message-ID: <1364124686513-4662322.post@n4.nabble.com> I've met the same problem, which is caused by POSIXct. I think POSIXlt's storage wastes a lot of space, and R should support intraday time handling. Thank you for your useful comments; I am reading the StackOverflow posts. As for "Windows could not handle more than milli-second datetimes and Linux "almost" micros", this is decided by the OS, not by anything else. And so, I think milliseconds are enough for me. -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662322.html Sent from the datatable-help mailing list archive at Nabble.com. From jholtman at gmail.com Sun Mar 24 15:12:30 2013 From: jholtman at gmail.com (Jim Holtman) Date: Sun, 24 Mar 2013 10:12:30 -0400 Subject: [datatable-help] Suggestion on ITime class implementing. Message-ID: One thing to remember about POSIXct is that with floating point you only have about 15 digits of accuracy. With 1970 as the base, roughly 10 of those digits are used for the whole seconds, leaving only 5 or so for the subseconds, so milliseconds are safe but much finer resolution is not. Sent from my Verizon Wireless 4G LTE Smartphone -------- Original message -------- From: gaizoule Date: 03/24/2013 07:31 (GMT-05:00) To: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] Suggestion on ITime class implementing. I've met the same problem, which is caused by POSIXct. I think POSIXlt's storage wastes a lot of space, and R should support intraday time handling. Thank you for your useful comments; I am reading the StackOverflow posts. As for "Windows could not handle more than milli-second datetimes and Linux "almost" micros", this is decided by the OS, not by anything else. And so, I think milliseconds are enough for me. 
-- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662322.html Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Mon Mar 25 09:52:15 2013 From: statquant at outlook.com (stat quant) Date: Mon, 25 Mar 2013 09:52:15 +0100 Subject: [datatable-help] Fwd: Is this worth a feature request? In-Reply-To: References: Message-ID: Hello data.tablers, I am aware of binary search using J in data.table selects; this works for "AND" if your table is keyed by 2 columns, like setkey(DT,x,y) DT[J('A',23),] <=> DT[x=='A' & y==23] #but binary search is much faster for big/large tables But does it work with "OR"? There is a post on SO along those lines http://stackoverflow.com/questions/15597971/can-we-do-binary-search-in-data-table-with-or-select-queries What about a feature request? Cheers Colin -------------- next part -------------- An HTML attachment was scrubbed... URL: From gaizoule at gmail.com Tue Mar 26 02:54:34 2013 From: gaizoule at gmail.com (gaizoule) Date: Mon, 25 Mar 2013 18:54:34 -0700 (PDT) Subject: [datatable-help] Suggestion on ITime class implementing. In-Reply-To: References: <1364022393122-4662281.post@n4.nabble.com> Message-ID: <1364262874029-4662450.post@n4.nabble.com> Thank you for your insightful comments; they got me to the essence of the problem. -- View this message in context: http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281p4662450.html Sent from the datatable-help mailing list archive at Nabble.com. 
From mdowle at mdowle.plus.com Tue Mar 26 11:58:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 26 Mar 2013 10:58:09 +0000 Subject: [datatable-help] Suggestion on ITime class implementing. In-Reply-To: <1364022393122-4662281.post@n4.nabble.com> References: <1364022393122-4662281.post@n4.nabble.com> Message-ID: <2ddc85c35b817f1aaeeca5a9dbb0f0f3@imap.plus.net> Hi, An alternative to POSIXct is integer time: 12:34:56.789 => 123456789L, which I do quite a bit. And integer dates: 26 Mar 2013 => 20130326L. You can get quite far with two integer columns: date and time. Quite often I don't use any DateTime class at all. Each column is 4 bytes, and `roll=TRUE` then only rolls within the same day, which is what I usually want. But, yes, ITime should be in milliseconds. I couldn't find this on the tracker so have now filed it here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2644&group_id=240&atid=978 If any links to posts or S.O. questions are not reachable from there, please add. For micro (and nanosecond, why not) then perhaps we could use integer64 to avoid any floating point issues. 24*60*60*1e9 * 365*100 == 3.15e18 which fits in 2^63 (9.2e18), if I've got the arithmetic right. The nano timestamp could be +/- 292 years of precise nanoseconds around the epoch. And/or, for time only with no date, it could go to picoseconds: 24*60*60*1e12 = 8.6e16 < 2^63 All that would be required is availability of integer64, which is pretty standard (even on 32bit machines). Matthew On 23.03.2013 07:06, gaizoule wrote: > Hi, everyone, > data.table is really a fantastic package; I have become accustomed > to using > it and it has saved me a lot of time. > In my daily work, I need to analyze lots of tick data, and the > IDateTime is > very useful for me. However, the ITime class cannot handle milliseconds. 
> I > suggest using the number of milliseconds to represent the intraday > time; > for example, for the time "11:00:00.000", use the integer 11 * 60 * 60 * > 1000 > to represent it. I have used kdb+/q, and kdb+/q handles time exactly by > that > way. > > best regards, > > gaizoule > > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/Suggestion-on-ITime-class-implementing-tp4662281.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Tue Mar 26 12:11:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 26 Mar 2013 11:11:52 +0000 Subject: [datatable-help] Fwd: Is this worth a feature request? In-Reply-To: References: Message-ID: Hi, Yes, please file it. Auto converting x=='A' & y==23 to the relevant join syntax internally might be possible in the distant future as well (declarative i rather than imperative i). And might be needed sooner rather than later depending on how we implement the syntax for joining using secondary keys (creating set2key is the easy part). Matthew On 25.03.2013 08:52, stat quant wrote: > Hello data.tablers, > I am aware of binary search using J in data.table selects; this works for "AND" if your table is keyed by 2 columns, like > > setkey(DT,x,y) > DT[J('A',23),] <=> DT[x=='A' & y==23] #but binary search is much faster for big/large tables > > But does it work with "OR"? > There is a post on SO along those lines http://stackoverflow.com/questions/15597971/can-we-do-binary-search-in-data-table-with-or-select-queries [1] > What about a feature request? 
> > Cheers > Colin Links: ------ [1] http://stackoverflow.com/questions/15597971/can-we-do-binary-search-in-data-table-with-or-select-queries -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Wed Mar 27 20:45:05 2013 From: timothee.carayol at gmail.com (Timothée Carayol) Date: Wed, 27 Mar 2013 19:45:05 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? Message-ID: Hi, I have an example of a string of 4097 characters which can't be parsed by fread; however, if I remove any character, it can be parsed just fine. Is that a known limitation? (If I write the string to a file and then fread the file name, it works too.) Let me know if you need the string and/or a bug report. Thanks Timothée -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Mar 27 22:23:41 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 27 Mar 2013 21:23:41 -0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: References: Message-ID: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Hi, Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that the R limit for a character string length? What happens at 4097? Matthew > Hi, > > I have an example of a string of 4097 characters which can't be parsed by > fread; however, if I remove any character, it can be parsed just fine. Is > that a known limitation? > > (If I write the string to a file and then fread the file name, it works > too.) > > Let me know if you need the string and/or a bug report. 
> > Thanks > Timothée > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mhwaliji at google.com Wed Mar 27 22:51:46 2013 From: mhwaliji at google.com (Muhammad Waliji) Date: Wed, 27 Mar 2013 14:51:46 -0700 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Message-ID: R is happy with strings of length 4097: > paste(rep("a", 4097), collapse="") On Wed, Mar 27, 2013 at 2:23 PM, Matthew Dowle wrote: > Hi, > Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that > the R limit for a character string length? What happens at 4097? > Matthew > > > Hi, > > > > I have an example of a string of 4097 characters which can't be parsed by > > fread; however, if I remove any character, it can be parsed just fine. Is > > that a known limitation? > > > > (If I write the string to a file and then fread the file name, it works > > too.) > > > > Let me know if you need the string and/or a bug report. > > > > Thanks > > Timothée > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From timothee.carayol at gmail.com Wed Mar 27 23:49:32 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Wed, 27 Mar 2013 22:49:32 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Message-ID: Agree with Muhammad, longer character strings are definitely permitted in R. A minimal example that shows something strange happening with fread: for (n in c(1023:1025, 10000)) { A <- fread( paste( rep('a\tb\n', n), collapse='' ), sep='\t' ) print(nrow(A)) } On my computer, I obtain: [1] 1022 [1] 1023 [1] 1023 [1] 1023 Hope this helps Timothée On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: > Hi, > Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that > the R limit for a character string length? What happens at 4097? > Matthew > > > Hi, > > > > I have an example of a string of 4097 characters which can't be parsed by > > fread; however, if I remove any character, it can be parsed just fine. Is > > that a known limitation? > > > > (If I write the string to a file and then fread the file name, it works > > too.) > > > > Let me know if you need the string and/or a bug report. > > > > Thanks > > Timothée > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed...
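[Editor's aside: the original report notes that routing the same string through a file works. The sketch below is an untested illustration of that workaround, not part of the original thread; the variable names (`txt`, `tmp`) are hypothetical, and only documented base R functions plus `fread(filename)` are assumed.]

```r
library(data.table)

## Build a string longer than the reported ~4096-character threshold.
txt <- paste(rep("a\tb\n", 1025), collapse = "")

## Instead of fread(txt), write the string to a temporary file
## verbatim and fread the file name, which the report says parses fine.
tmp <- tempfile(fileext = ".tsv")
cat(txt, file = tmp)
A <- fread(tmp, sep = "\t")
unlink(tmp)
```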
URL: From mdowle at mdowle.plus.com Thu Mar 28 15:31:56 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 28 Mar 2013 14:31:56 +0000 Subject: [datatable-help] =?utf-8?q?fread=28character_string=29_limited_to?= =?utf-8?q?_strings_less_than_4096_long=3F?= In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> Message-ID: <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Interesting, what's your sessionInfo() please? For me it seems to work ok : [1] 1022 [1] 1023 [1] 1024 [1] 9999 > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) On 27.03.2013 22:49, Timothée Carayol wrote: > Agree with Muhammad, longer character strings are definitely permitted in R. > A minimal example that shows something strange happening with fread: > > for (n in c(1023:1025, 10000)) { > A <- fread( > paste( > rep('a\tb\n', n), > collapse='' > ), > sep='\t' > ) > print(nrow(A)) > } > On my computer, I obtain: > > [1] 1022 > [1] 1023 > [1] 1023 > [1] 1023 > Hope this helps > Timothée > > On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: > >> Hi, >> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >> the R limit for a character string length? What happens at 4097?
>> > >> > Thanks >> > Timothée > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org [1] >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Thu Mar 28 15:38:37 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 28 Mar 2013 14:38:37 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <2c2af8789733127541fe78c1ccde5412@imap.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: Curiouser and curiouser.. I can reproduce on two computers with different versions of R and of data.table.
Computer 1 (it says unknown-linux but is actually ubuntu): R version 2.15.3 (2013-03-01) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 Computer 2: R version 2.15.2 (2012-10-26) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.8 loaded via a namespace (and not attached): [1] tools_2.15.2 On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: > Interesting, what's your sessionInfo() please? > > For me it seems to work ok : > > [1] 1022 > [1] 1023 > [1] 1024 > [1] 9999 > > > sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > > > On 27.03.2013 22:49, Timothée Carayol wrote: > > Agree with Muhammad, longer character strings are definitely permitted > in R. > A minimal example that shows something strange happening with fread: > for (n in c(1023:1025, 10000)) { > A <- fread( > paste( > rep('a\tb\n', n), > collapse='' > ), > sep='\t' > ) > print(nrow(A)) > } > On my computer, I obtain: > [1] 1022 > [1] 1023 > [1] 1023 > [1] 1023 > Hope this helps > Timothée > > > On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: > >> Hi, >> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >> the R limit for a character string length? What happens at 4097?
>> Matthew >> >> > Hi, >> > >> > I have an example of a string of 4097 characters which can't be parsed >> by >> > fread; however, if I remove any character, it can be parsed just fine. >> Is >> > that a known limitation? >> > >> > (If I write the string to a file and then fread the file name, it works >> > too.) >> > >> > Let me know if you need the string and/or a bug report. >> > >> > Thanks >> > Timothée >> > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 28 15:55:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 28 Mar 2013 14:55:17 +0000 Subject: [datatable-help] =?utf-8?q?fread=28character_string=29_limited_to?= =?utf-8?q?_strings_less_than_4096_long=3F?= In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: Hm this is odd. Could you run the following and paste back the (verbose) results please. for (n in c(1023:1025, 10000)) { input = paste( rep('a\tb\n', n), collapse='') A = fread(input,verbose=TRUE) cat(nchar(input), nrow(A), "\n") } On 28.03.2013 14:38, Timothée Carayol wrote: > Curiouser and curiouser.. > > I can reproduce on two computers with different versions of R and of data.table.
> > Computer 1 (it says unknown-linux but is actually ubuntu): > > R version 2.15.3 (2013-03-01) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 > LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 > Computer 2: > > R version 2.15.2 (2012-10-26) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.15.2 > > On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: > >> Interesting, what's your sessionInfo() please? >> >> For me it seems to work ok : >> >> [1] 1022 >> [1] 1023 >> [1] 1024 >> [1] 9999 >> >>> sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> On 27.03.2013 22:49, Timoth?e Carayol wrote: >> >>> Agree with Muhammad, longer character strings are definitely permitted in R. 
>>> A minimal example that shows something strange happening with fread: >>> >>> for (n in c(1023:1025, 10000)) { >>> A <- fread( >>> paste( >>> rep('a\tb\n', n), >>> collapse='' >>> ), >>> sep='\t' >>> ) >>> print(nrow(A)) >>> } >>> On my computer, I obtain: >>> >>> [1] 1022 >>> [1] 1023 >>> [1] 1023 >>> [1] 1023 >>> Hope this helps >>> Timothée >>> >>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>> >>>> Hi, >>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >>>> the R limit for a character string length? What happens at 4097? >>>> Matthew >>>> >>>> > Hi, >>>> > >>>> > I have an example of a string of 4097 characters which can't be parsed by >>>> > fread; however, if I remove any character, it can be parsed just fine. Is >>>> > that a known limitation? >>>> > >>>> > (If I write the string to a file and then fread the file name, it works >>>> > too.) >>>> > >>>> > Let me know if you need the string and/or a bug report. >>>> > >>>> > Thanks >>>> > Timothée > _______________________________________________ >>>> > datatable-help mailing list >>>> > datatable-help at lists.r-forge.r-project.org [1] >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Thu Mar 28 15:58:37 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 28 Mar 2013 14:58:37 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long?
In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 1023 Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 4092 1022 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 1023 Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 4096 1023 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 1023 Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 4100 1023 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. 
Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 1023 Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s (-nan%) Memory map (rerun may be quicker) 0.000s (-nan%) sep and header detection 0.000s (-nan%) Count rows (wc -l) 0.000s (-nan%) Column type detection (first, middle and last 5 rows) 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM 0.000s (-nan%) Reading data 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered 0.000s (-nan%) Coercing data already read in type bumps (if any) 0.000s (-nan%) Changing na.strings to NA 0.000s Total 40000 1023 On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: > ** > > > > Hm this is odd. > > Could you run the following and paste back the (verbose) results please. > > for (n in c(1023:1025, 10000)) { > input = paste( rep('a\tb\n', n), collapse='') > A = fread(input,verbose=TRUE) > cat(nchar(input), nrow(A), "\n") > } > > > > > > On 28.03.2013 14:38, Timoth?e Carayol wrote: > > Curiouser and curiouser.. > > I can reproduce on two computers with different versions of R and of > data.table. 
> > > > Computer 1 (it says unknown-linux but is actually ubuntu): > > R version 2.15.3 (2013-03-01) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > LC_MONETARY=en_GB.UTF-8 > LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C > LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 > LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 > > Computer 2: > R version 2.15.2 (2012-10-26) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.8 > > loaded via a namespace (and not attached): > [1] tools_2.15.2 > > > On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: > >> >> >> Interesting, what's your sessionInfo() please? >> >> For me it seems to work ok : >> >> [1] 1022 >> [1] 1023 >> [1] 1024 >> [1] 9999 >> >> > sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> >> >> On 27.03.2013 22:49, Timoth?e Carayol wrote: >> >> Agree with Muhammad, longer character strings are definitely permitted >> in R. 
>> A minimal example that shows something strange happening with fread: >> for (n in c(1023:1025, 10000)) { >> A <- fread( >> paste( >> rep('a\tb\n', n), >> collapse='' >> ), >> sep='\t' >> ) >> print(nrow(A)) >> } >> On my computer, I obtain: >> [1] 1022 >> [1] 1023 >> [1] 1023 >> [1] 1023 >> Hope this helps >> Timothée >> >> >> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >> >>> Hi, >>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>> that >>> the R limit for a character string length? What happens at 4097? >>> Matthew >>> >>> > Hi, >>> > >>> > I have an example of a string of 4097 characters which can't be parsed >>> by >>> > fread; however, if I remove any character, it can be parsed just fine. >>> Is >>> > that a known limitation? >>> > >>> > (If I write the string to a file and then fread the file name, it works >>> > too.) >>> > >>> > Let me know if you need the string and/or a bug report. >>> > >>> > Thanks >>> > Timothée >>> > _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Mar 28 16:19:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 28 Mar 2013 15:19:52 +0000 Subject: [datatable-help] =?utf-8?q?fread=28character_string=29_limited_to?= =?utf-8?q?_strings_less_than_4096_long=3F?= In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> Message-ID: <230b0040889556349b21822824a5fb7e@imap.plus.net> Hi, Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with v1.8.9 should have the -nan% problem fixed. I'm a bit stumped for the moment. I've filed a bug report.
Probably, if I still can't reproduce at my end, I'll add some more detailed tracing to verbose output and ask you to try again next week if that's ok. Thanks for reporting! Matthew On 28.03.2013 14:58, Timothée Carayol wrote: > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 1023 > Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4092 1022 > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4096 1023 > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4100 1023 > Input contains a \n (or is ""), taking this to be text input (not a filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 40000 1023 > > On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: > >> Hm this is odd.
>> >> Could you run the following and paste back the (verbose) results please. >> for (n in c(1023:1025, 10000)) { >> >> input = paste( rep('a\tb\n', n), collapse='') >> A = fread(input,verbose=TRUE) >> cat(nchar(input), nrow(A), "\n") >> } >> >> On 28.03.2013 14:38, Timothée Carayol wrote: >> >>> Curiouser and curiouser.. >>> >>> I can reproduce on two computers with different versions of R and of data.table. >>> >>> Computer 1 (it says unknown-linux but is actually ubuntu): >>> >>> R version 2.15.3 (2013-03-01) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 >>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >>> Computer 2: >>> >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-redhat-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] data.table_1.8.8 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_2.15.2 >>> >>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >>> >>>> Interesting, what's your sessionInfo() please?
>>>> >>>> For me it seems to work ok : >>>> >>>> [1] 1022 >>>> [1] 1023 >>>> [1] 1024 >>>> [1] 9999 >>>> >>>>> sessionInfo() >>>> R version 2.15.2 (2012-10-26) >>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>> >>>> On 27.03.2013 22:49, Timothée Carayol wrote: >>>> >>>>> Agree with Muhammad, longer character strings are definitely permitted in R. >>>>> A minimal example that shows something strange happening with fread: >>>>> >>>>> for (n in c(1023:1025, 10000)) { >>>>> A <- fread( >>>>> paste( >>>>> rep('a\tb\n', n), >>>>> collapse='' >>>>> ), >>>>> sep='\t' >>>>> ) >>>>> print(nrow(A)) >>>>> } >>>>> On my computer, I obtain: >>>>> >>>>> [1] 1022 >>>>> [1] 1023 >>>>> [1] 1023 >>>>> [1] 1023 >>>>> Hope this helps >>>>> Timothée >>>>> >>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>>>> >>>>>> Hi, >>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >>>>>> the R limit for a character string length? What happens at 4097? >>>>>> Matthew >>>>>> >>>>>> > Hi, >>>>>> > >>>>>> > I have an example of a string of 4097 characters which can't be parsed by >>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is >>>>>> > that a known limitation? >>>>>> > >>>>>> > (If I write the string to a file and then fread the file name, it works >>>>>> > too.) >>>>>> > >>>>>> > Let me know if you need the string and/or a bug report.
>>>>>> > >>>>>> > Thanks >>>>>> > Timoth?e > _______________________________________________ >>>>>> > datatable-help mailing list >>>>>> > datatable-help at lists.r-forge.r-project.org [1] >>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com [5] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From gsee000 at gmail.com Thu Mar 28 16:23:34 2013 From: gsee000 at gmail.com (G See) Date: Thu, 28 Mar 2013 10:23:34 -0500 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <230b0040889556349b21822824a5fb7e@imap.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> <230b0040889556349b21822824a5fb7e@imap.plus.net> Message-ID: FWIW, on mac: > for (n in c(1023:1025, 10000)) { + A <- fread( + paste( + rep('a\tb\n', n), + collapse='' + ), + sep='\t' + ) + print(nrow(A)) + } [1] 255 [1] 255 [1] 255 [1] 255 > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.9 ####### and with verbose > for (n in c(1023:1025, 10000)) { + input = paste( rep('a\tb\n', n), collapse='') + A = fread(input,verbose=TRUE) + cat(nchar(input), nrow(A), "\n") + } Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... 
'\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 14%) Memory map (rerun may be quicker) 0.000s ( 25%) sep and header detection 0.000s ( 8%) Count rows (wc -l) 0.000s ( 24%) Column type detection (first, middle and last 5 rows) 0.000s ( 6%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 22%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 1%) Changing na.strings to NA 0.000s Total 4092 255 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 10%) Memory map (rerun may be quicker) 0.000s ( 21%) sep and header detection 0.000s ( 10%) Count rows (wc -l) 0.000s ( 28%) Column type detection (first, middle and last 5 rows) 0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 26%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 2%) Changing na.strings to NA 0.000s Total 4096 255 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... '\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 10%) Memory map (rerun may be quicker) 0.000s ( 21%) sep and header detection 0.000s ( 10%) Count rows (wc -l) 0.000s ( 27%) Column type detection (first, middle and last 5 rows) 0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 27%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 1%) Changing na.strings to NA 0.000s Total 4100 255 Input contains a \n (or is ""), taking this to be text input (not a filename) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... 
'\t' Found 2 columns First row with 2 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 255 Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) 0.000s ( 10%) Memory map (rerun may be quicker) 0.000s ( 23%) sep and header detection 0.000s ( 10%) Count rows (wc -l) 0.000s ( 25%) Column type detection (first, middle and last 5 rows) 0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM 0.000s ( 26%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 3%) Changing na.strings to NA 0.000s Total 40000 255 Best, Garrett On Thu, Mar 28, 2013 at 10:19 AM, Matthew Dowle wrote: > > > Hi, > > Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with v1.8.9 > should have the -nan% problem fixed. > > I'm a bit stumped for the moment. I've filed a bug report. Probably, if I > still can't reproduce my end, I'll add some more detailed tracing to verbose > output and ask you to try again next week if that's ok. > > Thanks for reporting! > > Matthew > > > > On 28.03.2013 14:58, Timoth?e Carayol wrote: > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 1023 > Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4092 1022 > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4096 1023 > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 4100 1023 > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 1023 > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > 0.000s (-nan%) Memory map (rerun may be quicker) > 0.000s (-nan%) sep and header detection > 0.000s (-nan%) Count rows (wc -l) > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > 0.000s (-nan%) Reading data > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > 0.000s (-nan%) Changing na.strings to NA > 0.000s Total > 40000 1023 > > > On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle > wrote: >> >> >> >> Hm this is odd. 
>> >> Could you run the following and paste back the (verbose) results please. >> >> for (n in c(1023:1025, 10000)) { >> >> input = paste( rep('a\tb\n', n), collapse='') >> A = fread(input,verbose=TRUE) >> cat(nchar(input), nrow(A), "\n") >> } >> >> >> >> >> >> On 28.03.2013 14:38, Timoth?e Carayol wrote: >> >> Curiouser and curiouser.. >> >> I can reproduce on two computers with different versions of R and of >> data.table. >> >> >> >> Computer 1 (it says unknown-linux but is actually ubuntu): >> >> R version 2.15.3 (2013-03-01) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> LC_MONETARY=en_GB.UTF-8 >> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C >> LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >> Computer 2: >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] data.table_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.15.2 >> >> >> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle >> wrote: >>> >>> >>> >>> Interesting, what's your sessionInfo() please? 
>>> >>> For me it seems to work ok : >>> >>> [1] 1022 >>> [1] 1023 >>> [1] 1024 >>> [1] 9999 >>> >>> > sessionInfo() >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> >>> >>> On 27.03.2013 22:49, Timothée Carayol wrote: >>> >>> Agree with Muhammad, longer character strings are definitely permitted in >>> R. >>> A minimal example that shows something strange happening with fread: >>> for (n in c(1023:1025, 10000)) { >>> A <- fread( >>> paste( >>> rep('a\tb\n', n), >>> collapse='' >>> ), >>> sep='\t' >>> ) >>> print(nrow(A)) >>> } >>> On my computer, I obtain: >>> [1] 1022 >>> [1] 1023 >>> [1] 1023 >>> [1] 1023 >>> Hope this helps >>> Timothée >>> >>> >>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle >>> wrote: >>>> >>>> Hi, >>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>>> that >>>> the R limit for a character string length? What happens at 4097? >>>> Matthew >>>> >>>> > Hi, >>>> > >>>> > I have an example of a string of 4097 characters which can't be parsed >>>> > by >>>> > fread; however, if I remove any character, it can be parsed just fine. >>>> > Is >>>> > that a known limitation? >>>> > >>>> > (If I write the string to a file and then fread the file name, it >>>> > works >>>> > too.) >>>> > >>>> > Let me know if you need the string and/or a bug report. 
>>>> > >>>> > Thanks >>>> > Timoth?e >>>> > _______________________________________________ >>>> > datatable-help mailing list >>>> > datatable-help at lists.r-forge.r-project.org >>>> > >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>> >>> >>> >> >> >> >> > > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From timothee.carayol at gmail.com Thu Mar 28 16:26:38 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 28 Mar 2013 15:26:38 +0000 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: <230b0040889556349b21822824a5fb7e@imap.plus.net> References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> <230b0040889556349b21822824a5fb7e@imap.plus.net> Message-ID: Of course, I'll be happy to help! By the way the verbose output was actually from computer 1 (with 1.8.9) so it seems like the -nan% problem is maybe still there? Cheers Timoth?e On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle wrote: > ** > > > > Hi, > > Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with > v1.8.9 should have the -nan% problem fixed. > > I'm a bit stumped for the moment. I've filed a bug report. Probably, if > I still can't reproduce my end, I'll add some more detailed tracing to > verbose output and ask you to try again next week if that's ok. > > Thanks for reporting! > > Matthew > > > > On 28.03.2013 14:58, Timoth?e Carayol wrote: > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... 
> '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. > Count of eol after first data row: 1023 > > Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 4092 1022 > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 1023 > > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 4096 1023 > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 1023 > > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 4100 1023 > > Input contains a \n (or is ""), taking this to be text input (not a > filename) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > > Using line 30 to detect sep (the last non blank line in the first 30) ... > '\t' > Found 2 columns > > First row with 2 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 1023 > > Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data > rows > Type codes: 33 (first 5 rows) > > Type codes: 33 (+middle 5 rows) > > Type codes: 33 (+last 5 rows) > > 0.000s (-nan%) Memory map (rerun may be quicker) > > 0.000s (-nan%) sep and header detection > > 0.000s (-nan%) Count rows (wc -l) > > 0.000s (-nan%) Column type detection (first, middle and last 5 rows) > > 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM > > 0.000s (-nan%) Reading data > > 0.000s (-nan%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s (-nan%) Coercing data already read in type bumps (if any) > > 0.000s (-nan%) Changing na.strings to NA > > 0.000s Total > > 40000 1023 > > > > On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: > >> >> >> Hm this is odd. >> >> Could you run the following and paste back the (verbose) results please. >> for (n in c(1023:1025, 10000)) { >> >> input = paste( rep('a\tb\n', n), collapse='') >> A = fread(input,verbose=TRUE) >> cat(nchar(input), nrow(A), "\n") >> } >> >> >> >> >> >> On 28.03.2013 14:38, Timoth?e Carayol wrote: >> >> Curiouser and curiouser.. >> >> I can reproduce on two computers with different versions of R and of >> data.table. 
>> >> >> >> Computer 1 (it says unknown-linux but is actually ubuntu): >> >> R version 2.15.3 (2013-03-01) >> >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> LC_MONETARY=en_GB.UTF-8 >> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C >> LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 >> LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >> >> Computer 2: >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] data.table_1.8.8 >> >> loaded via a namespace (and not attached): >> [1] tools_2.15.2 >> >> >> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >> >>> >>> >>> Interesting, what's your sessionInfo() please? >>> >>> For me it seems to work ok : >>> >>> [1] 1022 >>> [1] 1023 >>> [1] 1024 >>> [1] 9999 >>> >>> > sessionInfo() >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> >>> >>> On 27.03.2013 22:49, Timoth?e Carayol wrote: >>> >>> Agree with Muhammad, longer character strings are definitely permitted >>> in R. 
>>> A minimal example that shows something strange happening with fread: >>> for (n in c(1023:1025, 10000)) { >>> A <- fread( >>> paste( >>> rep('a\tb\n', n), >>> collapse='' >>> ), >>> sep='\t' >>> ) >>> print(nrow(A)) >>> } >>> On my computer, I obtain: >>> [1] 1022 >>> [1] 1023 >>> [1] 1023 >>> [1] 1023 >>> Hope this helps >>> Timothée >>> >>> >>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>> >>>> Hi, >>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>>> that >>>> the R limit for a character string length? What happens at 4097? >>>> Matthew >>>> >>>> > Hi, >>>> > >>>> > I have an example of a string of 4097 characters which can't be >>>> parsed by >>>> > fread; however, if I remove any character, it can be parsed just >>>> fine. Is >>>> > that a known limitation? >>>> > >>>> > (If I write the string to a file and then fread the file name, it >>>> works >>>> > too.) >>>> > >>>> > Let me know if you need the string and/or a bug report. >>>> > >>>> > Thanks >>>> > Timothée >>>> > _______________________________________________ >>>> > datatable-help mailing list >>>> > datatable-help at lists.r-forge.r-project.org >>>> > >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>> >>> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Mar 28 17:52:57 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 28 Mar 2013 12:52:57 -0400 Subject: [datatable-help] rbindlist on list of data.frames with factor column Message-ID: Hello, I found that when using `rbindlist` on a list of data.frames with factor columns, the factor column is getting concat'd as its numeric equivalent. This of course, does not happen when using a list of data.tables. 
# sample data, using data.frame sampleList.DF <- lapply(LETTERS[1:5], function(L) data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) sampleList.DF <- lapply(sampleList.DF, function(x) {x$StringCol <- as.character(x$FactorCol); x}) # sample data, using data.table sampleList.DT <- lapply(LETTERS[1:5], function(L) data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) sampleList.DT <- lapply(sampleList.DT, function(x) x[, StringCol := as.character(FactorCol)]) # Compare the column `FactorCol`: rbindlist(sampleList.DT) rbindlist(sampleList.DF) do.call(rbind, sampleList.DF) Interestingly, I originally thought it was levels dependent: (I would have expected, for example, the following to allow for the levels of the third list element, but it does not). sampleList.DF[[1]][, "FactorCol"] <- factor(c("A", "C", "A")) # all the levels in third element are present in the first all(levels(sampleList.DF[[3]][, "FactorCol"]) %in% levels(sampleList.DF[[1]][, "FactorCol"])) # [1] TRUE But... rbindlist(sampleList.DF) However: sampleList.DF[[1]][, "FactorCol"] <- factor(c("C", "A", "A"), levels=c("C", "A")) rbindlist(sampleList.DF) Is the above behavior intended? Cheers, Rick -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Thu Mar 28 18:34:29 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Thu, 28 Mar 2013 13:34:29 -0400 Subject: [datatable-help] rbindlist on list of data.frames with factor column In-Reply-To: References: Message-ID: My apologies, I had a mistake in my previous email. 
(I forgot that data.table does not coerce strings to factor) It looks like the `rbindlist` behavior observed occurs for *both*, a list of data.tables and a list of data.frames (assuming, of course, that there is a factor column present) # sample data, using data.frame set.seed(1) sampleList.DF <- lapply(LETTERS[1:5], function(L) data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) sampleList.DF <- lapply(sampleList.DF, function(x) {x$StringCol <- as.character(x$FactorCol); x}) # sample data, using data.table set.seed(1) sampleList.DT <- lapply(LETTERS[1:5], function(L) data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) sampleList.DT <- lapply(sampleList.DT, function(x) x[, StringCol := as.character(FactorCol)]) # rbindlist results: rbindlist(sampleList.DT) rbindlist(sampleList.DF) # expected behavior similar to do.call(rbind, LIST) do.call(rbind, sampleList.DF) do.call(rbind, sampleList.DT) On Thu, Mar 28, 2013 at 12:52 PM, Ricardo Saporta < saporta at scarletmail.rutgers.edu> wrote: > Hello, > > I found that when using `rbindlist` on a list of data.frames with factor > columns, the factor column is getting concat'd as its numeric equivalent. > > This of course, does not happen when using a list of data.tables. 
> > # sample data, using data.frame > sampleList.DF <- lapply(LETTERS[1:5], function(L) > data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) > > sampleList.DF <- lapply(sampleList.DF, function(x) > {x$StringCol <- as.character(x$FactorCol); x}) > > # sample data, using data.table > sampleList.DT <- lapply(LETTERS[1:5], function(L) > data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) > sampleList.DT <- lapply(sampleList.DT, function(x) > x[, StringCol := as.character(FactorCol)]) > > > # Compare the column `FactorCol`: > > rbindlist(sampleList.DT) > rbindlist(sampleList.DF) > do.call(rbind, sampleList.DF) > > Interestingly, I originally thought it was levels dependent: > (I would have expected, for example, the following to allow for the levels > of the third list element, but it does not). > > sampleList.DF[[1]][, "FactorCol"] <- factor(c("A", "C", "A")) > > # all the levels in third element are present in the first > all(levels(sampleList.DF[[3]][, "FactorCol"]) %in% > levels(sampleList.DF[[1]][, "FactorCol"])) > # [1] TRUE > > But... > > rbindlist(sampleList.DF) > > However: > > sampleList.DF[[1]][, "FactorCol"] <- factor(c("C", "A", "A"), > levels=c("C", "A")) > rbindlist(sampleList.DF) > > Is the above behavior intended? > > Cheers, > Rick > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Mar 29 02:04:32 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 29 Mar 2013 01:04:32 +0000 Subject: [datatable-help] rbindlist on list of data.frames with factor column In-Reply-To: References: Message-ID: Well spotted. Looking at the C source just now it looks like I never considered factor columns in rbindlist(). At the time I needed rbindlist, I needed it quickly for something I was doing, which didn't use factor columns. Please file as a bug report. Should be fairly easy to implement, and quick in C. 
It would populate the column as if it were character (without actually converting to a new character vector for each item of l) and then call factor() at R level afterwards to refactor it. Matthew On 28.03.2013 17:34, Ricardo Saporta wrote: > My apologies, I had a mistake in my previous email. (I forgot that data.table does not coerce strings to factor) > It looks like the `rbindlist` behavior observed occurs for _BOTH_, a list of data.tables and a list of data.frames (assuming, of course, that there is a factor column present) > # sample data, using data.frame > set.seed(1) > sampleList.DF <- lapply(LETTERS[1:5], function(L) > data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) > sampleList.DF <- lapply(sampleList.DF, function(x) > {x$StringCol <- as.character(x$FactorCol); x}) > # sample data, using data.table > set.seed(1) > sampleList.DT <- lapply(LETTERS[1:5], function(L) > data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=factor(L)) ) > sampleList.DT <- lapply(sampleList.DT, function(x) > x[, StringCol := as.character(FactorCol)]) > # rbindlist results: > rbindlist(sampleList.DT) > rbindlist(sampleList.DF) > # expected behavior similar to do.call(rbind, LIST) > do.call(rbind, sampleList.DF) > do.call(rbind, sampleList.DT) > > On Thu, Mar 28, 2013 at 12:52 PM, Ricardo Saporta wrote: > >> Hello, >> I found that when using `rbindlist` on a list of data.frames with factor columns, the factor column is getting concat'd as its numeric equivalent. >> This of course, does not happen when using a list of data.tables. 
>> # sample data, using data.frame >> sampleList.DF <- lapply(LETTERS[1:5], function(L) >> data.frame(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) >> sampleList.DF <- lapply(sampleList.DF, function(x) >> {x$StringCol <- as.character(x$FactorCol); x}) >> # sample data, using data.table >> sampleList.DT <- lapply(LETTERS[1:5], function(L) >> data.table(Val1=rnorm(3), Val2=runif(3), FactorCol=L) ) >> sampleList.DT <- lapply(sampleList.DT, function(x) >> x[, StringCol := as.character(FactorCol)]) >> # Compare the column `FactorCol`: >> rbindlist(sampleList.DT) >> rbindlist(sampleList.DF) >> do.call(rbind, sampleList.DF) >> Interestingly, I originally thought it was levels dependent: >> (I would have expected, for example, the following to allow for the levels of the third list element, but it does not). >> sampleList.DF[[1]][, "FactorCol"] <- factor(c("A", "C", "A")) >> >> # all the levels in third element are present in the first >> all(levels(sampleList.DF[[3]][, "FactorCol"]) %in% levels(sampleList.DF[[1]][, "FactorCol"])) >> # [1] TRUE >> But... >> rbindlist(sampleList.DF) >> However: >> sampleList.DF[[1]][, "FactorCol"] <- factor(c("C", "A", "A"), levels=c("C", "A")) >> rbindlist(sampleList.DF) >> >> Is the above behavior intended? >> Cheers, >> Rick Links: ------ [1] mailto:saporta at scarletmail.rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... URL:
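[Editor's note] Until rbindlist() handles factor columns natively, the approach Matthew describes (fill the column as character, then call factor() at R level afterwards) can also be applied as a user-level workaround. The sketch below is an assumption-laden illustration, not code from the thread: the helper name rbindlist_keep_factors is hypothetical, and it assumes a data.table version where as.data.table() and set() behave as documented.

```r
library(data.table)

# Hypothetical helper: bind a list of data.frames/data.tables while
# preserving factor columns, by round-tripping them through character.
rbindlist_keep_factors <- function(l) {
  # which columns are factors, judged from the first list element
  factor_cols <- names(l[[1]])[sapply(l[[1]], is.factor)]
  # convert factor columns to character in every list element
  l <- lapply(l, function(x) {
    x <- as.data.table(x)
    for (col in factor_cols)
      set(x, j = col, value = as.character(x[[col]]))
    x
  })
  res <- rbindlist(l)
  # refactor at R level, so the levels are the union of all values seen
  for (col in factor_cols)
    set(res, j = col, value = factor(res[[col]]))
  res
}

# usage with the sample data from the thread:
sampleList.DF <- lapply(LETTERS[1:5], function(L)
  data.frame(Val1 = rnorm(3), Val2 = runif(3), FactorCol = factor(L)))
rbindlist_keep_factors(sampleList.DF)$FactorCol  # factor with levels A B C D E
```

This avoids the numeric-code concatenation shown above because the bind itself only ever sees character data; factor() is applied once, to the full combined column.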