From Veronica_Vaca at inec.gob.ec Wed Jan 8 20:27:04 2014
From: Veronica_Vaca at inec.gob.ec (INEC Verónica Vaca)
Date: Wed, 8 Jan 2014 19:27:04 +0000
Subject: [datatable-help] Reading spss files
Message-ID: <1FEA41707A83D642857978B6EC6C698831495CD0@servernmail.pcentral.inec.gov.ec>

Dear Members of the list

I am trying to read some large SPSS files, but I have only found ways to read them that convert them into data frames.

Is there a function in the data.table package that can import .sav files?

Or is there a way to combine this package with other packages for importing data?

Greetings,

Verónica Vaca
NORMATIVAS Y METODOLOGÍAS DEL SEN
INSTITUTO NACIONAL DE ESTADÍSTICA Y CENSOS (INEC)
Juan Larrea N15-36 y José Riofrío
Telf.: (593 2) 2544326/2544561 Ext. 1302
www.ecuadorencifras.gob.ec
Quito - Ecuador
________________________________
We are responsible for protecting the environment. Before printing this email, please confirm that it is necessary. Thank you.

"Confidentiality clause: The information contained in this message is confidential, is intended exclusively for its recipient, and cannot be binding. INEC accepts no responsibility for its use and expressly states that the information as originally sent is held in the Institution's records. This message is protected by the Ley de Propiedad Intelectual, the Ley de Comercio Electrónico, Firmas y Mensajes de Datos, and related regulations and international agreements. If you are not the intended recipient of this message, we recommend deleting it immediately. Distributing or copying it is prohibited and will be sanctioned in accordance with the Código Penal and other applicable rules. Transmission of information by email does not guarantee that it is secure or free of error; verifying it is therefore recommended. Any request for information made officially to INEC must be submitted through the Archivo General and addressed to the Institution's highest authority, in accordance with the law and other rules in force."

From lianoglou.steve at gene.com Wed Jan 8 20:40:08 2014
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Wed, 8 Jan 2014 11:40:08 -0800
Subject: [datatable-help] Reading spss files
In-Reply-To: <1FEA41707A83D642857978B6EC6C698831495CD0@servernmail.pcentral.inec.gov.ec>
References: <1FEA41707A83D642857978B6EC6C698831495CD0@servernmail.pcentral.inec.gov.ec>
Message-ID:

Hi Veronica,

2014/1/8 INEC Verónica Vaca:

> Dear Members of the list
>
> I am trying to read some large SPSS files, but I have only found ways to
> read them that convert them into data frames.
>
> Is there a function in the data.table package that can import .sav files?
>
> Or is there a way to combine this package with other packages for
> importing data?

I've never used it, but it looks like the 'foreign' package can read (some types of?) SPSS data and import it into R:

http://cran.r-project.org/web/packages/foreign/index.html

If the data is amenable to becoming a data.table (i.e. it's data.frame-like), you can convert it "in the usual way."

Note that in the development version of data.table, Arun has implemented a setDT function, which converts a data.frame to a data.table without making a copy. That might be helpful to you if your data.frame eats up enough RAM.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

From Veronica_Vaca at inec.gob.ec Wed Jan 8 21:31:57 2014
From: Veronica_Vaca at inec.gob.ec (INEC Verónica Vaca)
Date: Wed, 8 Jan 2014 20:31:57 +0000
Subject: [datatable-help] Reading spss files
In-Reply-To:
References: <1FEA41707A83D642857978B6EC6C698831495CD0@servernmail.pcentral.inec.gov.ec>
Message-ID: <1FEA41707A83D642857978B6EC6C698831495D0E@servernmail.pcentral.inec.gov.ec>

Thank you Steve

I have tried the foreign package, but I run into the memory problem. I can't find the function setDT; I thought it would be very helpful, because the data frame takes up all the memory.

I am very new to R, so I don't know how to get that function if it is only in the development version of the package.

Greetings
Veronica

From lianoglou.steve at gene.com Wed Jan 8 21:47:35 2014
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Wed, 8 Jan 2014 12:47:35 -0800
Subject: [datatable-help] Reading spss files
In-Reply-To: <1FEA41707A83D642857978B6EC6C698831495D0E@servernmail.pcentral.inec.gov.ec>
References: <1FEA41707A83D642857978B6EC6C698831495CD0@servernmail.pcentral.inec.gov.ec> <1FEA41707A83D642857978B6EC6C698831495D0E@servernmail.pcentral.inec.gov.ec>
Message-ID:

Hi,

2014/1/8 INEC Verónica Vaca:
> Thank you Steve
>
> I have tried the foreign package, but I run into the memory problem.
> I can't find the function setDT; I thought it would be very helpful,
> because the data frame takes up all the memory.
>
> I am very new to R, so I don't know how to get that function if it is
> only in the development version of the package.

setDT is only in the development version of data.table, which is available via SVN (and perhaps compiled) on r-forge:

https://r-forge.r-project.org/projects/datatable/

If the foreign package can't even read that data into memory for you, though, then using the fancy setDT will be of no use to you, since it requires that the object already be loaded as a data.frame.

Is the SPSS object a table? (Are they all tables? I have no idea; I have never used it.)

Could you dump the data into a database from within SPSS and then access it that way from R? An SQLite database would be the first/easiest choice.

* How big is the data (rows x columns)?
* How much RAM do you have?
* Are you on a 64-bit machine? Is R running in 64-bit mode? (The value of `.Machine$sizeof.pointer` should be 8.)

Unfortunately, if the data can't fit into the amount of usable RAM you have, then data.table will not be able to help you, unless getting more RAM is an option for you.
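The checks and the conversion route discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: it presumes a data.table release that already ships setDT, the helper name read_sav_as_dt is hypothetical, and any path passed to it is a placeholder. It also only helps when read.spss itself fits the data into RAM.

```r
library(data.table)

# TRUE when R is running in 64-bit mode (pointers are 8 bytes wide).
is_64bit <- .Machine$sizeof.pointer == 8

# Hypothetical helper (not part of data.table or foreign): read a .sav file
# with foreign::read.spss and convert the result to a data.table in place.
read_sav_as_dt <- function(path) {
  df <- foreign::read.spss(path, to.data.frame = TRUE)
  setDT(df)  # converts by reference, so no second copy of the data is made
  df
}

# setDT demonstrated on a small in-memory data.frame instead of a real file.
df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
setDT(df)
class(df)  # "data.table" "data.frame"
```

With a real file, the call would look like read_sav_as_dt("survey.sav"), where the file name is made up for this example.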
If you can't fit the data into RAM, but you can dump it into a database and still want to use R to do split/apply/combine computation over the data as described here:

http://www.jstatsoft.org/v40/i01/paper

you might consider looking at the dplyr package that Hadley Wickham is developing, which I believe supports (or will sometime in the future) working with data stored in a database (among other places) in order to perform split/apply/combine operations:

https://github.com/hadley/dplyr

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

From statquant at outlook.com Fri Jan 10 16:51:01 2014
From: statquant at outlook.com (statquant3)
Date: Fri, 10 Jan 2014 07:51:01 -0800 (PST)
Subject: [datatable-help] fread crash
Message-ID: <1389369061207-4683394.post@n4.nabble.com>

I got the following crash message using fread:

R) data <- fread(FILE)

 *** caught segfault ***
address 0x2acf7d5000, cause 'memory not mapped'

Traceback:
 1: fread(FILE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

FYI:
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

That's probably due to the file itself. The csv file is generated by kdb; head/tail looks OK, though.
fread(FILE,nrows=5) crashes too.

What can I do to trace back the issue?

--
View this message in context: http://r.789695.n4.nabble.com/fread-crash-tp4683394.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com Fri Jan 10 16:54:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 10 Jan 2014 16:54:12 +0100
Subject: [datatable-help] fread crash
In-Reply-To: <1389369061207-4683394.post@n4.nabble.com>
References: <1389369061207-4683394.post@n4.nabble.com>
Message-ID:

statquant,

Thanks for reporting. One useful thing would be to do:

options(datatable.verbose=TRUE)

and copy/paste the messages displayed as well. If you can reproduce it with a small file (as you seem to have), maybe you can also attach it (here or on the bugs page, once it's clear it's a bug)?

Arun

From statquant at outlook.com Fri Jan 10 17:04:31 2014
From: statquant at outlook.com (statquant3)
Date: Fri, 10 Jan 2014 08:04:31 -0800 (PST)
Subject: [datatable-help] fread crash
In-Reply-To:
References: <1389369061207-4683394.post@n4.nabble.com>
Message-ID: <1389369871015-4683397.post@n4.nabble.com>

R) sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] data.table_1.8.11 [**something I cannot disclose (...I know...)**]

loaded via a namespace (and not attached):
[1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2

======================================================
R) data <- fread(FILE, verbose=T)
Input contains no \n. Taking this to be a filename to open
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 22 columns
First row with 22 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 5094763
Subtracted 0 for last eol and any trailing empty lines, leaving 5094763 data rows
Type codes: 4444444433333331411433 (first 5 rows)
Type codes: 4444444433333333431433 (+middle 5 rows)

 *** caught segfault ***
address 0x2ae7f1a000, cause 'memory not mapped'

Traceback:
 1: fread(FILE, verbose = T)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

====================
Here is the output for 5 lines; it worked this time.

R) dataT <- fread(FILE, nrow=5, verbose=T)
Input contains no \n. Taking this to be a filename to open
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 22 columns
First row with 22 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 5094763
Subtracted 0 for last eol and any trailing empty lines, leaving 5094763 data rows
nrow limited to nrows passed in (5)
Type codes: 4444444433333331411433 (first 5 rows)
Type codes: 4444444433333331411433 (after applying colClasses and integer64)
Type codes: 4444444433333331411433 (after applying drop or select (if supplied)
Allocating 22 column slots (22 - 0 NULL)
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
2.760s (100%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 5x22 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
2.760s Total
Warning message:
In fread(FILE, nrow = 5, verbose = T) : Mapped file ok but madvise failed

--
View this message in context: http://r.789695.n4.nabble.com/fread-crash-tp4683394p4683397.html
Sent from the datatable-help mailing list archive at Nabble.com.
From mdowle at mdowle.plus.com Fri Jan 10 17:11:38 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 10 Jan 2014 16:11:38 +0000
Subject: [datatable-help] fread crash
In-Reply-To: <1389369871015-4683397.post@n4.nabble.com>
References: <1389369061207-4683394.post@n4.nabble.com> <1389369871015-4683397.post@n4.nabble.com>
Message-ID: <52D01BBA.5030602@mdowle.plus.com>

On 10/01/14 16:04, statquant3 wrote:
> [...]
> R) data <- fread(FILE, verbose=T)
> [...]
> Type codes: 4444444433333331411433 (first 5 rows)
> Type codes: 4444444433333333431433 (+middle 5 rows)

Seems to be crashing when detecting types using the last 5 rows. Can you see anything odd near the end of the file?
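One generic way to look for "anything odd" in a delimited file, without going through fread, is to count the fields on every line with base R's count.fields and flag lines whose count differs from the header's. The sketch below is not something prescribed in this thread, and the temporary file with a short last line is made up purely for illustration; nothing here is known about the actual kdb-generated file.

```r
# Build a small example file whose last line is missing a field.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6", "7,8"), path)

# Number of comma-separated fields on each line (count.fields is in utils).
n_fields <- count.fields(path, sep = ",", quote = "\"", comment.char = "")

# Lines whose field count differs from the first (header) line.
odd <- which(n_fields != n_fields[1])

tail(n_fields, 5)  # field counts near the end of the file
odd                # 4
```

On a multi-million-row file this reads every line, so it can take a while, but it stays within base R.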
From hkirsten at imise.uni-leipzig.de Sat Jan 11 15:18:48 2014
From: hkirsten at imise.uni-leipzig.de (Holger Kirsten)
Date: Sat, 11 Jan 2014 15:18:48 +0100
Subject: [datatable-help] setnames changes names of other data.table
Message-ID: <2175122676-15998@hogwarts.imise.uni-leipzig.de>

In a debugging session, I found that setnames changed the names of an identical data.table even though it had a different name:

> ############### using setnames()
> require(data.table)
> mytab = data.table(a = letters[1:4], b = 1:4 )
> str(mytab)
Classes 'data.table' and 'data.frame':  4 obs. of  2 variables:
 $ a: chr  "a" "b" "c" "d"
 $ b: int  1 2 3 4
 - attr(*, ".internal.selfref")=<externalptr>
> mytab
   a b
1: a 1
2: b 2
3: c 3
4: d 4
>
> othertab = mytab
> othertab
   a b
1: a 1
2: b 2
3: c 3
4: d 4
> setnames(othertab, c("a", "b"), c("aa","bb"))
> othertab
   aa bb
1:  a  1
2:  b  2
3:  c  3
4:  d  4
> mytab ## names have unexpectedly changed too
   aa bb
1:  a  1
2:  b  2
3:  c  3
4:  d  4
>
> ############### using names()
> mytab = data.table(a = letters[1:4], b = 1:4 )
> str(mytab)
Classes 'data.table' and 'data.frame':  4 obs. of  2 variables:
 $ a: chr  "a" "b" "c" "d"
 $ b: int  1 2 3 4
 - attr(*, ".internal.selfref")=<externalptr>
> mytab
   a b
1: a 1
2: b 2
3: c 3
4: d 4
>
> othertab = mytab
> othertab
   a b
1: a 1
2: b 2
3: c 3
4: d 4
> names(othertab) = c("aa","bb")
Warning message:
In `names<-.data.table`(`*tmp*`, value = c("aa", "bb")) :
  The names(x)<-value syntax copies the whole table. This is due to <- in R itself. Please change to setnames(x,old,new) which does not copy and is faster. See help('setnames'). You can safely ignore this warning if it is inconvenient to change right now. Setting options(warn=2) turns this warning into an error, so you can then use traceback() to find and change your names<- calls.
> othertab
   aa bb
1:  a  1
2:  b  2
3:  c  3
4:  d  4
> mytab ## names unchanged as expected
   a b
1: a 1
2: b 2
3: c 3
4: d 4
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
    LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.10

loaded via a namespace (and not attached):
[1] tools_3.0.1

From aragorn168b at gmail.com Sat Jan 11 15:31:53 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 11 Jan 2014 15:31:53 +0100
Subject: [datatable-help] setnames changes names of other data.table
In-Reply-To: <2175122676-15998@hogwarts.imise.uni-leipzig.de>
References: <2175122676-15998@hogwarts.imise.uni-leipzig.de>
Message-ID:

Thanks for reporting. That's expected behaviour. Use an explicit copy.

In short, when you do DT1 <- DT2, no copy is made. Both names still reference/point to the same location (try tracemem(DT1) and tracemem(DT2)).

So when you change the names of one DT by reference, the other one gets changed as well: they're both pointing to the same location.

To overcome this, when you want to duplicate a DT, explicitly use copy. That is, DT1 <- copy(DT2). Now if you setnames(DT1, c("x", "y")), the names of DT2 won't change.

I think there's a FR somewhere on documenting this... Thanks again for reporting (with a nice example).
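The difference between a plain assignment and an explicit copy described above can be seen in a few lines. This is a minimal sketch reusing the DT1/DT2 names from the explanation; the variable name alias is introduced here only for illustration.

```r
library(data.table)

DT2 <- data.table(a = letters[1:4], b = 1:4)

alias <- DT2        # plain assignment: both names point to the same table
DT1   <- copy(DT2)  # explicit copy: an independent table

# Renaming the copy by reference leaves DT2 alone.
setnames(DT1, c("a", "b"), c("x", "y"))
names(DT2)  # "a" "b"

# Renaming through the alias changes DT2 as well.
setnames(alias, c("a", "b"), c("aa", "bb"))
names(DT2)  # "aa" "bb"
```

The copy costs memory proportional to the table, so it is only worth making when the two tables really need to diverge.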
Arun

From hkirsten at imise.uni-leipzig.de Sat Jan 11 15:34:35 2014
From: hkirsten at imise.uni-leipzig.de (Holger Kirsten)
Date: Sat, 11 Jan 2014 15:34:35 +0100
Subject: [datatable-help] setnames changes names of other data.table
In-Reply-To:
Message-ID: <2176371008-27468@hogwarts.imise.uni-leipzig.de>

Thanks for the immediate answer! Now everything is clear to me.

From mdowle at mdowle.plus.com Sat Jan 11 15:36:44 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Sat, 11 Jan 2014 14:36:44 +0000
Subject: [datatable-help] setnames changes names of other data.table
In-Reply-To: <2175122676-15998@hogwarts.imise.uni-leipzig.de>
References: <2175122676-15998@hogwarts.imise.uni-leipzig.de>
Message-ID: <52D156FC.8020109@mdowle.plus.com>

Yes, that's correct behaviour and desirable/crucial. The set* functions and := do not copy-on-write, unlike base. They work by reference. See ?copy and type example(copy) at the prompt. If you *really* want to copy a 20GB data.table in RAM, use DT2 <- copy(DT).

http://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another

Matt

From mdowle at mdowle.plus.com Sat Jan 11 15:53:00 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Sat, 11 Jan 2014 14:53:00 +0000
Subject: [datatable-help] setnames changes names of other data.table
In-Reply-To:
References: <2175122676-15998@hogwarts.imise.uni-leipzig.de>
Message-ID: <52D15ACC.6020006@mdowle.plus.com>

On 11/01/14 14:31, Arunkumar Srinivasan wrote:
> Thanks for reporting. That's expected behaviour. Use an explicit copy.
>
> In short, when you do: DT1 <- DT2, there's no copy being made.
> They still reference/point to the same location (try doing
> tracemem(DT1) and tracemem(DT2)).

Just to be clear, that's no different to base. DF1 <- DF2 makes no copy in base either. In fact, x <- y never makes a copy in R, regardless of what x and y are. The phrase "copy-on-write" is terribly named because it might imply that DF1 <- DF2 copies. I think the term should be "copy-on-subassign", because that's really what R does. Only at the point of changing a sub-element of an object does <- copy (if another symbol is pointing to that same object).
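The "copy-on-subassign" point can be checked on a plain data.frame. The sketch below uses data.table's address() only to compare memory addresses; the DF1/DF2 names follow the paragraph above, and same_before/same_after are illustrative names.

```r
library(data.table)  # loaded here only for address()

DF1 <- data.frame(a = 1:3, b = c("x", "y", "z"))
DF2 <- DF1                       # plain assignment: no copy is made yet

same_before <- identical(address(DF1), address(DF2))

DF2$a[1] <- 99L                  # subassignment: DF2 is copied before the change

same_after <- identical(address(DF1), address(DF2))

same_before  # TRUE
same_after   # FALSE
DF1$a[1]     # 1
```

tracemem(), mentioned earlier in the thread, reports the same copy as it happens, provided R was built with memory profiling enabled.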
It is switching from <- to set and := that does things by reference. Not switching from data.frame to data.table. Subassigning to a data.table using <- will still copy the entire data.table, just like base. Only set* and := can modify by reference. In fact, set* can be used on data.frame too, and other objects; e.g. setattr is often useful on non-data.table's and therefore copy() is useful on non-data.table's too. Hope that clarifies. > > So when you change the names of one |DT| by reference, the other one > will get changed as well - they're both pointing to the same location. > > To overcome this, when you want to duplicate a |DT|, explicitly use > |copy|. That is, |DT1 <- copy(DT2)|. Now if you |setnames(DT1, c("x", > "y"))|, then |DT2| names won't get changed. > > I think there's a FR somewhere on documenting this... Thanks again for > reporting (with nice example). > > > Arun > ------------------------------------------------------------------------ > From: Holger Kirsten Holger Kirsten > Reply: Holger Kirsten hkirsten at imise.uni-leipzig.de > > Date: January 11, 2014 at 3:19:02 PM > To: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > > Subject: [datatable-help] setnames changes names of other data.table >> In a debugging session, I found that setnames changed the names of an >> identical data.table although having a different name> >> >> > ############### using setnames() >> > require(data.table) >> > mytab = data.table(a = letters[1:4], b = 1:4 ) >> > str(mytab) >> Classes 'data.table' and 'data.frame': 4 obs. 
of 2 variables: >> $ a: chr "a" "b" "c" "d" >> $ b: int 1 2 3 4 >> - attr(*, ".internal.selfref")= >> > mytab >> a b >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > >> > othertab = mytab >> > othertab >> a b >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > setnames(othertab, c("a", "b"), c("aa","bb")) >> > othertab >> aa bb >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > mytab ## names have unexpectedly changed too >> aa bb >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > >> > ############### using names() >> > mytab = data.table(a = letters[1:4], b = 1:4 ) >> > str(mytab) >> Classes 'data.table' and 'data.frame': 4 obs. of 2 variables: >> $ a: chr "a" "b" "c" "d" >> $ b: int 1 2 3 4 >> - attr(*, ".internal.selfref")= >> > mytab >> a b >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > >> > othertab = mytab >> > othertab >> a b >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > names(othertab) = c("aa","bb") >> Warning message: >> In `names<-.data.table`(`*tmp*`, value = c("aa", "bb")) : >> The names(x)<-value syntax copies the whole table. This is due to >> <- in R itself. Please change to setnames(x,old,new) which does not >> copy and is faster. See help('setnames'). You can safely ignore this >> warning if it is inconvenient to change right now. Setting >> options(warn=2) turns this warning into an error, so you can then use >> traceback() to find and change your names<- calls. 
>> > othertab >> aa bb >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > mytab ## names unchanged as expected >> a b >> 1: a 1 >> 2: b 2 >> 3: c 3 >> 4: d 4 >> > >> > sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 >> LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] data.table_1.8.10 >> >> loaded via a namespace (and not attached): >> [1] tools_3.0.1 >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From rcse2006 at gmail.com Tue Jan 21 19:17:08 2014 From: rcse2006 at gmail.com (rcse2006) Date: Tue, 21 Jan 2014 10:17:08 -0800 (PST) Subject: [datatable-help] Error in structure(ordered, dim = ns) : dims [product 1] do not match the length of object [0] Message-ID: <1390328228940-4683923.post@n4.nabble.com> Trying to run below code. 
library(quantmod) symbols <- c("AAPL", "DELL", "GOOG", "MSFT", "AMZN", "BIDU", "EBAY", "YHOO") d <- list() for(s in symbols) { tmp <- getSymbols(s, auto.assign=FALSE, verbose=TRUE) tmp <- Ad(tmp) names(tmp) <- "price" tmp <- data.frame( date=index(tmp), id=s, price=coredata(tmp) ) d[[s]] <- tmp } d <- do.call(rbind, d) d <- d[ d$date >= as.Date("2007-01-01"), ] rownames(d) <- NULL # Weekly returns library(plyr) library(reshape2) d$next_friday <- d$date - as.numeric(format(d$date, "%u")) + 5 d <- subset(d, date==next_friday) d <- ddply(d, "id", mutate, previous_price = lag(xts(price,date)), log_return = log(price / previous_price), simple_return = price / previous_price - 1 ) d <- dcast(d, date ~ id, value.var="simple_return") Getting error > d <- dcast(d, date ~ id, value.var="simple_return") Error in structure(ordered, dim = ns) : dims [product 1] do not match the length of object [0] Please help me how to use ddply and dcast or using other similar function to get same data. -- View this message in context: http://r.789695.n4.nabble.com/Error-in-structure-ordered-dim-ns-dims-product-1-do-not-match-the-length-of-object-0-tp4683923.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Jan 21 22:07:18 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Tue, 21 Jan 2014 21:07:18 +0000 Subject: [datatable-help] Error in structure(ordered, dim = ns) : dims [product 1] do not match the length of object [0] In-Reply-To: <1390328228940-4683923.post@n4.nabble.com> References: <1390328228940-4683923.post@n4.nabble.com> Message-ID: <52DEE186.2050200@mdowle.plus.com> Hello, This mailing list is just for the data.table package. You could try the [plyr] tag on Stack Overflow or look at the documentation for those packages to find out where support is for those packages. You probably intended to post to r-help, which is the level up in Nabble. 
To post here, you have to join this list and see the welcome message that this is just about data.table. If that didn't work please let us know. Matt On 21/01/14 18:17, rcse2006 wrote: > Trying to run below code. > > library(quantmod) > symbols <- c("AAPL", "DELL", "GOOG", "MSFT", "AMZN", "BIDU", "EBAY", "YHOO") > d <- list() > for(s in symbols) { > tmp <- getSymbols(s, auto.assign=FALSE, verbose=TRUE) > tmp <- Ad(tmp) > names(tmp) <- "price" > tmp <- data.frame( date=index(tmp), id=s, price=coredata(tmp) ) > d[[s]] <- tmp > } > d <- do.call(rbind, d) > d <- d[ d$date >= as.Date("2007-01-01"), ] > rownames(d) <- NULL > > # Weekly returns > library(plyr) > library(reshape2) > d$next_friday <- d$date - as.numeric(format(d$date, "%u")) + 5 > d <- subset(d, date==next_friday) > d <- ddply(d, "id", mutate, > previous_price = lag(xts(price,date)), > log_return = log(price / previous_price), > simple_return = price / previous_price - 1 > ) > d <- dcast(d, date ~ id, value.var="simple_return") > > Getting error > >> d <- dcast(d, date ~ id, value.var="simple_return") > Error in structure(ordered, dim = ns) : > dims [product 1] do not match the length of object [0] > > Please help me how to use ddply and dcast or using other similar function to > get same data. > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Error-in-structure-ordered-dim-ns-dims-product-1-do-not-match-the-length-of-object-0-tp4683923.html > Sent from the datatable-help mailing list archive at Nabble.com. 
From jholtman at gmail.com Wed Jan 22 17:21:01 2014
From: jholtman at gmail.com (jholtman)
Date: Wed, 22 Jan 2014 08:21:01 -0800 (PST)
Subject: [datatable-help] leading spaces on column names from CSV file
Message-ID: <1390407661392-4683986.post@n4.nabble.com>

I have a CSV file that I had been reading with 'read.csv' and it turns out that the column names had leading spaces in some instances, but read.csv would remove them.

I tried to read the file with fread and it was keeping the leading spaces. Here is dump of the session:

> positXY <- read.csv(positionFile, as.is = TRUE)
> str(positXY)
'data.frame':   12 obs. of  8 variables:
 $ ieee           : chr  "1972cd01004b1200" "6375cd01004b1200" "5875cd01004b1200" "1972cd01004b1200" ...
 $ startTime      : chr  "13:46" "13:46" "13:46" "13:53" ...  # no leading space on next three
 $ endTime        : chr  "13:51" "13:51" "13:51" "13:58" ...
 $ x              : int  65 65 65 65 65 65 65 65 65 65 ...
 $ y              : int  45 45 45 45 45 45 45 45 45 45 ...
 $ z              : num  3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
 $ deviceDirection: chr  "east" "west" "north" "south" ...
 $ test.          : logi  NA NA NA NA NA NA ...
> positXY <- fread(positionFile)
> str(positXY)
Classes 'data.table' and 'data.frame':  12 obs. of  8 variables:
 $ ieee           : chr  "1972cd01004b1200" "6375cd01004b1200" "5875cd01004b1200" "1972cd01004b1200" ...
 $  startTime     : chr  "13:46" "13:46" "13:46" "13:53" ...  # leading space on next three
 $  endTime       : chr  "13:51" "13:51" "13:51" "13:58" ...
 $  x             : int  65 65 65 65 65 65 65 65 65 65 ...
 $ y              : int  45 45 45 45 45 45 45 45 45 45 ...
 $ z              : num  3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
 $ deviceDirection: chr  "east" "west" "north" "south" ...
 $  test#         : int  NA NA NA NA NA NA NA NA NA NA ...
- attr(*, ".internal.selfref")= > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats grDevices utils datasets graphics methods [7] base other attached packages: [1] data.table_1.8.10 bitops_1.0-6 loaded via a namespace (and not attached): [1] tools_3.0.2 Is it possible to have the leading spaces removed automatically, or via a parameter? -- View this message in context: http://r.789695.n4.nabble.com/leading-spaces-on-column-names-from-CSV-file-tp4683986.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Wed Jan 22 20:40:50 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Wed, 22 Jan 2014 19:40:50 +0000 Subject: [datatable-help] leading spaces on column names from CSV file In-Reply-To: <1390407661392-4683986.post@n4.nabble.com> References: <1390407661392-4683986.post@n4.nabble.com> Message-ID: <52E01EC2.3090008@mdowle.plus.com> Hi Jim, Ok yes good idea. The roots of that are due to data.table allows leading and trailing spaces (and all special characters) in column names. The confusion that by="a, b, c" doesn't work due to the spaces, due to column name " b" being different to "b". But for fread that doesn't make sense and it should drop the leading spaces by default, yes. I've added this to the top of fread.c where I'm logging the fread ToDos. Thanks, Matt On 22/01/14 16:21, jholtman wrote: > I have a CSV file that I had been reading with 'read.csv' and it turns out > that the column names had leading spaces in some instances, but read.csv > would remove them. > > I tried to read the file with fread and it was keeping the leading spaces. 
From aragorn168b at gmail.com Wed Jan 22 20:54:26 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 22 Jan 2014 20:54:26 +0100
Subject: [datatable-help] Response to dplyr baseball vignette benchmarks

Hello,

Matthew and I have redone the benchmarks and posted a response to the dplyr's baseball vignette benchmark here: http://arunsrinivasan.github.io/dplyr_benchmark/

Have a look and let us know what you think!

Arun

From caneff at gmail.com Wed Jan 22 21:06:54 2014
From: caneff at gmail.com (Chris Neff)
Date: Wed, 22 Jan 2014 15:06:54 -0500
Subject: [datatable-help] Response to dplyr baseball vignette benchmarks

Thank you for responding to this so fast to get out ahead of the misleading aspects.

As another comparison, it would definitely be constructive to also use a data set that is larger than 10 MB. Something in the 1m+ row range perhaps.

On Wed, Jan 22, 2014 at 2:54 PM, Arunkumar Srinivasan wrote:
> Hello,
>
> Matthew and I have redone the benchmarks and posted a response to the
> dplyr's baseball vignette benchmark here:
> http://arunsrinivasan.github.io/dplyr_benchmark/
>
> Have a look and let us know what you think!
>
> Arun

From aragorn168b at gmail.com Wed Jan 22 21:09:18 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 22 Jan 2014 21:09:18 +0100
Subject: [datatable-help] Response to dplyr baseball vignette benchmarks

Chris,

Thanks. Yes that's the plan (the last line in the link). Once the next version of data.table is out on CRAN, the benchmarks should come out.

Arun

From caneff at gmail.com Wed Jan 22 21:17:08 2014
From: caneff at gmail.com (Chris Neff)
Date: Wed, 22 Jan 2014 15:17:08 -0500
Subject: [datatable-help] Response to dplyr baseball vignette benchmarks

When you do use larger data sets where it will matter, I think more strongly highlighting the in-place vs. copying differences will be key.
There is also the notion that yes, you should compare things as closely as possible when just doing standard benchmarking, but I think this is selling data.table a bit short by mimicking dplyr with copying. You show this a bit in the mutate example, but even in the arrange example the copy is slowing things down. It is so small that it wouldn't really make a ton of difference in this case, but with 10m rows the copying gets to be a large noticeable difference between data.table and standard data.frame methods like setnames vs names<- On Wed, Jan 22, 2014 at 3:09 PM, Arunkumar Srinivasan wrote: > Chris, > > Thanks. Yes that's the plan (the last line in the link). Once the next > version of data.table is out on CRAN, the benchmarks should come out. > > Arun > ------------------------------ > From: Chris Neff Chris Neff > Reply: Chris Neff caneff at gmail.com > Date: January 22, 2014 at 9:07:34 PM > To: Arunkumar Srinivasan aragorn168b at gmail.com > Subject: Re: [datatable-help] Response to dplyr baseball vignette > benchmarks > > Thank you for responding to this so fast to get out ahead of the > misleading aspects. > > As another comparison, it would definitely be constructive to also use a > data set that is larger than 10 MB. Something in the 1m+ row range perhaps. > > > On Wed, Jan 22, 2014 at 2:54 PM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> Hello, >> >> Matthew and I have redone the benchmarks and posted a response to the >> dplyr's >> baseball vignette benchmark here: >> http://arunsrinivasan.github.io/dplyr_benchmark/ >> >> Have a look and let us know what you think! >> >> Arun >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
From aragorn168b at gmail.com Wed Jan 22 21:20:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 22 Jan 2014 21:20:57 +0100
Subject: [datatable-help] Response to dplyr baseball vignette benchmarks

Chris,

You're 100% right. That's what we've conversed with Hadley as well. For this data, we decided to stick to this, as we weren't lagging behind "dplyr". This is also why I made the point that "However, when benchmarking one should be benchmarking the equivalent of an operation in each tool, not how one thinks the design should be." This is so that the next time we benchmark, we can do it the data.table way and dplyr way and not dplyr's data.table way.

Arun

From guenter.hitsch at mac.com Wed Jan 22 21:52:12 2014
From: guenter.hitsch at mac.com (Günter J. Hitsch)
Date: Wed, 22 Jan 2014 14:52:12 -0600
Subject: [datatable-help] segfault with "large" number of rows

I've been using data.table for several months. It's a great package - thank you for developing it!

Here's my question: I've run into a problem when I use "large" data tables with many millions of rows. In particular, for such large data tables I get segmentation faults when I create columns by groups. Example:

N = 2500 # No. of groups
T = 100000 # No. of observations per group

DT = data.table(group = rep(1:N, each = T), x = 1)
setkey(DT, group)

DT[, sum_x := sum(x), by = group]
print(head(DT))

This runs fine. But when I increase the number of groups, say from 2500 to 3000, I get a segfault:

N = 3000 # No. of groups
T = 100000 # No. of observations per group

...

*** caught segfault ***
address 0x159069140, cause 'memory not mapped'

Traceback:
 1: `[.data.table`(DT, , `:=`(sum_x, sum(x)), by = group)
 2: DT[, `:=`(sum_x, sum(x)), by = group]
 3: eval(expr, envir, enclos)
 4: eval(ei, envir)
 5: withVisible(eval(ei, envir))

I can reproduce this problem on:

(1) OS X 10.9, R 3.0.2, data.table 1.8.10
(2) Ubuntu 13.10, R 3.0.1, data.table 1.8.10

And of course the amount of RAM in my machines is not the issue.

Thanks in advance for your help with this!

Günter

From aragorn168b at gmail.com Wed Jan 22 22:35:09 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 22 Jan 2014 22:35:09 +0100
Subject: [datatable-help] segfault with "large" number of rows

Günter,

Great report! I'm able to reproduce it on 1.8.11 here. Will file a bug and look into it. Thanks again for reporting.

Arun

From jholtman at gmail.com Thu Jan 23 01:44:17 2014
From: jholtman at gmail.com (Jim Holtman)
Date: Wed, 22 Jan 2014 19:44:17 -0500
Subject: [datatable-help] leading spaces on column names from CSV file

Matt

Thanks for making the change. When will the update be available?

Sent from my Verizon Wireless 4G LTE Smartphone
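Until fread trims header names itself, the names can be cleaned up after reading. A minimal sketch, assuming data.table is installed: `positXY` is built by hand here as a stand-in for the result of `fread(positionFile)`, and the three columns are only a subset chosen for illustration.

```r
library(data.table)

# Stand-in for the fread() result: two of the names carry a leading space.
positXY <- data.table(ieee = "1972cd01004b1200",
                      startTime = "13:46",
                      endTime = "13:51")
setnames(positXY, c("ieee", " startTime", " endTime"))

# Strip leading and trailing whitespace from every column name, by reference.
setnames(positXY, gsub("^\\s+|\\s+$", "", names(positXY)))
names(positXY)
```

Because `setnames` works by reference, no copy of the table is made, which matters for the large files fread is usually used on.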

From aragorn168b at gmail.com Wed Jan 29 01:41:29 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 29 Jan 2014 01:41:29 +0100
Subject: [datatable-help] segfault with "large" number of rows

Hi Guenter,
CC: data.table list,

I filed this as bug #5305 and we've now fixed it with commit 1100, v1.8.11. Thank you very much once again for reporting!

On Wed, Jan 22, 2014 at 9:52 PM, "Günter J. Hitsch" wrote:
>
> I've been using data.table for several months. It's a great package - thank
> you for developing it!
>
> Here's my question: I've run into a problem when I use "large" data
> tables with many millions of rows. In particular, for such large data
> tables I get segmentation faults when I create columns by groups. Example:
>
> N = 2500 # No. of groups
> T = 100000 # No. of observations per group
>
> DT = data.table(group = rep(1:N, each = T), x = 1)
> setkey(DT, group)
>
> DT[, sum_x := sum(x), by = group]
> print(head(DT))
>
> This runs fine. But when I increase the number of groups, say from 2500
> to 3000, I get a segfault:
>
> N = 3000 # No. of groups
> T = 100000 # No. of observations per group
>
> ...
>
> *** caught segfault ***
> address 0x159069140, cause 'memory not mapped'
>
> Traceback:
> 1: `[.data.table`(DT, , `:=`(sum_x, sum(x)), by = group)
> 2: DT[, `:=`(sum_x, sum(x)), by = group]
> 3: eval(expr, envir, enclos)
> 4: eval(ei, envir)
> 5: withVisible(eval(ei, envir))
>
> I can reproduce this problem on:
>
> (1) OS X 10.9, R 3.0.2, data.table 1.8.10
> (2) Ubuntu 13.10, R 3.0.1, data.table 1.8.10
>
> And of course the amount of RAM in my machines is not the issue.
>
> Thanks in advance for your help with this!
>
> Günter
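
For reference, the grouped assignment from the report can be tried at a much smaller scale. This is only a sketch, assuming data.table is installed: `n_groups` and `n_obs` are illustrative names replacing `N` and `T` from the report, the sizes are tiny, and it is not expected to reproduce the segfault.

```r
library(data.table)

n_groups <- 3  # the report used 2500 (fine) and 3000 (segfault)
n_obs    <- 4  # the report used 100000 observations per group

DT <- data.table(group = rep(1:n_groups, each = n_obs), x = 1)
setkey(DT, group)

# Add a column by reference holding the per-group sum of x.
DT[, sum_x := sum(x), by = group]
print(head(DT))
```

Every row in a group receives that group's total, so here `sum_x` equals `n_obs` throughout.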