From fjbuch at gmail.com Mon Mar 9 19:08:29 2015 From: fjbuch at gmail.com (Farrel Buchinsky) Date: Mon, 9 Mar 2015 14:08:29 -0400 Subject: [datatable-help] error in rbindlist Message-ID: I am doing something really simple but getting an error I do not know how to troubleshoot. l = list(specimens, specs.post.crash) rbindlist(l, fill = TRUE) specimens has about 5 columns and specs.post.crash has about 29. I get this error. I cannot find anyone else with this problem. I cannot share my actual data.table since it has private information. Error in if (any(neg)) res[neg] = paste("-", res[neg], sep = "") : missing value where TRUE/FALSE needed Interested in the off-the-top-of-your-head suggestions of where to troubleshoot. Using {R package version 1.9.5} Farrel Buchinsky Google Voice Tel: (412) 567-7870 -------------- next part -------------- An HTML attachment was scrubbed... URL: From fjbuch at gmail.com Mon Mar 9 19:21:35 2015 From: fjbuch at gmail.com (Farrel Buchinsky) Date: Mon, 9 Mar 2015 14:21:35 -0400 Subject: [datatable-help] error in rbindlist In-Reply-To: References: Message-ID: I used a series of statements such as this l = list(specs.post.crash, specimens[,1:26, with=FALSE]) to work out which set of columns worked and which did not. I found the offending column kbdtimespecim is a column from sepcimens. It is the 27 th column. With it included the whole rbindlist collapses. Without it, it works. "15:04:48" "15:02:52" "15:06:44" "16:07:22" "16:07:22" "16:07:22" > class(specimens$kbdtimespecim) [1] "ITime" Any idea why? Farrel Buchinsky Google Voice Tel: (412) 567-7870 On Mon, Mar 9, 2015 at 2:08 PM, Farrel Buchinsky wrote: > I am doing something really simple but getting an error I do not know how > to troubleshoot. > > l = list(specimens, specs.post.crash) > rbindlist(l, fill = TRUE) > > specimens has about 5 columns and specs.post.crash has about 29. > > I get this error. I cannot find anyone else with this problem. I cannot > share my actual data.table since it has private information. > > Error in if (any(neg)) res[neg] = paste("-", res[neg], sep = "") : > missing value where TRUE/FALSE needed > > Interested in the off-the-top-of-your-head suggestions of where to > troubleshoot. > Using {R package version 1.9.5} > > > Farrel Buchinsky > Google Voice Tel: (412) 567-7870 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rivokl at gmail.com Tue Mar 10 15:12:48 2015 From: rivokl at gmail.com (Rivo R) Date: Tue, 10 Mar 2015 15:12:48 +0100 Subject: [datatable-help] NA handling not working? Message-ID: Hi all, I tried to load the following (huge) dataset as data.table but fread seems to choke on NA's. https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip Steps: 1- Dowload and unzip 2- > packageVersion("data.table") [1] ?1.9.4? 3- > tmp <- fread(dataFile, sep=';', header=TRUE, na.strings=c("NA","'?'", ""), + stringsAsFactors=FALSE, + colClasses=c(rep("character",2), rep("numeric",7)), verbose=TRUE) Input contains no \n. Taking this to be a filename to open File opened, filesize is 0.121897 GB. Memory mapping ... ok Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Looking for supplied sep ';' on line 30 (the last non blank line in the first 'autostart') ... 
found ok Found 9 columns First row with 9 fields occurs on line 1 (either column names or first row of data) 'header' changed by user from 'auto' to TRUE Count of eol after first data row: 2075260 Subtracted 1 for last eol and any trailing empty lines, leaving 2075259 data rows Type codes ( first 5 rows): 443333333 Type codes (+ middle 5 rows): 443333333 Type codes (+ last 5 rows): 443333333 Type codes: 443333333 (after applying colClasses and integer64) Type codes: 443333333 (after applying drop or select (if supplied) Allocating 9 column slots (9 - 0 dropped) Bumping column 3 from REAL to STR on data row 6840, field contains '?' Bumping column 4 from REAL to STR on data row 6840, field contains '?' Bumping column 5 from REAL to STR on data row 6840, field contains '?' Bumping column 6 from REAL to STR on data row 6840, field contains '?' Bumping column 7 from REAL to STR on data row 6840, field contains '?' Bumping column 8 from REAL to STR on data row 6840, field contains '?' Read 2075259 rows and 9 (of 9) columns from 0.122 GB file in 00:00:04 0.000s ( 0%) Memory map (rerun may be quicker) 0.001s ( 0%) sep and header detection 0.282s ( 7%) Count rows (wc -l) 0.002s ( 0%) Column type detection (first, middle and last 5 rows) 0.627s ( 16%) Allocation of 2075259x9 result (xMB) in RAM 2.525s ( 64%) Reading data 0.298s ( 8%) Allocation for type bumps (if any), including gc time if triggered 0.123s ( 3%) Coercing data already read in type bumps (if any) 0.059s ( 2%) Changing na.strings to NA 3.917s Total Any hint?? Kely From ica at ign.ku.dk Wed Mar 11 10:20:29 2015 From: ica at ign.ku.dk (Ingeborg Callesen) Date: Wed, 11 Mar 2015 09:20:29 +0000 Subject: [datatable-help] multiply columns in two vectors Message-ID: <1DD7F9F3A5AE9845846F5B44C395CDF1ADC0D02F@P1KITMBX03WC02.unicph.domain> Dear r-helpers out there Is there a function or a code string for multiplying the numbers in e.g. a n x m vector by another n x m vector ? The result should be a n x m vector with the product of the two numbers in each vector on the same position in the new vector. Thanks for any help in advance. BR Ingeborg Callesen -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Wed Mar 11 15:35:28 2015 From: my.r.help at gmail.com (Michael Smith) Date: Wed, 11 Mar 2015 22:35:28 +0800 Subject: [datatable-help] multiply columns in two vectors In-Reply-To: <1DD7F9F3A5AE9845846F5B44C395CDF1ADC0D02F@P1KITMBX03WC02.unicph.domain> References: <1DD7F9F3A5AE9845846F5B44C395CDF1ADC0D02F@P1KITMBX03WC02.unicph.domain> Message-ID: <550052B0.5020804@gmail.com> Vector? I guess you mean n x m matrix. Is `%*%` what you're looking for? And this question is probably more suitable for the r-help mailing list. On 03/11/2015 05:20 PM, Ingeborg Callesen wrote: > Dear r-helpers out there > > Is there a function or a code string for multiplying the numbers in e.g. > a n x m vector by another n x m vector ? The result should be a n x m > vector with the product of the two numbers in each vector on the same > position in the new vector. > > Thanks for any help in advance. 
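A note on the two operators for the question quoted above: the elementwise product described there is plain `*` in R, applied to two matrices of identical dimensions, while `%*%` is the matrix product and needs conformable dimensions. A small sketch using two made-up 2 x 3 matrices A and B:

A <- matrix(1:6, nrow = 2)     # 2 x 3
B <- matrix(7:12, nrow = 2)    # 2 x 3
A * B        # elementwise (Hadamard) product; result is again 2 x 3
A %*% t(B)   # matrix product; dimensions must conform (2 x 3 times 3 x 2)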
> > BR Ingeborg Callesen > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From npgraham1 at gmail.com Sat Mar 14 21:19:41 2015 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Sat, 14 Mar 2015 16:19:41 -0400 Subject: [datatable-help] multiple data.table calls Message-ID: There's particular problem I often have, and I'm hoping someone can tell me how to speed it up in data.table. It seems to involve a sort of recursion that data.table (as I'm using it) doesn't do well with, where for each record in a set, I do a another search within the same table. I hope the formatting of the code below is legible--it's a lot easier to read in the RStudio text editor! I have a moderately large (more than 3 million rows) data.table of the employment histories of brokers in the US. Each row is an employment record, with a unique individual id (icrdn), a unique firm id (fcrdn), a branch identifier (branch), start and end dates (fromdate and todate), and a few other items (each row has a unique id as well, called job.index). For example, finding all the brokers that ever worked at Stratton Oakmont (from the Wolf of Wall Street): employ.hist[fcrdn == 18692, icrdn] where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is the individual identifier. What I want is to find all the individuals that ever met a Stratton alum. Specifically, every icrdn such that the branch == a branch a Stratton alum ever worked at and the start and end dates overlap. The only way I've found to do so involves something like this: find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) { employ.hist[fromdate <= sdt & todate >= edt & branch == brnch, list(icrdn, branch, job.index, fcrdn)] }) stratton.people <- employ.hist[fcrdn == 18692, icrdn] stratton.contacts <- employ.hist[icrdn %in% stratton.people, find_brokers_by_single_branch(fromdate, todate, branch), by = "job.index"] This works, but effectively means calling the data.table '[' function thousands of times, once for each job entry a Stratton broker ever had (which are in the thousands, as many left before the government busted the place and are still in the industry). It's quite slow, and I'm hoping someone can show me a way to speed it up, as I have many similar tasks, some of which are vastly larger. Memory really isn't an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz), in case that helps. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu https://sites.google.com/site/npgraham1/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From fperickson at wisc.edu Sun Mar 15 17:16:26 2015 From: fperickson at wisc.edu (Frank Erickson) Date: Sun, 15 Mar 2015 12:16:26 -0400 Subject: [datatable-help] multiple data.table calls In-Reply-To: References: Message-ID: I'd suggest: (1) Get a table identifying the condition "meeting someone who at some point works at Stratton." (They aren't really "alums" if they haven't worked there yet, but this is the definition you seem to be looking for.) 
You can do this by looking at any (firm,date) combinations that involve bumping into such a person: meet.stratton <- unique(employ.hist[icrdn %in% stratton.people,list(fcrdn,date=fromdate:todate)]) (2) Find people who meet the conditions: setkey(employ.hist,fcrdn) met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate & date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)] (3) If you want to exclude Stratton folks, then use setdiff() --Frank On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham wrote: > There's particular problem I often have, and I'm hoping someone can tell > me how to speed it up in data.table. It seems to involve a sort of > recursion that data.table (as I'm using it) doesn't do well with, where for > each record in a set, I do a another search within the same table. I hope > the formatting of the code below is legible--it's a lot easier to read in > the RStudio text editor! > > I have a moderately large (more than 3 million rows) data.table of the > employment histories of brokers in the US. Each row is an employment > record, with a unique individual id (icrdn), a unique firm id (fcrdn), a > branch identifier (branch), start and end dates (fromdate and todate), and > a few other items (each row has a unique id as well, called job.index). > For example, finding all the brokers that ever worked at Stratton Oakmont > (from the Wolf of Wall Street): > > employ.hist[fcrdn == 18692, icrdn] > > where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is > the individual identifier. > > What I want is to find all the individuals that ever met a Stratton alum. > Specifically, every icrdn such that the branch == a branch a Stratton alum > ever worked at and the start and end dates overlap. The only way I've > found to do so involves something like this: > > find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) { > employ.hist[fromdate <= sdt & todate >= edt & branch == brnch, > list(icrdn, branch, job.index, fcrdn)] > }) > > stratton.people <- employ.hist[fcrdn == 18692, icrdn] > stratton.contacts <- employ.hist[icrdn %in% stratton.people, > find_brokers_by_single_branch(fromdate, > todate, branch), > by = "job.index"] > > This works, but effectively means calling the data.table '[' function > thousands of times, once for each job entry > a Stratton broker ever had (which are in the thousands, as many left > before the government busted the place > and are still in the industry). It's quite slow, and I'm hoping someone > can show me a way to speed it up, as I have > many similar tasks, some of which are vastly larger. Memory really isn't > an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz), > in case that helps. > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > https://sites.google.com/site/npgraham1/ > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Cyrille.laurent.sage at gmail.com Sun Mar 15 20:41:33 2015 From: Cyrille.laurent.sage at gmail.com (Papysounours) Date: Sun, 15 Mar 2015 12:41:33 -0700 (PDT) Subject: [datatable-help] R beginner Message-ID: <1426448493840-4704684.post@n4.nabble.com> Hi I am just starting R programming because i need it to analyse new sequencing data. 
I got two list of data (excel table) one is gene list with chromosomal position (like start:123456 end:124567), the other is miRNA list with only one position (like 123789). In the first liste i have around 20000 row (meaning 20000 gene name to compare to) and for the second around 4500 row (4500 miRNA). I want to compare the position of each individual miRNA position ( genestart<=miRNA<=geneend ) to the entire list of gene in order to get in a new table the name of the miRNA (first colum of the miRNA list) and the name of the gene (first colum of the gene list) related to the miRNA. Hope thisis not to much to ask. Papy -- View this message in context: http://r.789695.n4.nabble.com/R-beginner-tp4704684.html Sent from the datatable-help mailing list archive at Nabble.com. From jholtman at gmail.com Sun Mar 15 22:16:49 2015 From: jholtman at gmail.com (jim holtman) Date: Sun, 15 Mar 2015 17:16:49 -0400 Subject: [datatable-help] R beginner In-Reply-To: <1426448493840-4704684.post@n4.nabble.com> References: <1426448493840-4704684.post@n4.nabble.com> Message-ID: You didn't provide any test data, so I made some up with the sizes you gave. This uses the 'sqldf' package and took about 2 minutes to come up with the matches. > n <- 200000 > mi <- 4500 > start <- sample(n * 10, n) # start times > int <- sample(1000, n, TRUE) # interval between start and end > genes <- data.frame(gene = paste0('gene', 1:n) + , start = start + , end = start + int + , stringsAsFactors = FALSE + ) > miRNA <- data.frame(name = paste0('mi', 1:mi) + , pos = sample(n * 9, mi) + , stringsAsFactors = FALSE + ) > require(sqldf) Loading required package: sqldf Loading required package: gsubfn Loading required package: proto Loading required package: RSQLite Loading required package: DBI > matches <- sqldf(" + select m.*, g.* + from miRNA as m + join genes as g + on m.pos between g.start and g.end + ") Loading required package: tcltk > > str(matches) 'data.frame': 225045 obs. of 5 variables: $ name : chr "mi1" "mi1" "mi1" "mi1" ... $ pos : int 279341 279341 279341 279341 279341 279341 279341 279341 279341 279341 ... $ gene : chr "gene3133" "gene14326" "gene14997" "gene17652" ... $ start: int 279000 278623 279157 279296 278379 279055 279180 279273 278938 278960 ... $ end : int 279924 279444 280150 279930 279347 279861 279782 280268 279791 279796 ... > head(matches) name pos gene start end 1 mi1 279341 gene3133 279000 279924 2 mi1 279341 gene14326 278623 279444 3 mi1 279341 gene14997 279157 280150 4 mi1 279341 gene17652 279296 279930 5 mi1 279341 gene21208 278379 279347 6 mi1 279341 gene30889 279055 279861 Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Sun, Mar 15, 2015 at 3:41 PM, Papysounours < Cyrille.laurent.sage at gmail.com> wrote: > Hi > > I am just starting R programming because i need it to analyse new > sequencing > data. I got two list of data (excel table) one is gene list with > chromosomal > position (like start:123456 end:124567), the other is miRNA list with only > one position (like 123789). > In the first liste i have around 20000 row (meaning 20000 gene name to > compare to) and for the second around 4500 row (4500 miRNA). > I want to compare the position of each individual miRNA position ( > genestart<=miRNA<=geneend ) to the entire list of gene in order to get in a > new table the name of the miRNA (first colum of the miRNA list) and the > name > of the gene (first colum of the gene list) related to the miRNA. 
> Hope thisis not to much to ask. > Papy > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/R-beginner-tp4704684.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jholtman at gmail.com Sun Mar 15 22:41:07 2015 From: jholtman at gmail.com (jim holtman) Date: Sun, 15 Mar 2015 17:41:07 -0400 Subject: [datatable-help] R beginner In-Reply-To: <1426448493840-4704684.post@n4.nabble.com> References: <1426448493840-4704684.post@n4.nabble.com> Message-ID: I was off by a factor of 10; I thought it said 200,000 but it was only 20,000 so it only takes 10 seconds to solves > n <- 20000 > mi <- 4500 > start <- sample(n * 10, n) # start times > int <- sample(1000, n, TRUE) # interval between start and end > genes <- data.frame(gene = paste0('gene', 1:n) + , start = start + , end = start + int + , stringsAsFactors = FALSE + ) > miRNA <- data.frame(name = paste0('mi', 1:mi) + , pos = sample(n * 9, mi) + , stringsAsFactors = FALSE + ) > require(sqldf) > > system.time({ + matches <- sqldf(" + select m.*, g.* + from miRNA as m + join genes as g + on m.pos between g.start and g.end + ") + }) user system elapsed 10.91 0.02 10.96 > head(matches, 10) name pos gene start end 1 mi1 3825 gene200 3634 4134 2 mi1 3825 gene385 3616 4241 3 mi1 3825 gene410 3492 4089 4 mi1 3825 gene1172 3707 3847 5 mi1 3825 gene1228 3825 3919 6 mi1 3825 gene1726 3586 4552 7 mi1 3825 gene1859 3633 4163 8 mi1 3825 gene1869 3269 4138 9 mi1 3825 gene2061 3812 4094 10 mi1 3825 gene2248 3225 3939 > str(matches) 'data.frame': 224028 obs. of 5 variables: $ name : chr "mi1" "mi1" "mi1" "mi1" ... $ pos : int 3825 3825 3825 3825 3825 3825 3825 3825 3825 3825 ... $ gene : chr "gene200" "gene385" "gene410" "gene1172" ... $ start: int 3634 3616 3492 3707 3825 3586 3633 3269 3812 3225 ... $ end : int 4134 4241 4089 3847 3919 4552 4163 4138 4094 3939 ... Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Sun, Mar 15, 2015 at 3:41 PM, Papysounours < Cyrille.laurent.sage at gmail.com> wrote: > Hi > > I am just starting R programming because i need it to analyse new > sequencing > data. I got two list of data (excel table) one is gene list with > chromosomal > position (like start:123456 end:124567), the other is miRNA list with only > one position (like 123789). > In the first liste i have around 20000 row (meaning 20000 gene name to > compare to) and for the second around 4500 row (4500 miRNA). > I want to compare the position of each individual miRNA position ( > genestart<=miRNA<=geneend ) to the entire list of gene in order to get in a > new table the name of the miRNA (first colum of the miRNA list) and the > name > of the gene (first colum of the gene list) related to the miRNA. > Hope thisis not to much to ask. > Papy > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/R-beginner-tp4704684.html > Sent from the datatable-help mailing list archive at Nabble.com. 
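For reference, the same interval join can be written with data.table's foverlaps(), which Arun suggests later in this thread. A sketch reusing the simulated genes and miRNA tables built above, treating each miRNA position as a zero-width interval:

library(data.table)
genes <- as.data.table(genes)            # the simulated tables from the example above
miRNA <- as.data.table(miRNA)
miRNA[, `:=`(start = pos, end = pos)]    # a single position becomes a [pos, pos] interval
setkey(genes, start, end)                # foverlaps() needs the interval key on the second table
matches <- foverlaps(miRNA, genes, type = "within", nomatch = 0L)

With type = "within", a row is kept whenever genes$start <= pos and pos <= genes$end, i.e. the same condition as the sqldf BETWEEN join above.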
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Sun Mar 15 23:59:03 2015 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Sun, 15 Mar 2015 18:59:03 -0400 Subject: [datatable-help] multiple data.table calls In-Reply-To: References: Message-ID: Thanks for the suggestion! Using a data.table of all possible meeting dates & branches and joining it to the employment history didn't occur to me. Unfortunately, even after tinkering with it for a bit, the join (even though it's a temporary structure) isn't feasible due to memory usage--meet.stratton and employ.hist joined produce a table of billions of rows. So I guess I was wrong about memory not being an issue! A note about terminology, because I wasn't very clear: I define a Stratton 'alum' as someone that actually at Stratton-Oakmont; I don't have a term for the brokers that Stratton alums later meet, even though they're the ones I need to find. In case someone stumbles across this later: The meet.stratton table of possible dates and branches is specified as (and again, I hope the formatting comes through): meet.stratton <- unique(employ.hist[icrdn %in% stratton.people, list(branch, date = as.Date(fromdate:todate)), by = "job.index"], by = c("branch", "date")) The unique() call is important to get right. Obviously, Frank didn't have the opportunity to experiment with the data (it's too big to pass around, and it's built from proprietary data). Also, I use the branch rather than the whole firm, as it's not so clear that just working at the same firm is meaningful--many broker-dealers have branches all over the country. It's also probably easier to drop Stratton people from the final results explicitly, doing something like: met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)] I'm thinking about cooking something up using foverlaps(), although I'll need to learn its ins and outs first. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu https://sites.google.com/site/npgraham1/ On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson wrote: > I'd suggest: > > (1) Get a table identifying the condition "meeting someone who at some > point works at Stratton." (They aren't really "alums" if they haven't > worked there yet, but this is the definition you seem to be looking for.) > You can do this by looking at any (firm,date) combinations that involve > bumping into such a person: > > meet.stratton <- unique(employ.hist[icrdn %in% > stratton.people,list(fcrdn,date=fromdate:todate)]) > > (2) Find people who meet the conditions: > > setkey(employ.hist,fcrdn) > met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate & > date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)] > > (3) If you want to exclude Stratton folks, then use setdiff() > > --Frank > > On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham > wrote: > >> There's particular problem I often have, and I'm hoping someone can tell >> me how to speed it up in data.table. It seems to involve a sort of >> recursion that data.table (as I'm using it) doesn't do well with, where for >> each record in a set, I do a another search within the same table. I hope >> the formatting of the code below is legible--it's a lot easier to read in >> the RStudio text editor! 
>> >> I have a moderately large (more than 3 million rows) data.table of the >> employment histories of brokers in the US. Each row is an employment >> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a >> branch identifier (branch), start and end dates (fromdate and todate), and >> a few other items (each row has a unique id as well, called job.index). >> For example, finding all the brokers that ever worked at Stratton Oakmont >> (from the Wolf of Wall Street): >> >> employ.hist[fcrdn == 18692, icrdn] >> >> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is >> the individual identifier. >> >> What I want is to find all the individuals that ever met a Stratton >> alum. Specifically, every icrdn such that the branch == a branch a >> Stratton alum ever worked at and the start and end dates overlap. The only >> way I've found to do so involves something like this: >> >> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) { >> employ.hist[fromdate <= sdt & todate >= edt & branch == brnch, >> list(icrdn, branch, job.index, fcrdn)] >> }) >> >> stratton.people <- employ.hist[fcrdn == 18692, icrdn] >> stratton.contacts <- employ.hist[icrdn %in% stratton.people, >> find_brokers_by_single_branch(fromdate, >> todate, branch), >> by = "job.index"] >> >> This works, but effectively means calling the data.table '[' function >> thousands of times, once for each job entry >> a Stratton broker ever had (which are in the thousands, as many left >> before the government busted the place >> and are still in the industry). It's quite slow, and I'm hoping someone >> can show me a way to speed it up, as I have >> many similar tasks, some of which are vastly larger. Memory really isn't >> an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 3.4GHz), >> in case that helps. >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> https://sites.google.com/site/npgraham1/ >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fperickson at wisc.edu Mon Mar 16 14:13:21 2015 From: fperickson at wisc.edu (Frank Erickson) Date: Mon, 16 Mar 2015 09:13:21 -0400 Subject: [datatable-help] multiple data.table calls In-Reply-To: References: Message-ID: Oh. In that case, I'd suggest checking only using the month and year, not the day. You'll get some false positives, but the data should be small enough to merge, I guess. It depends on your application whether that's tolerable. --Frank On Sun, Mar 15, 2015 at 6:59 PM, Nathaniel Graham wrote: > Thanks for the suggestion! Using a data.table of all possible meeting > dates & branches and joining it to the employment history didn't occur to > me. Unfortunately, even after tinkering with it for a bit, the join (even > though it's a temporary structure) isn't feasible due to memory > usage--meet.stratton and employ.hist joined produce a table of billions of > rows. So I guess I was wrong about memory not being an issue! > > A note about terminology, because I wasn't very clear: I define a Stratton > 'alum' as someone that actually at Stratton-Oakmont; I don't have a term > for the brokers that Stratton alums later meet, even though they're the > ones I need to find. 
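One more route for the date-overlap search, sketched with the column names used in this thread but untested against the real (proprietary) data: build the alums' branch/date intervals once and let foverlaps() do the interval join, as Nathaniel mentions above. The names alum.jobs and contacts are made up here, and todate is assumed to contain no NAs (open-ended jobs would need a fill-in end date first).

stratton.people <- employ.hist[fcrdn == 18692, unique(icrdn)]
alum.jobs <- employ.hist[icrdn %in% stratton.people,
                         list(branch, fromdate, todate)]
setkey(alum.jobs, branch, fromdate, todate)   # the last two key columns form the interval
contacts <- foverlaps(employ.hist, alum.jobs,
                      by.x = c("branch", "fromdate", "todate"),
                      type = "any", nomatch = 0L)
met.stratton.people <- setdiff(unique(contacts$icrdn), stratton.people)

This does the whole search in one vectorized join instead of one [.data.table call per Stratton job record.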
> > In case someone stumbles across this later: > > The meet.stratton table of possible dates and branches is specified as > (and again, I hope the formatting comes through): > > meet.stratton <- unique(employ.hist[icrdn %in% stratton.people, > list(branch, date = > as.Date(fromdate:todate)), > by = "job.index"], > by = c("branch", "date")) > > The unique() call is important to get right. Obviously, Frank didn't have > the opportunity to experiment with the data (it's too big to pass around, > and it's built from proprietary data). Also, I use the branch rather than > the whole firm, as it's not so clear that just working at the same firm is > meaningful--many broker-dealers have branches all over the country. It's > also probably easier to drop Stratton people from the final results > explicitly, doing something like: > > met.stratton.people <- met.stratton.people[!(icrdn %in% stratton.people)] > > I'm thinking about cooking something up using foverlaps(), although I'll > need to learn its ins and outs first. > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > https://sites.google.com/site/npgraham1/ > > On Sun, Mar 15, 2015 at 12:16 PM, Frank Erickson > wrote: > >> I'd suggest: >> >> (1) Get a table identifying the condition "meeting someone who at some >> point works at Stratton." (They aren't really "alums" if they haven't >> worked there yet, but this is the definition you seem to be looking for.) >> You can do this by looking at any (firm,date) combinations that involve >> bumping into such a person: >> >> meet.stratton <- unique(employ.hist[icrdn %in% >> stratton.people,list(fcrdn,date=fromdate:todate)]) >> >> (2) Find people who meet the conditions: >> >> setkey(employ.hist,fcrdn) >> met.stratton.people <- employ.hist[meet.stratton,any(date>= startdate & >> date <= todate),by="icrdn,fcrdn"][V1==TRUE,unique(icrdn)] >> >> (3) If you want to exclude Stratton folks, then use setdiff() >> >> --Frank >> >> On Sat, Mar 14, 2015 at 4:19 PM, Nathaniel Graham >> wrote: >> >>> There's particular problem I often have, and I'm hoping someone can tell >>> me how to speed it up in data.table. It seems to involve a sort of >>> recursion that data.table (as I'm using it) doesn't do well with, where for >>> each record in a set, I do a another search within the same table. I hope >>> the formatting of the code below is legible--it's a lot easier to read in >>> the RStudio text editor! >>> >>> I have a moderately large (more than 3 million rows) data.table of the >>> employment histories of brokers in the US. Each row is an employment >>> record, with a unique individual id (icrdn), a unique firm id (fcrdn), a >>> branch identifier (branch), start and end dates (fromdate and todate), and >>> a few other items (each row has a unique id as well, called job.index). >>> For example, finding all the brokers that ever worked at Stratton Oakmont >>> (from the Wolf of Wall Street): >>> >>> employ.hist[fcrdn == 18692, icrdn] >>> >>> where fcrdn is the firm identifier, 18692 is Stratton's ID, and icrdn is >>> the individual identifier. >>> >>> What I want is to find all the individuals that ever met a Stratton >>> alum. Specifically, every icrdn such that the branch == a branch a >>> Stratton alum ever worked at and the start and end dates overlap. 
The only >>> way I've found to do so involves something like this: >>> >>> find_brokers_by_single_branch <- cmpfun(function(sdt, edt, brnch) { >>> employ.hist[fromdate <= sdt & todate >= edt & branch == brnch, >>> list(icrdn, branch, job.index, fcrdn)] >>> }) >>> >>> stratton.people <- employ.hist[fcrdn == 18692, icrdn] >>> stratton.contacts <- employ.hist[icrdn %in% stratton.people, >>> find_brokers_by_single_branch(fromdate, >>> todate, branch), >>> by = "job.index"] >>> >>> This works, but effectively means calling the data.table '[' function >>> thousands of times, once for each job entry >>> a Stratton broker ever had (which are in the thousands, as many left >>> before the government busted the place >>> and are still in the industry). It's quite slow, and I'm hoping someone >>> can show me a way to speed it up, as I have >>> many similar tasks, some of which are vastly larger. Memory really >>> isn't an issue for me (32 GB) and CPU shouldn't be either (Intel i7-4770 >>> 3.4GHz), in case that helps. >>> >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com >>> npgraham1 at uky.edu >>> https://sites.google.com/site/npgraham1/ >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Mar 16 18:49:15 2015 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 16 Mar 2015 18:49:15 +0100 Subject: [datatable-help] R beginner In-Reply-To: References: <1426448493840-4704684.post@n4.nabble.com> Message-ID: Cyrile, See `?foverlaps` function from data.table package or `?findOverlaps` from GenomicRanges package. These implement algorithms specifically designed for operating on interval ranges efficiently. --? Arun On 15 Mar 2015 at 22:41:18, jim holtman (jholtman at gmail.com) wrote: I was off by a factor of 10; I thought it said 200,000 but it was only 20,000 so it only takes 10 seconds to solves > n <- 20000 > mi <- 4500 > start <- sample(n * 10, n)? # start times > int <- sample(1000, n, TRUE)? # interval between start and end > genes <- data.frame(gene = paste0('gene', 1:n) +???????????????? , start = start +???????????????? , end = start + int +???????????????? , stringsAsFactors = FALSE +???????????????? ) > miRNA <- data.frame(name = paste0('mi', 1:mi) +???????????????? , pos = sample(n * 9, mi) +???????????????? , stringsAsFactors = FALSE +???????????????? ) > require(sqldf) > > system.time({ + matches <- sqldf(" +???? select m.*, g.* +???? from miRNA as m +???? join genes as g +???????? on m.pos between g.start and g.end + ") + })??????? ?? user? system elapsed ? 10.91??? 0.02?? 10.96 > head(matches, 10) ?? name? pos???? gene start? end 1?? mi1 3825? gene200? 3634 4134 2?? mi1 3825? gene385? 3616 4241 3?? mi1 3825? gene410? 3492 4089 4?? mi1 3825 gene1172? 3707 3847 5?? mi1 3825 gene1228? 3825 3919 6?? mi1 3825 gene1726? 3586 4552 7?? mi1 3825 gene1859? 3633 4163 8?? mi1 3825 gene1869? 3269 4138 9?? mi1 3825 gene2061? 3812 4094 10? mi1 3825 gene2248? 3225 3939 > str(matches) 'data.frame':?? 224028 obs. of? 5 variables: ?$ name : chr? "mi1" "mi1" "mi1" "mi1" ... ?$ pos? : int? 
3825 3825 3825 3825 3825 3825 3825 3825 3825 3825 ... ?$ gene : chr? "gene200" "gene385" "gene410" "gene1172" ... ?$ start: int? 3634 3616 3492 3707 3825 3586 3633 3269 3812 3225 ... ?$ end? : int? 4134 4241 4089 3847 3919 4552 4163 4138 4094 3939 ... Jim Holtman Data Munger Guru ? What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Sun, Mar 15, 2015 at 3:41 PM, Papysounours wrote: Hi I am just starting R programming because i need it to analyse new sequencing data. I got two list of data (excel table) one is gene list with chromosomal position (like start:123456 end:124567), the other is miRNA list with only one position (like 123789). ?In the first liste i have around 20000 row (meaning 20000 gene name to compare to) and for the second around 4500 row (4500 miRNA). I want to compare the position of each individual miRNA position ( genestart<=miRNA<=geneend ) to the entire list of gene in order to get in a new table the name of the miRNA (first colum of the miRNA list) and the name of the gene? (first colum of the gene list) related to the miRNA. Hope thisis not to much to ask. Papy -- View this message in context: http://r.789695.n4.nabble.com/R-beginner-tp4704684.html Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Mar 16 18:54:25 2015 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 16 Mar 2015 18:54:25 +0100 Subject: [datatable-help] NA handling not working? In-Reply-To: References: Message-ID: What?s the issue here? It seems to have taken ~4 seconds IIUC. The problem seems that your file has a ??? at the line denoted, which results in having to coerce all the lines read previously to character type first. Handling ?na.strings? is on the list -?https://github.com/Rdatatable/data.table/issues/504?but I don?t get as to why it?s choking.. 4 seconds isn?t a lot, really. --? Arun On 10 Mar 2015 at 15:13:20, Rivo R (rivokl at gmail.com) wrote: Hi all, I tried to load the following (huge) dataset as data.table but fread seems to choke on NA's. https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip Steps: 1- Dowload and unzip 2- > packageVersion("data.table") [1] ?1.9.4? 3- > tmp <- fread(dataFile, sep=';', header=TRUE, na.strings=c("NA","'?'", ""), + stringsAsFactors=FALSE, + colClasses=c(rep("character",2), rep("numeric",7)), verbose=TRUE) Input contains no \n. Taking this to be a filename to open File opened, filesize is 0.121897 GB. Memory mapping ... ok Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Looking for supplied sep ';' on line 30 (the last non blank line in the first 'autostart') ... 
found ok
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 2075260
Subtracted 1 for last eol and any trailing empty lines, leaving 2075259 data rows
Type codes ( first 5 rows): 443333333
Type codes (+ middle 5 rows): 443333333
Type codes (+ last 5 rows): 443333333
Type codes: 443333333 (after applying colClasses and integer64)
Type codes: 443333333 (after applying drop or select (if supplied)
Allocating 9 column slots (9 - 0 dropped)
Bumping column 3 from REAL to STR on data row 6840, field contains '?'
Bumping column 4 from REAL to STR on data row 6840, field contains '?'
Bumping column 5 from REAL to STR on data row 6840, field contains '?'
Bumping column 6 from REAL to STR on data row 6840, field contains '?'
Bumping column 7 from REAL to STR on data row 6840, field contains '?'
Bumping column 8 from REAL to STR on data row 6840, field contains '?'
Read 2075259 rows and 9 (of 9) columns from 0.122 GB file in 00:00:04
0.000s ( 0%) Memory map (rerun may be quicker)
0.001s ( 0%) sep and header detection
0.282s ( 7%) Count rows (wc -l)
0.002s ( 0%) Column type detection (first, middle and last 5 rows)
0.627s ( 16%) Allocation of 2075259x9 result (xMB) in RAM
2.525s ( 64%) Reading data
0.298s ( 8%) Allocation for type bumps (if any), including gc time if triggered
0.123s ( 3%) Coercing data already read in type bumps (if any)
0.059s ( 2%) Changing na.strings to NA
3.917s Total
Any hint??
Kely
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Mon Mar 16 18:55:01 2015
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 16 Mar 2015 18:55:01 +0100
Subject: [datatable-help] error in rbindlist
In-Reply-To: 
References: 
Message-ID: 

Without an MRE my best guess is that you've an NA somewhere.

--
Arun

On 9 Mar 2015 at 19:22:15, Farrel Buchinsky (fjbuch at gmail.com) wrote:

I used a series of statements such as this

l = list(specs.post.crash, specimens[,1:26, with=FALSE])

to work out which set of columns worked and which did not.

I found the offending column

kbdtimespecim is a column from specimens. It is the 27th column. With it included the whole rbindlist collapses. Without it, it works.

"15:04:48" "15:02:52" "15:06:44" "16:07:22" "16:07:22" "16:07:22"

> class(specimens$kbdtimespecim)
[1] "ITime"

Any idea why?

Farrel Buchinsky
Google Voice Tel: (412) 567-7870

On Mon, Mar 9, 2015 at 2:08 PM, Farrel Buchinsky wrote:

I am doing something really simple but getting an error I do not know how to troubleshoot.

l = list(specimens, specs.post.crash)
rbindlist(l, fill = TRUE)

specimens has about 5 columns and specs.post.crash has about 29.

I get this error. I cannot find anyone else with this problem. I cannot share my actual data.table since it has private information.

Error in if (any(neg)) res[neg] = paste("-", res[neg], sep = "") :
  missing value where TRUE/FALSE needed

Interested in the off-the-top-of-your-head suggestions of where to troubleshoot.
Using {R package version 1.9.5}

Farrel Buchinsky
Google Voice Tel: (412) 567-7870
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Mon Mar 16 18:56:26 2015
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 16 Mar 2015 18:56:26 +0100
Subject: [datatable-help] multiply columns in two vectors
In-Reply-To: <1DD7F9F3A5AE9845846F5B44C395CDF1ADC0D02F@P1KITMBX03WC02.unicph.domain>
References: <1DD7F9F3A5AE9845846F5B44C395CDF1ADC0D02F@P1KITMBX03WC02.unicph.domain>
Message-ID: 

Hi Ingeborg,
I think you're looking for the r-help mailing list. This mailing list is for the data.table R package.

--
Arun

On 11 Mar 2015 at 10:20:52, Ingeborg Callesen (ica at ign.ku.dk) wrote:

Dear r-helpers out there

Is there a function or a code string for multiplying the numbers in e.g. a n x m vector by another n x m vector ? The result should be a n x m vector with the product of the two numbers in each vector on the same position in the new vector.

Thanks for any help in advance.

BR Ingeborg Callesen
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fjbuch at gmail.com  Sat Mar 21 21:22:49 2015
From: fjbuch at gmail.com (Farrel Buchinsky)
Date: Sat, 21 Mar 2015 16:22:49 -0400
Subject: [datatable-help] error in rbindlist
In-Reply-To: 
References: 
Message-ID: 

Dear Arun

What is "MRE"?
There was no NA
> anyNA(specimens$kbdtimespecim)
[1] FALSE

Farrel Buchinsky
Google Voice Tel: (412) 567-7870

On Mon, Mar 16, 2015 at 1:55 PM, Arunkumar Srinivasan wrote:

> Without an MRE my best guess is that you've an NA somewhere.
>
> --
> Arun
>
> On 9 Mar 2015 at 19:22:15, Farrel Buchinsky (fjbuch at gmail.com) wrote:
>
> I used a series of statements such as this
>
> l = list(specs.post.crash, specimens[,1:26, with=FALSE])
>
> to work out which set of columns worked and which did not.
>
> I found the offending column
>
> kbdtimespecim is a column from specimens. It is the 27th column. With it
> included the whole rbindlist collapses. Without it, it works.
>
> "15:04:48" "15:02:52" "15:06:44" "16:07:22" "16:07:22" "16:07:22"
>
> > class(specimens$kbdtimespecim)
> [1] "ITime"
>
> Any idea why?
>
> Farrel Buchinsky
> Google Voice Tel: (412) 567-7870
>
> On Mon, Mar 9, 2015 at 2:08 PM, Farrel Buchinsky wrote:
>
>> I am doing something really simple but getting an error I do not know how
>> to troubleshoot.
>>
>> l = list(specimens, specs.post.crash)
>> rbindlist(l, fill = TRUE)
>>
>> specimens has about 5 columns and specs.post.crash has about 29.
>>
>> I get this error. I cannot find anyone else with this problem. I cannot
>> share my actual data.table since it has private information.
>>
>> Error in if (any(neg)) res[neg] = paste("-", res[neg], sep = "") :
>> missing value where TRUE/FALSE needed
>>
>> Interested in the off-the-top-of-your-head suggestions of where to
>> troubleshoot.
>> Using {R package version 1.9.5} >> >> >> Farrel Buchinsky >> Google Voice Tel: (412) 567-7870 <%28412%29%20567-7870> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gsee000 at gmail.com Sat Mar 21 22:36:50 2015 From: gsee000 at gmail.com (G See) Date: Sat, 21 Mar 2015 16:36:50 -0500 Subject: [datatable-help] error in rbindlist In-Reply-To: References: Message-ID: Hi Farrel, MRE = Minimal Reproducible Example. It would go a long way in helping to diagnose the issue. I can generate the error you're getting by calling as.ITime() on a string that doesn't look like a time. > as.ITime("") Error in if (any(neg)) res[neg] = paste("-", res[neg], sep = "") : missing value where TRUE/FALSE needed I'd start by looking to for values in that column that don't look like the rest. For example specimens[grep("\\d+:\\d+:\\d+", kbdtimespecim, invert=TRUE)] That might help you narrow it down enough to be able to create something reproducible with a small set of non-sensitive data. HTH, Garrett On Sat, Mar 21, 2015 at 3:22 PM, Farrel Buchinsky wrote: > Dear Arun > > What is "MRE"? > There was no NA >> anyNA(specimens$kbdtimespecim) > [1] FALSE > > Farrel Buchinsky > Google Voice Tel: (412) 567-7870 > > On Mon, Mar 16, 2015 at 1:55 PM, Arunkumar Srinivasan > wrote: >> >> Without an MRE my best guess is that you?ve a NA somewhere. >> >> -- >> Arun >> >> On 9 Mar 2015 at 19:22:15, Farrel Buchinsky (fjbuch at gmail.com) wrote: >> >> I used a series of statements such as this >> >> l = list(specs.post.crash, specimens[,1:26, with=FALSE]) >> >> to work out which set of columns worked and which did not. >> >> I found the offending column >> >> kbdtimespecim is a column from sepcimens. It is the 27 th column. With it >> included the whole rbindlist collapses. Without it, it works. >> >> "15:04:48" "15:02:52" "15:06:44" "16:07:22" "16:07:22" "16:07:22" >> >> > class(specimens$kbdtimespecim) >> [1] "ITime" >> >> Any idea why? >> >> Farrel Buchinsky >> Google Voice Tel: (412) 567-7870 >> >> On Mon, Mar 9, 2015 at 2:08 PM, Farrel Buchinsky wrote: >>> >>> I am doing something really simple but getting an error I do not know how >>> to troubleshoot. >>> >>> l = list(specimens, specs.post.crash) >>> rbindlist(l, fill = TRUE) >>> >>> specimens has about 5 columns and specs.post.crash has about 29. >>> >>> I get this error. I cannot find anyone else with this problem. I cannot >>> share my actual data.table since it has private information. >>> >>> Error in if (any(neg)) res[neg] = paste("-", res[neg], sep = "") : >>> missing value where TRUE/FALSE needed >>> >>> Interested in the off-the-top-of-your-head suggestions of where to >>> troubleshoot. 
>>> Using {R package version 1.9.5}
>>>
>>>
>>> Farrel Buchinsky
>>> Google Voice Tel: (412) 567-7870
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From gerald.jean at dgag.ca  Mon Mar 23 14:49:14 2015
From: gerald.jean at dgag.ca (Gerald Jean)
Date: Mon, 23 Mar 2015 13:49:14 +0000
Subject: [datatable-help] Best way to export Huge data sets.
Message-ID: <7889EDA06EB6454D92349FFF17BF790F3C202676@PWPRIMX72.mvt.desjardins.com>

Hello,

I am currently on a project where I have to read, process and aggregate 10 to 12 million files, for roughly 10 billion lines of data.

The files are arranged in roughly 64000 directories; each directory is one client's data.

I have written code importing and "massaging" the data per directory. The code is data.table driven. I am running this on a 24-core machine with 145 Gb of RAM on a Linux box under RedHat.

For testing purposes I have parallelized the code, using the doMC package; it runs fine and it seems to be fast. But I haven't tried to output the resulting files, three per client: a small one, a moderate-size one and a large one, over 500 Gb estimated.

My question:

What is the best way to output those files without creating bottlenecks?

I thought of breaking the list of input directories into 24 threads, supplying a list of lists to "foreach" where one of the components of each sub-list would be the name of the output files, but I am worried that "write.table" would take forever to write this data to disk. One solution would be to use "save" and keep the output data in Rdata format, but that complicates further analysis by other software.

Any suggestions?

By the way, "data.table" sure helped so far in processing that data; thanks to the developers for such an efficient package,

Gérald

[cid:image001.gif at 01D0654C.B37DA460]

Gerald Jean, M. Sc. en statistiques
Conseiller senior en statistiques
Actuariat corporatif, Modélisation et Recherche
Assurance de dommages
Mouvement Desjardins

Lévis (siège social)
418 835-4900, poste 5527639
1 877 835-4900, poste 5527639
Télécopieur : 418 835-6657

Faites bonne impression et imprimez seulement au besoin!

Ce courriel est confidentiel, peut être protégé par le secret professionnel et est adressé exclusivement au destinataire. Il est strictement interdit à toute autre personne de diffuser, distribuer ou reproduire ce message. Si vous l'avez reçu par erreur, veuillez immédiatement le détruire et aviser l'expéditeur. Merci.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 6632 bytes
Desc: image001.gif
URL: 

From niparisco at gmail.com  Mon Mar 23 22:50:33 2015
From: niparisco at gmail.com (Nicolas Paris)
Date: Mon, 23 Mar 2015 22:50:33 +0100
Subject: [datatable-help] Best way to export Huge data sets.
In-Reply-To: <7889EDA06EB6454D92349FFF17BF790F3C202676@PWPRIMX72.mvt.desjardins.com> References: <7889EDA06EB6454D92349FFF17BF790F3C202676@PWPRIMX72.mvt.desjardins.com> Message-ID: Some test you can do without many code change : 1) transform your data.table as matrix before write 2) use write table + this config to save place& time ->sep = Pipe (1byte& rarely used) ->disable quote (saves "" more than bilion times) ->latin1 instead of utf-8 3) use chunks (say cut in slice output) and append=T (this may work in parallel) If still too long, try installing some database (sqlite) on your 24 core system, and try load it hope this helps 2015-03-23 14:49 GMT+01:00 Gerald Jean : > Hello, > > > > I am currently on a project where I have to read, process, aggregate 10 to > 12 millions of files for roughly 10 billions lines of data. > > > > The files are arranged in roughly 64000 directories, each directory is one > client?s data. > > > > I have written code importing and ?massaging? the data per directory. The > code is data.table driven. I am running this on a 24 cores machine with > 145 Gb of RAM on a Linux box under RedHat. > > > > For testing purpose I have parallelized the code, using the doMC package, > runs fine and it seems to be fast. But I haven?t tried to output the > resulting files, three per client. A small one, a moderate size one and a > large one, over 500Gb estimated. > > > > My question: > > > > what is the best way to output those files without creating bottlenecks?? > > > > I thought of breaking the list of input directories into 24 threads, > supplying a list of lists to ?foreach? where one of the components of each > sub-list would be the name of the output files but I am worried that > ?write.table? would take for ever to write this data to disk, one solution > would be to use ?save? and keep the output data in Rdata format, but that > complicates further analysis by other software. > > > > Any suggestions??? > > > > By the way ?data.table? sure helped so far in processing that data, thanks > to the developpers for such an efficient package, > > > > G?rald > > > > *Gerald Jean, M. Sc. en statistiques* > Conseiller senior en statistiques > > Actuariat corporatif, > Mod?lisation et Recherche > Assurance de dommages > Mouvement Desjardins > > > L?vis (si?ge social) > > 418 835-4900, > > poste 5527639 > 1 877 835-4900, > > poste 5527639 > T?l?copieur : 418 835-6657 > > > > > > > > Faites bonne impression et imprimez seulement au besoin! > > Ce courriel est confidentiel, peut ?tre prot?g? par le secret > professionnel et est adress? exclusivement au destinataire. Il est > strictement interdit ? toute autre personne de diffuser, distribuer ou > reproduire ce message. Si vous l'avez re?u par erreur, veuillez > imm?diatement le d?truire et aviser l'exp?diteur. Merci. > > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 6632 bytes Desc: not available URL: From gerald.jean at dgag.ca Tue Mar 24 13:03:21 2015 From: gerald.jean at dgag.ca (Gerald Jean) Date: Tue, 24 Mar 2015 12:03:21 +0000 Subject: [datatable-help] Best way to export Huge data sets. 
In-Reply-To: References: <7889EDA06EB6454D92349FFF17BF790F3C202676@PWPRIMX72.mvt.desjardins.com> Message-ID: <7889EDA06EB6454D92349FFF17BF790F45504866@PWPRIMX72.mvt.desjardins.com> Hello Nicolas, thanks for your suggestions. The data.table can?t be transformed in a matrix as it is of mixed types : POSIX.ct columns, character, logical, factor and numeric columns. Admin is currently installing PostgreSQL on the server, I?ll try to go that route. Too bad data.table doesn?t have, yet, a writing routine as fast as ?fread? is for reading!!! Thanks, G?rald [cid:image001.gif at 01D06608.FEC7C1A0] Gerald Jean, M. Sc. en statistiques Conseiller senior en statistiques Actuariat corporatif, Mod?lisation et Recherche Assurance de dommages Mouvement Desjardins L?vis (si?ge social) 418 835-4900, poste 5527639 1 877 835-4900, poste 5527639 T?l?copieur : 418 835-6657 Faites bonne impression et imprimez seulement au besoin! Ce courriel est confidentiel, peut ?tre prot?g? par le secret professionnel et est adress? exclusivement au destinataire. Il est strictement interdit ? toute autre personne de diffuser, distribuer ou reproduire ce message. Si vous l'avez re?u par erreur, veuillez imm?diatement le d?truire et aviser l'exp?diteur. Merci. De : Nicolas Paris [mailto:niparisco at gmail.com] Envoy? : 23 mars 2015 17:51 ? : Gerald Jean Cc : datatable-help at lists.r-forge.r-project.org Objet : Re: [datatable-help] Best way to export Huge data sets. Some test you can do without many code change : 1) transform your data.table as matrix before write 2) use write table + this config to save place& time ->sep = Pipe (1byte& rarely used) ->disable quote (saves "" more than bilion times) ->latin1 instead of utf-8 3) use chunks (say cut in slice output) and append=T (this may work in parallel) If still too long, try installing some database (sqlite) on your 24 core system, and try load it hope this helps 2015-03-23 14:49 GMT+01:00 Gerald Jean >: Hello, I am currently on a project where I have to read, process, aggregate 10 to 12 millions of files for roughly 10 billions lines of data. The files are arranged in roughly 64000 directories, each directory is one client?s data. I have written code importing and ?massaging? the data per directory. The code is data.table driven. I am running this on a 24 cores machine with 145 Gb of RAM on a Linux box under RedHat. For testing purpose I have parallelized the code, using the doMC package, runs fine and it seems to be fast. But I haven?t tried to output the resulting files, three per client. A small one, a moderate size one and a large one, over 500Gb estimated. My question: what is the best way to output those files without creating bottlenecks?? I thought of breaking the list of input directories into 24 threads, supplying a list of lists to ?foreach? where one of the components of each sub-list would be the name of the output files but I am worried that ?write.table? would take for ever to write this data to disk, one solution would be to use ?save? and keep the output data in Rdata format, but that complicates further analysis by other software. Any suggestions??? By the way ?data.table? sure helped so far in processing that data, thanks to the developpers for such an efficient package, G?rald [cid:image001.gif at 01D06608.FEC7C1A0] Gerald Jean, M. Sc. 
en statistiques Conseiller senior en statistiques Actuariat corporatif, Mod?lisation et Recherche Assurance de dommages Mouvement Desjardins L?vis (si?ge social) 418 835-4900, poste 5527639 1 877 835-4900, poste 5527639 T?l?copieur : 418 835-6657 Faites bonne impression et imprimez seulement au besoin! Ce courriel est confidentiel, peut ?tre prot?g? par le secret professionnel et est adress? exclusivement au destinataire. Il est strictement interdit ? toute autre personne de diffuser, distribuer ou reproduire ce message. Si vous l'avez re?u par erreur, veuillez imm?diatement le d?truire et aviser l'exp?diteur. Merci. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 6632 bytes Desc: image001.gif URL: From niparisco at gmail.com Tue Mar 24 13:34:35 2015 From: niparisco at gmail.com (Nicolas Paris) Date: Tue, 24 Mar 2015 13:34:35 +0100 Subject: [datatable-help] Best way to export Huge data sets. In-Reply-To: <7889EDA06EB6454D92349FFF17BF790F45504866@PWPRIMX72.mvt.desjardins.com> References: <7889EDA06EB6454D92349FFF17BF790F3C202676@PWPRIMX72.mvt.desjardins.com> <7889EDA06EB6454D92349FFF17BF790F45504866@PWPRIMX72.mvt.desjardins.com> Message-ID: ? thanks for your suggestions. The data.table can?t be transformed in a matrix as it is of mixed types : POSIX.ct columns, character, logical, factor and numeric columns. ?What about casting all as character ? CSV does not make difference between types as quote is disabled in the config I proposed. About postgresql, I use it, and the faster way to load data is to use the COPY statement. I load 7GB of data in 5 min, but... COPY uses a csv as source. A "binary" file can be used too, but I have never tried. ?The package RSqlite could help too, Some use this instead of CSV writing. Never tried too. ? ? ?? 2015-03-24 13:03 GMT+01:00 Gerald Jean : > Hello Nicolas, > > ?? > > > ?? > thanks for your suggestions. The data.table can?t be transformed in a > matrix as it is of mixed types : POSIX.ct columns, character, logical, > factor and numeric columns. > > > > Admin is currently installing PostgreSQL on the server, I?ll try to go > that route. Too bad data.table doesn?t have, yet, a writing routine as > fast as ?fread? is for reading!!! > > > > Thanks, > > > > G?rald > > > > *Gerald Jean, M. Sc. en statistiques* > Conseiller senior en statistiques > > Actuariat corporatif, > Mod?lisation et Recherche > Assurance de dommages > Mouvement Desjardins > > > L?vis (si?ge social) > > 418 835-4900, > > poste 5527639 > 1 877 835-4900, > > poste 5527639 > T?l?copieur : 418 835-6657 > > > > > > > > Faites bonne impression et imprimez seulement au besoin! > > Ce courriel est confidentiel, peut ?tre prot?g? par le secret > professionnel et est adress? exclusivement au destinataire. Il est > strictement interdit ? toute autre personne de diffuser, distribuer ou > reproduire ce message. Si vous l'avez re?u par erreur, veuillez > imm?diatement le d?truire et aviser l'exp?diteur. Merci. > > > > > > *De :* Nicolas Paris [mailto:niparisco at gmail.com] > *Envoy? :* 23 mars 2015 17:51 > *? :* Gerald Jean > *Cc :* datatable-help at lists.r-forge.r-project.org > *Objet :* Re: [datatable-help] Best way to export Huge data sets. 
From gerald.jean at dgag.ca Tue Mar 24 13:56:48 2015
From: gerald.jean at dgag.ca (Gerald Jean)
Date: Tue, 24 Mar 2015 12:56:48 +0000
Subject: [datatable-help] Best way to export Huge data sets.
In-Reply-To:
References: <7889EDA06EB6454D92349FFF17BF790F3C202676@PWPRIMX72.mvt.desjardins.com> <7889EDA06EB6454D92349FFF17BF790F45504866@PWPRIMX72.mvt.desjardins.com>
Message-ID: <7889EDA06EB6454D92349FFF17BF790F45505886@PWPRIMX72.mvt.desjardins.com>

Thanks again Nicolas,

in my case, for the time being, the load process has no options!!! The
data is supplied to us by an outside firm; there is one directory per
device and one file per day of usage. The files are zipped, and there
are 10-12 million of them. I read them using "fread" this way:

fread(input = sprintf("zcat %s", x), ...)

It works fine, much faster than using read.table.

All I have left to do before running in parallel on the whole data
directories is to find a way to efficiently output the resulting
aggregated data sets.

I thought about SQLite, but was told that it puts a lock on the DB when
a process is writing to it, and apparently PostgreSQL handles that more
efficiently. But SQLite has the advantage of being embedded in RSQLite,
hence not requiring admin intervention; can't have everything it seems!!!

Thanks again and cheers,

Gérald
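A rough sketch of the one-output-file-per-worker route, which sidesteps
any database or file lock entirely; the directory layout, the
aggregate_client() function and the paths are placeholders, and only the
fread(zcat ...) call comes from the message above:

library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores = 24)

dirs <- list.dirs("/data/devices", recursive = FALSE)   # one directory per device

invisible(foreach(d = dirs, .packages = "data.table") %dopar% {
  files <- list.files(d, pattern = "\\.gz$", full.names = TRUE)
  dt <- rbindlist(lapply(files, function(x) fread(input = sprintf("zcat %s", x))))
  out <- aggregate_client(dt)      # hypothetical per-device aggregation step
  # each forked worker appends to a file named after its own PID, so no two
  # workers ever touch the same file
  out_file <- file.path("/data/out", sprintf("aggregated_%d.psv", Sys.getpid()))
  write.table(out, out_file, sep = "|", quote = FALSE, row.names = FALSE,
              col.names = !file.exists(out_file), append = TRUE)
  NULL
})

The per-PID files can be concatenated afterwards, or loaded as-is into
PostgreSQL with COPY, which keeps the writing step free of contention.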
From joaomlanna at gmail.com Tue Mar 24 18:22:51 2015
From: joaomlanna at gmail.com (LANNA)
Date: Tue, 24 Mar 2015 10:22:51 -0700 (PDT)
Subject: [datatable-help] Conditioning and excluding rows in a table
Message-ID: <1427217771614-4705044.post@n4.nabble.com>

I have a table with 600,000 rows and I want to exclude some rows before
making a graph.

R script:

rb <- jabot.dwco_rb                # "jabot.dwco_rb" is the CSV dataset
states <- table(rb$stateProvince)  # "stateProvince" is a column in the dataset
br_states <- table(rb$stateProvince[rb$countryCode == 'BR'])  # "countryCode" is a column in the dataset

At this point I'm doing something wrong, because the rows with
"countryCode" != 'BR' are not being excluded. They are still in the
table, but with counts of zero occurrences. How can I exclude those rows
so the graph shows only the selected rows?

Thanks

--
View this message in context: http://r.789695.n4.nabble.com/Conditioning-and-excluding-rows-in-a-table-tp4705044.html
Sent from the datatable-help mailing list archive at Nabble.com.
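A minimal sketch of one way to keep only the selected rows in the table;
it assumes stateProvince was read in as a factor, which is why the non-BR
states still appear with zero counts, and the small data frame below is a
made-up stand-in rather than the real dataset:

rb <- data.frame(stateProvince = factor(c("Bahia", "Rio de Janeiro", "Lima")),
                 countryCode   = c("BR", "BR", "PE"))

br <- rb[rb$countryCode == "BR", ]                 # keep only the Brazilian rows
br_states <- table(droplevels(br$stateProvince))   # drop the now-unused factor levels
barplot(br_states)                                 # only BR states appear in the graph

# data.table flavour, if rb is converted with setDT(rb):
# rb[countryCode == "BR", .N, by = stateProvince]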