[datatable-help] R beginner

Arunkumar Srinivasan aragorn168b at gmail.com
Mon Mar 16 18:49:15 CET 2015


Cyrille,
See the `?foverlaps` function from the data.table package or `?findOverlaps` from the GenomicRanges package. Both implement algorithms designed specifically for operating on interval ranges efficiently.
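
A minimal sketch of the `foverlaps` route, in case it helps. The toy values are invented for illustration and are not from the thread; the column names (`gene`, `start`, `end`, `name`, `pos`) follow Jim's example. It assumes data.table is installed and treats each miRNA position as a zero-width interval, since `foverlaps` expects an interval in both tables.

```r
library(data.table)

# Toy data (hypothetical values, for illustration only)
genes <- data.table(
  gene  = c("geneA", "geneB", "geneC"),
  start = c(100L, 150L, 400L),
  end   = c(200L, 300L, 500L)
)
miRNA <- data.table(
  name = c("mi1", "mi2", "mi3"),
  pos  = c(160L, 450L, 350L)
)

# foverlaps() compares intervals, so give each miRNA a zero-width
# interval [pos, pos]
miRNA[, `:=`(start = pos, end = pos)]

# The second table (y) must be keyed on its interval columns
setkey(genes, start, end)

# For a zero-width interval, "any" overlap is the same condition as
# start <= pos <= end; nomatch = 0L drops miRNAs that hit no gene
hits <- foverlaps(miRNA, genes,
                  by.x = c("start", "end"),
                  type = "any", nomatch = 0L)

# Keep only the miRNA name and the gene name
result <- hits[, .(name, gene)]
print(result)
```

Here mi1 falls inside both geneA and geneB, mi2 inside geneC, and mi3 inside no gene, so it is dropped.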

-- 
Arun

On 15 Mar 2015 at 22:41:18, jim holtman (jholtman at gmail.com) wrote:

I was off by a factor of 10; I thought it said 200,000 but it was only 20,000, so it takes only about 10 seconds to solve:

> n <- 20000
> mi <- 4500
> start <- sample(n * 10, n)  # start times
> int <- sample(1000, n, TRUE)  # interval between start and end
> genes <- data.frame(gene = paste0('gene', 1:n)
+                 , start = start
+                 , end = start + int
+                 , stringsAsFactors = FALSE
+                 )
> miRNA <- data.frame(name = paste0('mi', 1:mi)
+                 , pos = sample(n * 9, mi)
+                 , stringsAsFactors = FALSE
+                 )
> require(sqldf)
>
> system.time({
+ matches <- sqldf("
+     select m.*, g.*
+     from miRNA as m
+     join genes as g
+         on m.pos between g.start and g.end
+ ")
+ })       
   user  system elapsed
  10.91    0.02   10.96
> head(matches, 10)
   name  pos     gene start  end
1   mi1 3825  gene200  3634 4134
2   mi1 3825  gene385  3616 4241
3   mi1 3825  gene410  3492 4089
4   mi1 3825 gene1172  3707 3847
5   mi1 3825 gene1228  3825 3919
6   mi1 3825 gene1726  3586 4552
7   mi1 3825 gene1859  3633 4163
8   mi1 3825 gene1869  3269 4138
9   mi1 3825 gene2061  3812 4094
10  mi1 3825 gene2248  3225 3939
> str(matches)
'data.frame':   224028 obs. of  5 variables:
 $ name : chr  "mi1" "mi1" "mi1" "mi1" ...
 $ pos  : int  3825 3825 3825 3825 3825 3825 3825 3825 3825 3825 ...
 $ gene : chr  "gene200" "gene385" "gene410" "gene1172" ...
 $ start: int  3634 3616 3492 3707 3825 3586 3633 3269 3812 3225 ...
 $ end  : int  4134 4241 4089 3847 3919 4552 4163 4138 4094 3939 ...
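
For small inputs, the same join condition (start <= pos <= end) can be cross-checked in base R without any package. This is only a sketch on invented toy data, not part of the original reply, and it builds a full miRNA-by-gene logical matrix, so it is memory-hungry at the 4500 x 20000 scale discussed here.

```r
# Toy data (hypothetical values, for illustration only)
genes <- data.frame(gene  = c("geneA", "geneB", "geneC"),
                    start = c(100, 150, 400),
                    end   = c(200, 300, 500),
                    stringsAsFactors = FALSE)
miRNA <- data.frame(name = c("mi1", "mi2", "mi3"),
                    pos  = c(160, 450, 350),
                    stringsAsFactors = FALSE)

# Logical matrix: rows are miRNAs, columns are genes;
# TRUE where gene start <= miRNA pos <= gene end
inside <- outer(miRNA$pos, genes$start, ">=") &
          outer(miRNA$pos, genes$end,   "<=")

# Row/column indices of the TRUE cells
idx <- which(inside, arr.ind = TRUE)

# Translate the indices back into miRNA and gene names
pairs <- data.frame(name = miRNA$name[idx[, "row"]],
                    gene = genes$gene[idx[, "col"]],
                    stringsAsFactors = FALSE)
pairs <- pairs[order(pairs$name, pairs$gene), ]
print(pairs)
```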


Jim Holtman
Data Munger Guru
 
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Sun, Mar 15, 2015 at 3:41 PM, Papysounours <Cyrille.laurent.sage at gmail.com> wrote:
Hi

I am just starting R programming because I need it to analyse new sequencing
data. I have two lists of data (Excel tables): one is a gene list with chromosomal
positions (like start:123456 end:124567), the other is a miRNA list with only
one position (like 123789).
In the first list I have around 20000 rows (meaning 20000 gene names to
compare to) and in the second around 4500 rows (4500 miRNAs).
I want to compare each individual miRNA position
(genestart<=miRNA<=geneend) to the entire list of genes in order to get, in a
new table, the name of the miRNA (first column of the miRNA list) and the name
of the gene (first column of the gene list) related to the miRNA.
I hope this is not too much to ask.
Papy



--
View this message in context: http://r.789695.n4.nabble.com/R-beginner-tp4704684.html
Sent from the datatable-help mailing list archive at Nabble.com.
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
