[datatable-help] R beginner

Sun Mar 15 22:16:49 CET 2015

You didn't provide any test data, so I made some up based on the sizes you
gave (note that I used 200,000 genes rather than the 20,000 you mentioned).
This uses the 'sqldf' package, and it took about 2 minutes to come up with
the matches.

> n <- 200000
> mi <- 4500
> start <- sample(n * 10, n)  # start times
> int <- sample(1000, n, TRUE)  # interval between start and end
> genes <- data.frame(gene = paste0('gene', 1:n)
+                 , start = start
+                 , end = start + int
+                 , stringsAsFactors = FALSE
+                 )
> miRNA <- data.frame(name = paste0('mi', 1:mi)
+                 , pos = sample(n * 9, mi)
+                 , stringsAsFactors = FALSE
+                 )
> require(sqldf)
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: DBI
> matches <- sqldf("
+     select m.*, g.*
+     from miRNA as m
+     join genes as g
+         on m.pos between g.start and g.end
+ ")
Loading required package: tcltk
>
> str(matches)
'data.frame':   225045 obs. of  5 variables:
 $ name : chr  "mi1" "mi1" "mi1" "mi1" ...
 $ pos  : int  279341 279341 279341 279341 279341 279341 279341 279341 279341 279341 ...
 $ gene : chr  "gene3133" "gene14326" "gene14997" "gene17652" ...
 $ start: int  279000 278623 279157 279296 278379 279055 279180 279273 278938 278960 ...
 $ end  : int  279924 279444 280150 279930 279347 279861 279782 280268 279791 279796 ...
> head(matches)
  name    pos      gene  start    end
1  mi1 279341  gene3133 279000 279924
2  mi1 279341 gene14326 278623 279444
3  mi1 279341 gene14997 279157 280150
4  mi1 279341 gene17652 279296 279930
5  mi1 279341 gene21208 278379 279347
6  mi1 279341 gene30889 279055 279861
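
If you want to convince yourself that the join condition does what you
expect, here is a minimal base R sketch of the same inclusive range match
on a toy data set. The gene and miRNA values below are made up purely for
illustration, and the cross join via merge(..., by = NULL) is a swapped-in
brute-force technique, not what sqldf does internally. It builds every
miRNA/gene pair first, so it is only sensible for small tables.

```r
# Toy data, small enough to check by hand (values are invented).
genes <- data.frame(gene  = c("geneA", "geneB", "geneC"),
                    start = c(100L, 150L, 400L),
                    end   = c(200L, 300L, 500L),
                    stringsAsFactors = FALSE)
miRNA <- data.frame(name = c("mi1", "mi2", "mi3"),
                    pos  = c(160L, 200L, 350L),
                    stringsAsFactors = FALSE)

# Cross join: by = NULL pairs every miRNA row with every gene row.
pairs <- merge(miRNA, genes, by = NULL)

# Keep the pairs where start <= pos <= end (inclusive, like SQL BETWEEN),
# and only the two name columns asked for in the original question.
hits <- pairs[pairs$pos >= pairs$start & pairs$pos <= pairs$end,
              c("name", "gene")]
hits <- hits[order(hits$name, hits$gene), ]
rownames(hits) <- NULL
print(hits)
```

Here mi1 (160) and mi2 (200) each fall inside geneA and geneB, and mi3 (350)
falls inside no gene, so it drops out, just as it would with the inner join
above.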

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Sun, Mar 15, 2015 at 3:41 PM, Papysounours <Cyrille.laurent.sage at gmail.com> wrote:

> Hi
>
> I am just starting R programming because I need it to analyse new
> sequencing data. I have two lists of data (Excel tables): one is a gene
> list with chromosomal positions (like start:123456 end:124567), the other
> is a miRNA list with only one position (like 123789).
> In the first list I have around 20000 rows (meaning 20000 gene names to
> compare to) and in the second around 4500 rows (4500 miRNAs).
> I want to compare the position of each individual miRNA
> (genestart<=miRNA<=geneend) to the entire list of genes in order to get,
> in a new table, the name of the miRNA (first column of the miRNA list) and
> the name of the gene (first column of the gene list) related to the miRNA.
> Hope this is not too much to ask.
> Papy