<div dir="ltr"><div>You didn't provide any test data, so I made some up: 200,000 genes (ten times the roughly 20,000 you mentioned) and 4,500 miRNAs. This uses the 'sqldf' package, and the join took about 2 minutes to produce the matches.</div><div><br></div><div>> n <- 200000<br>> mi <- 4500<br>> start <- sample(n * 10, n) # start times<br>> int <- sample(1000, n, TRUE) # interval between start and end<br>> genes <- data.frame(gene = paste0('gene', 1:n)<br>+ , start = start<br>+ , end = start + int<br>+ , stringsAsFactors = FALSE<br>+ )<br>> miRNA <- data.frame(name = paste0('mi', 1:mi)<br>+ , pos = sample(n * 9, mi)<br>+ , stringsAsFactors = FALSE<br>+ )<br>> require(sqldf)<br>Loading required package: sqldf<br>Loading required package: gsubfn<br>Loading required package: proto<br>Loading required package: RSQLite<br>Loading required package: DBI<br>> matches <- sqldf("<br>+ select m.*, g.*<br>+ from miRNA as m<br>+ join genes as g<br>+ on m.pos between g.start and g.end<br>+ ")<br>Loading required package: tcltk<br>> <br>> str(matches)<br>'data.frame': 225045 obs. of 5 variables:<br> $ name : chr "mi1" "mi1" "mi1" "mi1" ...<br> $ pos : int 279341 279341 279341 279341 279341 279341 279341 279341 279341 279341 ...<br> $ gene : chr "gene3133" "gene14326" "gene14997" "gene17652" ...<br> $ start: int 279000 278623 279157 279296 278379 279055 279180 279273 278938 278960 ...<br> $ end : int 279924 279444 280150 279930 279347 279861 279782 280268 279791 279796 ...<br>> head(matches)<br> name pos gene start end<br>1 mi1 279341 gene3133 279000 279924<br>2 mi1 279341 gene14326 278623 279444<br>3 mi1 279341 gene14997 279157 280150<br>4 mi1 279341 gene17652 279296 279930<br>5 mi1 279341 gene21208 278379 279347<br>6 mi1 279341 gene30889 279055 279861</div><div><br></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><br>Jim Holtman<br>Data Munger Guru<br> <br>What is the problem that you are trying to solve?<br>Tell me what you want to do, not how you want to do it.</div></div>
<br><div class="gmail_quote">On Sun, Mar 15, 2015 at 3:41 PM, Papysounours <span dir="ltr"><<a href="mailto:Cyrille.laurent.sage@gmail.com" target="_blank">Cyrille.laurent.sage@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi<br>
<br>
I am just starting R programming because I need it to analyse new sequencing<br>
data. I have two lists of data (Excel tables): one is a gene list with chromosomal<br>
positions (like start:123456 end:124567), the other is a miRNA list with only<br>
one position (like 123789).<br>
In the first list I have around 20000 rows (meaning 20000 gene names to<br>
compare to) and in the second around 4500 rows (4500 miRNAs).<br>
I want to compare each individual miRNA position (<br>
genestart<=miRNA<=geneend ) against the entire list of genes in order to get, in a<br>
new table, the name of the miRNA (first column of the miRNA list) and the name<br>
of the gene (first column of the gene list) related to the miRNA.<br>
Hope this is not too much to ask.<br>
Papy<br>
<br>
<br>
<br>
--<br>
View this message in context: <a href="http://r.789695.n4.nabble.com/R-beginner-tp4704684.html" target="_blank">http://r.789695.n4.nabble.com/R-beginner-tp4704684.html</a><br>
Sent from the datatable-help mailing list archive at Nabble.com.<br>
</blockquote></div><br></div>
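<div>The range test described in the question (genestart<=miRNA<=geneend) can also be sketched in base R, without any packages. This is only a minimal illustration, not the sqldf approach used in the reply: the toy values and the names hits and matches_base are made up for the example, and the column names (gene, start, end, name, pos) are assumed to match the test data above.</div>

```r
# Toy data: hypothetical values chosen for illustration only
genes <- data.frame(gene  = c("geneA", "geneB", "geneC"),
                    start = c(100L, 150L, 400L),
                    end   = c(200L, 300L, 500L),
                    stringsAsFactors = FALSE)
miRNA <- data.frame(name = c("mi1", "mi2", "mi3"),
                    pos  = c(160L, 450L, 900L),
                    stringsAsFactors = FALSE)

# For each miRNA, find the genes whose [start, end] range contains its position.
# Each iteration is one vectorized comparison against all genes.
hits <- lapply(seq_len(nrow(miRNA)), function(i) {
  idx <- which(genes$start <= miRNA$pos[i] & miRNA$pos[i] <= genes$end)
  data.frame(name = rep(miRNA$name[i], length(idx)),
             gene = genes$gene[idx],
             stringsAsFactors = FALSE)
})

# Stack the per-miRNA results into one table of (miRNA name, gene name) pairs;
# miRNAs with no containing gene contribute zero rows.
matches_base <- do.call(rbind, hits)
print(matches_base)
```

<div>A miRNA that falls inside several genes appears once per gene, as in the sqldf result, and one that falls inside none is simply absent from the table.</div>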