[datatable-help] R beginner
jim holtman
jholtman at gmail.com
Sun Mar 15 22:41:07 CET 2015
I was off by a factor of 10; I thought it said 200,000, but it was only
20,000, so it takes only about 10 seconds to solve:
> n <- 20000
> mi <- 4500
> start <- sample(n * 10, n) # start times
> int <- sample(1000, n, TRUE) # interval between start and end
> genes <- data.frame(gene = paste0('gene', 1:n)
+ , start = start
+ , end = start + int
+ , stringsAsFactors = FALSE
+ )
> miRNA <- data.frame(name = paste0('mi', 1:mi)
+ , pos = sample(n * 9, mi)
+ , stringsAsFactors = FALSE
+ )
> require(sqldf)
>
> system.time({
+ matches <- sqldf("
+ select m.*, g.*
+ from miRNA as m
+ join genes as g
+ on m.pos between g.start and g.end
+ ")
+ })
   user  system elapsed
  10.91    0.02   10.96
> head(matches, 10)
   name  pos     gene start  end
1   mi1 3825  gene200  3634 4134
2   mi1 3825  gene385  3616 4241
3   mi1 3825  gene410  3492 4089
4   mi1 3825 gene1172  3707 3847
5   mi1 3825 gene1228  3825 3919
6   mi1 3825 gene1726  3586 4552
7   mi1 3825 gene1859  3633 4163
8   mi1 3825 gene1869  3269 4138
9   mi1 3825 gene2061  3812 4094
10  mi1 3825 gene2248  3225 3939
> str(matches)
'data.frame': 224028 obs. of 5 variables:
$ name : chr "mi1" "mi1" "mi1" "mi1" ...
$ pos : int 3825 3825 3825 3825 3825 3825 3825 3825 3825 3825 ...
$ gene : chr "gene200" "gene385" "gene410" "gene1172" ...
$ start: int 3634 3616 3492 3707 3825 3586 3633 3269 3812 3225 ...
$ end : int 4134 4241 4089 3847 3919 4552 4163 4138 4094 3939 ...
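
Since this is the data.table list, the same "position falls inside an interval" match could probably also be written with data.table's foverlaps(). The following is only a minimal sketch on made-up toy data, not something timed against the sqldf run above; the helper columns pos_start and pos_end are names introduced here purely for illustration.

```r
library(data.table)

# Toy data, far smaller than the 20,000 x 4,500 case above (values are made up).
genes <- data.table(gene  = c("geneA", "geneB", "geneC"),
                    start = c(100L, 150L, 400L),
                    end   = c(200L, 300L, 500L))
miRNA <- data.table(name = c("mi1", "mi2", "mi3"),
                    pos  = c(160L, 450L, 900L))

# foverlaps() needs an interval on both sides, so treat each miRNA
# position as a zero-length interval [pos, pos] (hypothetical helper columns).
miRNA[, `:=`(pos_start = pos, pos_end = pos)]

# The table being looked up must be keyed on its interval columns.
setkey(genes, start, end)

# type = "within" keeps rows where start <= pos and pos <= end;
# nomatch = 0L drops miRNAs that fall in no gene.
matches <- foverlaps(miRNA, genes,
                     by.x = c("pos_start", "pos_end"),
                     by.y = c("start", "end"),
                     type = "within",
                     nomatch = 0L)

# Keep the same columns as the sqldf result.
matches <- matches[, .(name, pos, gene, start, end)]
print(matches)
```

On the toy data, mi1 falls inside two genes, mi2 inside one, and mi3 inside none, so it is dropped. Whether this beats the sqldf join on the full-size data would need to be measured.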
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Sun, Mar 15, 2015 at 3:41 PM, Papysounours <
Cyrille.laurent.sage at gmail.com> wrote:
> Hi
>
> I am just starting R programming because I need it to analyse new sequencing
> data. I have two lists of data (Excel tables): one is a gene list with
> chromosomal positions (like start:123456 end:124567), the other is a miRNA
> list with only one position (like 123789).
> The first list has around 20000 rows (meaning 20000 gene names to compare
> to) and the second around 4500 rows (4500 miRNAs).
> I want to compare each individual miRNA position
> (genestart<=miRNA<=geneend) to the entire list of genes in order to get, in a
> new table, the name of the miRNA (first column of the miRNA list) and the
> name of the gene (first column of the gene list) related to the miRNA.
> Hope this is not too much to ask.
> Papy