[Rcpp-devel] add new components to list without specifying list size initially

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Aug 12 23:03:02 CEST 2011


Hi Walrus,

While I'm a huge fan of Rcpp, I think you'll be a bit better served
(for the time being) to read up on some of the bioconductor packages
that are suited for these types of things.

In particular I am thinking about the IRanges and GenomicRanges
packages. They R wrappers to what is basically an IntervalTree that
you can annotate, and then use to perform fast overlap/intersection
queries.

For instance:

R> library(GenomicRanges)
R> probes <- GRanges('chr1', IRanges(c(81,85), c(85,100)), strand='*')
R> genes <- GRanges(c('chr1', 'chr1', 'chr2'), IRanges(c(11, 111, 11),
c(90, 190, 90)), strand='*',
     name=c('g1', 'g2', 'g3'))

R> genes
GRanges with 3 ranges and 1 elementMetadata value
    seqnames     ranges strand |        name
       <Rle>  <IRanges>  <Rle> | <character>
[1]     chr1 [ 11,  90]      * |          g1
[2]     chr1 [111, 190]      * |          g2
[3]     chr2 [ 11,  90]      * |          g3

## How many probes does each gene have land in it?
R> countOverlaps(genes, probes)
[1] 2 0 0

## Which probes are these?

R> subsetByOverlaps(probes, genes)
GRanges with 2 ranges and 0 elementMetadata values
    seqnames    ranges strand |
       <Rle> <IRanges>  <Rle> |
[1]     chr1 [81,  85]      * |
[2]     chr1 [85, 100]      * |

## and much more stuff

There's a mess load of functionality in IRanges, GenomicRanges,
Biostrings packages that you'll likely find very useful, and efficient
(much of the core of these packages are written in C) if you're doing
a lot of bioinformatics/genomics work. So, taking some time to get
familiar with those will be useful -- you'll find that you'll also
need to drop into Rcpp for other stuff (as I do, too), so it will
still be useful for you in the future.

That's just my 2 cents.

-steve


On Thu, Aug 11, 2011 at 9:44 PM, Walrus Foolhill
<walrus.foolhill at gmail.com> wrote:
> Ok, thanks for your answer, but I wasn't clear enough. So here are more
> details of what I want to do.
>
> I have one list named "probes":
> probes <- list(chr1=data.frame(name=c("p1","p2"),
>                  start=c(81,95),
>                  end=c(85,100),
>                  stringsAsFactors=FALSE))
>
> I also have one list named "genes":
> genes <- list(chr1=data.frame(name=c("g1","g2"), start=c(11,111),
> end=c(90,190)),
>                 chr2=data.frame(name="g3", start=11, end=90))
>
> I need to compare those two lists in order to obtain the following list
> which contains, for each gene, the name of the probes included in it:
> links <- list(chr1=list(g1=c("p1")))
>
> Here is my R function (assuming that the probes are sorted based on their
> start and end coordinates):
>
> fun.l <- function(genes, probes){
>   links <- lapply(names(genes), function(chr.name){
>     if(! chr.name %in% names(probes))
>       return(NULL)
>
>     res <- list()
>
>     genes.c <- genes[[chr.name]]
>     probes.c <- probes[[chr.name]]
>
>     for(gene.name in genes.c$name){
>       gene <- genes.c[genes.c$name == gene.name,]
>       res[[gene.name]] <- vector()
>       for(probe.name in probes.c$name){
>         probe <- probes.c[probes.c$name == probe.name,]
>         if(probe$start >= gene$start && probe$end <= gene$end)
>           res[[gene.name]] <- append(res[[gene.name]], probe.name)
>         else if(probe$start > gene$end)
>           break
>       }
>       if(length(res[[gene.name]]) == 0)
>         res[[gene.name]] <- NULL
>     }
>
>     if(length(res) == 0)
>       res <- NA
>     return(res)
>   })
>   names(links) <- names(genes)
>   links <- Filter(function(links.c){!is.null(links.c)}, links)
>   return(links)
> }
>
> And here is the beginning of my attempt using Rcpp:
>
> src <- '
> using namespace Rcpp;
>
> List genes = List(genes_in);
> int genes_nb_chr = genes.length();
> std::vector<std::string> genes_chr = genes.names();
>
> List probes = List(probes_in);
> int probes_nb_chr = probes.length();
>
> std::vector< std::vector<std::string> > links;
>
> // the main task is performed in this loop
> for(int chrnum=0; chrnum<genes_nb_chr; ++chrnum){
>   DataFrame genes_c = DataFrame(genes[chrnum]);
>   // ... add code to map probes on genes, that is fill "links" ...
> }
>
> return wrap(links);
> '
>
> funC <- cxxfunction(signature(genes_in="list",
>                                 probes_in="list"),
>                       body=src, plugin="Rcpp")
>
> The problem starts quite early: when I compile this piece of code, I get
> "error: call of overloaded ‘DataFrame(Rcpp::internal::generic_proxy<19>)’ is
> ambiguous".
>
> What should I do to go through the "probes" and "genes" lists given as
> input? Maybe more generically, how can we go through a list of lists (of
> lists...) with Rcpp?
>
> 2nd (small) question, I don't manage to use Rprintf when using inline, for
> instance Rprintf("%d\n", i);, it complains about the quotes. What should I
> do to print statement from within the for loop?
>
> Thanks in advance. As my question is very long, I won't mind if you tell me
> to find another way by myself. But maybe one of you can put me on the good
> track.
>
> On Thu, Aug 11, 2011 at 7:00 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
>>
>> On 11 August 2011 at 03:06, Walrus Foolhill wrote:
>> | Hello,
>> | I need to create a list and then fill it sequentially by adding
>> components in a
>> | for loop. Here is an example that works:
>> |
>> | library(inline)
>> | src <- '
>> | Rcpp::List mylist(2);
>> | for(int i=0; i<2; ++i)
>> |   mylist[i] = i;
>> | mylist.names() = CharacterVector::create("a","b");
>> | return mylist;
>> | '
>> | fun <- cxxfunction(body=src, plugin="Rcpp")
>> | print(fun())
>> |
>> | But what I really want is to create an empty list and then fill it, that
>> is
>> | without specifying its number of components before hand... This is
>> because I
>> | don't know in advance at which step of the for loop I will need to
>> create a new
>> | component. Here is an example, that obviously doesn't work, but that
>> should
>> | show what I am looking for:
>> |
>> | Rcpp::List mylist;
>> | CharacterVector names = CharacterVector::create("a", "b");
>>
>> If you know how long names is, you know how long mylist going to be ....
>>
>> | for(int i=0; i<2; ++i){
>> |   mylist.add(names[i], IntegerVector::create());
>> |   mylist[names[i]].push_back(i);
>>
>> I don't understand what that is trying to do.
>>
>> | }
>> | return mylist;
>> |
>> | Do you know how I could achieve this? Thanks.
>>
>> Rcpp::List is an alias for Rcpp::GenericVector, and derives from Vector.
>> You
>> can look at the public member functions -- there are things like
>>
>>    push_back()
>>    push_front()
>>    insert()
>>
>> etc that behave like STL functions __but are inefficient as we (almost
>> always) need to copy the whole object__ so they are not recommended.
>>
>> When I had to deal with 'unknown quantities of data' returning I was
>> mostly
>> able to either turn it into a 'fixed or known columns, unknow rows'
>> problem
>> (easy, just grow row-wise) or I 'cached' in a C++ data structure first
>> before
>> returning to R via Rcpp structures -- and then I knew the dimensions for
>> the
>> to-be-created object too.
>>
>> Dirk
>>
>>
>> --
>> Two new Rcpp master classes for R and C++ integration scheduled for
>> New York (Sep 24) and San Francisco (Oct 8), more details are at
>>
>> http://dirk.eddelbuettel.com/blog/2011/08/04#rcpp_classes_2011-09_and_2011-10
>
>
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


More information about the Rcpp-devel mailing list