[datatable-help] subset between data.table list and single data.table object

Steve Lianoglou lianoglou.steve at gene.com
Thu Aug 8 18:52:57 CEST 2013


Hi,

On Thu, Aug 8, 2013 at 9:12 AM, Irucka Embry <iruckaE at mail2world.com> wrote:
> Hi Matthew, thank you for your advice.
>
> I went over the examples in data.table, thank you for the suggestion. I also
> got rid of the lapply statements too.
>
>
> big <- lapply(sitefiles,freadDataRatingDepotFiles)
> big <- rbindlist(big)
> setnames(big,c("y", "shift", "x", "stor", "site_no"))
>
> big <- big[, y:=as.numeric(y)]
> big <- big[, x:=as.numeric(x)]
> big <- big[, shift:=as.numeric(shift)]
> big <- big[, stor:=NULL]
>
> big <- na.omit(big)
> big <- big[,y:=y+shift]
> big <- big[,shift:=NULL]
> big <- setkey(big, site_no)
>
> I have used dput as people on the main R help list had suggested that dput
> be used instead of unformatted tables due to text-based e-mail and help
> list. Based on your suggestions I have the input, intermediate table, and
> the output tables.

I would argue that dput is still useful. People prefer it so they can
copy and paste text from an email into an R session and give you
concrete advice based on your data. It is still your
responsibility to produce an example that people can work with, though
-- Matthew suggested making a good toy example with tables that are
much smaller than your real data so we can easily distill what you
want there -- in constructing these examples, you may (likely) figure
out how to fix the problem yourself and not require more hand holding.
Provide two toy tables to play with -- not your current problem that
has 50
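
A toy pair of tables along those lines might look like the sketch
below. The names `toy_big` and `toy_aim` are hypothetical, and the rows
marked "made up" are invented for illustration; only the `site_no`,
`mean` and `p50` values are copied from your printouts. Converting to a
plain data.frame before dput() is my own choice here, to keep the
pasted text free of data.table's internal attributes.

```r
library(data.table)

# Hypothetical toy stand-in for `big`: two sites, two rows each.
toy_big <- data.table(
  y       = c(14.80, 14.81, 3.10, 3.11),   # last two values are made up
  x       = c(7900, 7920, 400, 410),       # last two values are made up
  site_no = c("02437100", "02437100", "02446500", "02446500")
)

# Hypothetical toy stand-in for `aimjoin`, values taken from the post.
toy_aim <- data.table(
  site_no = c("02437100", "02446500"),
  mean    = c(3882.65, 819.82),
  p50     = c(1830.0, 382.0)
)

# dput() prints R code that recreates the object, so readers can
# paste it straight into their own session.
dput(as.data.frame(toy_big))
dput(as.data.frame(toy_aim))
```

Readers can then wrap the pasted text in as.data.table() to get back a
data.table.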

These examples are too involved for me to follow along as they are,
sorry. That having been said, I'll simply point out that you need to
fix the `site_no` column in your `big` table -- I suspect you might
want to use that column to join across several of your data.tables
(since it looks like they all have `site_no`).

But look at `big`:

> big
>             y       x                  site_no
>      1: 14.80    7900 /tried/02437100.exsa.rdb
>      2: 14.81    7920 /tried/02437100.exsa.rdb
>      3: 14.82    7930 /tried/02437100.exsa.rdb
>      4: 14.83    7950 /tried/02437100.exsa.rdb
>      5: 14.84    7970 /tried/02437100.exsa.rdb
>     ---
> 112249: 57.86 2400000 /tried/07289000.exsa.rdb
> 112250: 57.87 2410000 /tried/07289000.exsa.rdb
> 112251: 57.88 2410000 /tried/07289000.exsa.rdb
> 112252: 57.89 2420000 /tried/07289000.exsa.rdb
> 112253: 57.90 2430000 /tried/07289000.exsa.rdb

Now look at the `site_no` values in the rest of your tables:

> aimjoin
>     site_no     mean     p50
> 1: 02437100  3882.65  1830.0
> 2: 02446500   819.82   382.0
> 3: 02467000 23742.37 10400.0
> 4: 03217500   224.72    50.0
> 5: 03219500   496.79   140.0

You (obviously) need to strip the "/tried/" from the beginning and
".exsa.rdb" from the end of your big$site_no column for it to work with
the rest of your data.

You can do that like so:

R> big[, site_no := sub(".exsa.rdb", "", basename(site_no), fixed=TRUE)]
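
To see what that does, here is a minimal sketch on a two-row stand-in
for `big`. It assumes the file paths live in the `site_no` column, as
your printout shows; the two rows are copied from that printout.

```r
library(data.table)

# Two rows copied from the `big` printout in the post.
big <- data.table(
  y       = c(14.80, 57.86),
  x       = c(7900, 2400000),
  site_no = c("/tried/02437100.exsa.rdb", "/tried/07289000.exsa.rdb")
)

# basename() drops the directory part ("/tried/");
# sub(..., fixed = TRUE) removes the literal ".exsa.rdb" suffix.
big[, site_no := sub(".exsa.rdb", "", basename(site_no), fixed = TRUE)]

print(big$site_no)
```

With fixed=TRUE the dots in ".exsa.rdb" are matched literally rather
than as regular-expression wildcards.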

Once that's done, you can easily get from `big` and `aimjoin` to
`bigintermediate`. As for the rest, it's not clear to me what your
queries of "where mean of site_no > min(x)" really mean. You can
still use `subset` on data.tables just as you can on data.frames, and
it seems to me that calling it judiciously gets you where you want to
go, so I'm not quite sure what the problem is -- it's probably my
understanding of what you want.
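
As a rough sketch of what I mean, assuming `site_no` has already been
cleaned up: key both tables on `site_no`, join them, and then call
subset() on the result. The tables below are toy stand-ins (rows marked
"made up" are invented), `joined` and `kept` are hypothetical names, and
the condition `x > mean` is only a guess at your query, so swap in
whatever you actually want.

```r
library(data.table)

# Toy stand-in for `big` after site_no has been cleaned up.
big <- data.table(
  site_no = c("02437100", "02437100", "02446500"),
  y       = c(14.80, 14.81, 3.10),   # 3.10 is made up
  x       = c(7900, 1500, 400)       # 1500 and 400 are made up
)

# Toy stand-in for `aimjoin`, values taken from the post.
aimjoin <- data.table(
  site_no = c("02437100", "02446500"),
  mean    = c(3882.65, 819.82),
  p50     = c(1830.0, 382.0)
)

# Key both tables on the shared column, then join.
setkey(big, site_no)
setkey(aimjoin, site_no)
joined <- big[aimjoin]   # rows of big matched to each row of aimjoin

# subset() works on a data.table just as on a data.frame.
# The condition here is a guess at the intended query.
kept <- subset(joined, x > mean)
print(kept)
```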

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

