[datatable-help] subset between data.table list and single data.table object

Matthew Dowle mdowle at mdowle.plus.com
Thu Aug 8 19:04:29 CEST 2013


On 08/08/13 17:12, Irucka Embry wrote:
> Hi Matthew, thank you for your advice.
>
> I went over the examples in data.table, thank you for the suggestion. 
> I also got rid of the lapply statements too.
>
> big <- lapply(sitefiles,freadDataRatingDepotFiles)
> big <- rbindlist(big)
> setnames(big,c("y", "shift", "x", "stor", "site_no"))
> big <- big[, y:=as.numeric(y)]
> big <- big[, x:=as.numeric(x)]
> big <- big[, shift:=as.numeric(shift)]
> big <- big[, stor:=NULL]
> big <- na.omit(big)
> big <- big[,y:=y+shift]
> big <- big[,shift:=NULL]
> big <- setkey(big, site_no)

Great. Looks good now.  You don't need to assign to big on the left, 
though: := already updates by reference.  The same goes for setkey: there 
is no need to assign the result, because it changes big by reference.  
Only na.omit returns a new object, so that result does need to be 
assigned to big.
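
A minimal sketch of that point, using a small invented table in place of 
the real big (the values are made up for illustration only):

```r
library(data.table)

# Invented stand-in for big; columns start as character
big <- data.table(
  y       = c("14.80", "14.81", NA),
  shift   = c("0.10", "0.20", "0.30"),
  x       = c("7900", "7920", "7930"),
  stor    = c("*", "*", "*"),
  site_no = c("a", "a", "b")
)

# := modifies big in place, so no `big <-` is needed
big[, y := as.numeric(y)]
big[, x := as.numeric(x)]
big[, shift := as.numeric(shift)]
big[, stor := NULL]

# na.omit returns a new object, so this result does need assigning
big <- na.omit(big)

big[, y := y + shift]
big[, shift := NULL]

# setkey also changes big by reference
setkey(big, site_no)
```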
>
> I have used dput as people on the main R help list had suggested that 
> dput be used instead of unformatted tables due to text-based e-mail 
> and help list.
That's true and appropriate when the question is to do with a specific 
error message.  In that case your aim is for potential answers to be 
able to reproduce the error message as easily and quickly as possible, 
so it needs to be pasteable.    You're asking a different type of 
question; i.e., how-do-I-do-this?  To answer quickly we can just type 
some pseudo code to give you a hint, possibly within 30 seconds from a 
mobile phone.  Is there any reason why you don't want to ask this on 
Stack Overflow? These types of questions are better asked there.

But also a table of data is easily readable using fread or readLines, if 
the answerer wants to test his answer before posting.
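
As a rough sketch of how a pasted table can be read back in, assuming a 
data.table version recent enough to have fread's text argument (the two 
rows are copied from the bigintermediate table further down, without the 
row-number prefix):

```r
library(data.table)

# A table pasted as plain text, as it might appear in an email
txt <- "y x site_no
14.80 7900 02437100
14.81 7920 02437100"

# colClasses keeps site_no as character so the leading zero survives
dt <- fread(text = txt, colClasses = c(site_no = "character"))
```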

> Based on your suggestions I have the input, intermediate table, and 
> the output tables.

Ok, further comment below ...

>
> Thank you.
>
> Irucka
>
>
>
> INPUT
> big
> y x site_no
> 1: 14.80 7900 /tried/02437100.exsa.rdb
> 2: 14.81 7920 /tried/02437100.exsa.rdb
> 3: 14.82 7930 /tried/02437100.exsa.rdb
> 4: 14.83 7950 /tried/02437100.exsa.rdb
> 5: 14.84 7970 /tried/02437100.exsa.rdb
> ---
> 112249: 57.86 2400000 /tried/07289000.exsa.rdb
> 112250: 57.87 2410000 /tried/07289000.exsa.rdb
> 112251: 57.88 2410000 /tried/07289000.exsa.rdb
> 112252: 57.89 2420000 /tried/07289000.exsa.rdb
> 112253: 57.90 2430000 /tried/07289000.exsa.rdb
>
>
> aimjoin
> site_no mean p50
> 1: 02437100 3882.65 1830.0
> 2: 02446500 819.82 382.0
> 3: 02467000 23742.37 10400.0
> 4: 03217500 224.72 50.0
> 5: 03219500 496.79 140.0
> ---
> 54: 06889000 5632.70 2620.0
> 55: 06891000 7018.45 3300.0
> 56: 06893000 52604.19 43200.0
> 57: 06934500 81758.03 61200.0
> 58: 07010000 186504.25 147000.0
> 59: 07289000 755685.30 687000.0
> site_no mean p50
>

What you've done is provide a small subset of your real large data. 
Easier and quicker for you, but harder for us to see. For example, I 
can't see all the data for site 02437100's mean.  The 2 groups of 5 rows 
that I suggested need to be the entire dataset.  A new one, a dummy 
one: a tiny toy example with simple data.  Please search for guidance on 
how to ask good questions. Please spend longer on Stack Overflow looking 
at others' questions for inspiration and guidance.
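
For illustration, a toy example of that kind might look like the sketch 
below. All values and site numbers are invented, and the subsetting rule 
(split sites by whether their mean is below their smallest x) is only one 
reading of the question, not a confirmed answer.

```r
library(data.table)

# Toy input: 2 sites x 5 rows each (all values invented)
big <- data.table(
  y       = c(1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2, 2.3, 2.4),
  x       = c(10, 20, 30, 40, 50, 100, 200, 300, 400, 500),
  site_no = rep(c("/tried/00000001.exsa.rdb",
                  "/tried/00000002.exsa.rdb"), each = 5)
)
aimjoin <- data.table(
  site_no = c("00000001", "00000002"),
  mean    = c(5, 250),
  p50     = c(25, 50)
)

# Reduce the file path to the bare site number, working on the column itself
big[, site_no := sub("\\.exsa\\.rdb$", "", basename(site_no))]

# Intermediate table: mean and p50 attached to every row of big
bigintermediate <- merge(big, aimjoin, by = "site_no")

# Flag each site by whether its mean lies below its smallest x
bigintermediate[, mean_below_min := mean < min(x), by = site_no]

bigextramean <- bigintermediate[mean_below_min == TRUE]
bigintermean <- bigintermediate[mean_below_min == FALSE]
```

The same pattern with p50 in place of mean would give the median-based 
tables.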


>
>
> INTERMEDIATE
> bigintermediate
> y x site_no mean p50
> 1: 14.80 7900 02437100 3882.65 1830.0
> 2: 14.81 7920 02437100 3882.65 1830.0
> 3: 14.82 7930 02437100 3882.65 1830.0
> 4: 14.83 7950 02437100 3882.65 1830.0
> 5: 14.84 7970 02437100 3882.65 1830.0
> ---
> 112249: 57.86 2400000 07289000 755685.30 687000.0
> 112250: 57.87 2410000 07289000 755685.30 687000.0
> 112251: 57.88 2410000 07289000 755685.30 687000.0
> 112252: 57.89 2420000 07289000 755685.30 687000.0
> 112253: 57.90 2430000 07289000 755685.30 687000.0
>
>
>
> OUTPUT
> bigintermean [where mean of site_no > min(x)]
> y x site_no mean
> ---
>
> ...
> 112249: 57.86 2400000 07289000 755685.30
> 112250: 57.87 2410000 07289000 755685.30
> 112251: 57.88 2410000 07289000 755685.30
> 112252: 57.89 2420000 07289000 755685.30
> 112253: 57.90 2430000 07289000 755685.30
>
> total of 109,452 rows
>
>
>
> bigintermedian [where p50 of site_no > min(x)]
> y x site_no p50
> ---
>
> ...
> 112249: 57.86 2400000 07289000 687000.0
> 112250: 57.87 2410000 07289000 687000.0
> 112251: 57.88 2410000 07289000 687000.0
> 112252: 57.89 2420000 07289000 687000.0
> 112253: 57.90 2430000 07289000 687000.0
>
> total of 109,452 rows
>
>
>
>
> bigextramean [where mean of site_no < min(x)]
> y x site_no mean
> 1: 14.80 7900 02437100 3882.65
> 2: 14.81 7920 02437100 3882.65
> 3: 14.82 7930 02437100 3882.65
> 4: 14.83 7950 02437100 3882.65
> 5: 14.84 7970 02437100 3882.65
>
> total of 2671 rows
>
>
> bigextramedian [where p50 of site_no < min(x)]
> y x site_no p50
> 1: 14.80 7900 02437100 1830.0
> 2: 14.81 7920 02437100 1830.0
> 3: 14.82 7930 02437100 1830.0
> 4: 14.83 7950 02437100 1830.0
> 5: 14.84 7970 02437100 1830.0
>
> total of 2671 rows
>
>
>
> bigextrameanmax [where mean of site_no > max(x)]
> y x site_no mean
> 1: 14.80 7900 02437100 3882.65
> 2: 14.81 7920 02437100 3882.65
> 3: 14.82 7930 02437100 3882.65
> 4: 14.83 7950 02437100 3882.65
> 5: 14.84 7970 02437100 3882.65
>
> total of 2671 rows
>
>
> bigextramedianmax [where p50 of site_no > max(x)]
> y x site_no p50
> 1: 14.80 7900 02437100 1830.0
> 2: 14.81 7920 02437100 1830.0
> 3: 14.82 7930 02437100 1830.0
> 4: 14.83 7950 02437100 1830.0
> 5: 14.84 7970 02437100 1830.0
>
> total of 2671 rows
>
>
>
>
>
>
> <-----Original Message----->
> >From: Matthew Dowle [mdowle at mdowle.plus.com]
> >Sent: 8/7/2013 11:16:37 PM
> >To: iruckaE at mail2world.com
> >Cc: datatable-help at lists.r-forge.r-project.org
> >Subject: Re: [datatable-help] subset between data.table list and single data.table object
> >
> >Hm. Have you worked through the examples of data.table? Type
> >example(data.table) and try to thoroughly understand each and every
> >example. Just forget your immediate problem for the moment, then come
> >back to it once you've looked at the examples.
> >
> >Further comments inline ...
> >
> >
> >On 07/08/13 23:44, iembry wrote:
> >> Hi Steve and Matthew, thank you both for your suggestions. This is the
> >> code that I have now:
> >>
> >> freadDataRatingDepotFiles <- function (file)
> >> {
> >> RDdatatmp <- fread(file, autostart=40)
> >> RDdatatmp[, site:= file]
> >> }
> >>
> >> big <- lapply(sitefiles,freadDataRatingDepotFiles)
> >> big <- rbindlist(big)
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> setnames(big[[u]], c("y", "shift", "x", "stor", "site_no")))
> >That lapply and big[[u]] doesn't make much sense. big is one big table,
> >with one set of column names. Why loop setnames?
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> big[[u]][, y:=as.numeric(y)])
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> big[[u]][, x:=as.numeric(x)])
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> big[[u]][, shift:=as.numeric(shift)])
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> big[[u]][, stor:=NULL])
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> na.omit(big[[u]]))
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> big[[u]][,y:=y+shift])
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> big[[u]][,shift:=NULL])
> >> big <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> setkey(big[[u]], site_no))
> >Again, all these lapply don't make much sense now big is one big table.
> >>
> >> I am trying to subset big based on the mean and median values in
> >> aimjoin (as described previously in this message thread).
> >
> >But that part of the message thread is no longer here. So I'd have to
> >go and hunt for it.
> >
> >>
> >> This is the first row of aimjoin:
> >> dput(aimjoin[1])
> >> structure(list(site_no = "02437100", mean = 3882.65, p50 = 1830),
> >> .Names = c("site_no", "mean", "p50"), sorted = "site_no",
> >> class = c("data.table", "data.frame"), row.names = c(NA, -1L),
> >> .internal.selfref = <pointer: 0x1bb7d88>)
> >>
> >> This is one element of big:
> >> tempbigdata <- data.frame(c(14.80, 14.81, 14.82), c(7900, 7920, 7930),
> >> c("/tried/02437100.exsa.rdb", "/tried/02437100.exsa.rdb",
> >> "/tried/02437100.exsa.rdb"), stringsAsFactors = FALSE)
> >> names(tempbigdata) <- c("y", "x", "site_no")
> >> tempbigdat <- gsub("/tried/", "", tempbigdata)
> >> tempbigdat <- gsub(".exsa.rdb", "", tempbigdat)
> >
> >Please paste the data itself laid out just like you see it at the
> >prompt. I find it difficult to parse dput output in emails. And longer
> >to paste it into an R session before I see. I often read and reply from
> >a mobile phone, as do others I guess. Questions like this are better
> >presented on stack overflow.
> >
> >> # I tried to remove all characters in the column site_no except for
> >> the actual site number, but I ended up with a character vector
> >> instead of a data.table
> >>
> >> This is a revised version of the code that I had written previously to
> >> perform the subsetting (prior to using data.table):
> >> mp <- lapply(seq_along(dailyvaluesneednew$site_no), function(u)
> >> {ifelse(aimjoin[1]$mean[u] < min(big[[u]]$x), subset(getratings[[u]],
> >> aimjoin[1]$mean[u] > min(big[[u]]$x) & aimjoin[1]$mean[u],
> >> aimjoin[u]$mean[u] > min(big[[u]]$x)), aimjoin[1]$mean[u])})
> >Again, maybe by big[[u]] you mean big[u] if big is keyed, but I didn't
> >see a setkey above. Seems like you maybe want [,...,by=site].
> >>
> >>
> >> I have tried to join aimjoin and big, but I received the error message
> >> below:
> >>
> >> aimjoin[J(big$site_no)]
> >> Error in `[.data.table`(aimjoin, J(big$site_no)) :
> >> x.'site_no' is a character column being joined to i.'V1' which is type
> >> 'NULL'. Character columns must join to factor or character columns.
> >I guess that 'site_no' isn't a column of big ... typo of 'site_no'?
> >anyList$notthere is NULL in R and only NULL itself is type NULL, hence
> >the guess.
> >>
> >>
> >> I also tried to merge aimjoin and big, but it was not what I wanted. I
> >> would like for the mean and p50 values -- for each site number -- to be
> >> joined to the site number in big. I figure that would make it easier to
> >> perform the subsetting.
> >Please see examples of good questions on Stack Overflow. There you see
> >people put examples of their input and what their desired output is for
> >that input data. I really can't see what you're trying to do.
> >>
> >> I want to subset big based on whether or not the mean or median in
> >> aimjoin is less than the minimum value of x in big. Those mean or median
> >> values in aimjoin that are smaller than x in big will have to be grouped
> >> together for a future step & those mean or median values in aimjoin that
> >> are equal to or larger than the x in big will be grouped together for a
> >> future step.
> >>
> >> Can you provide me with advice on how to proceed with the subsetting?
> >Try to construct a really good toy example that demonstrates what you
> >want. Show input and desired output. In this case 2 groups of 5 rows
> >each should be enough to demonstrate.
> >
> >>
> >> Thank you.
> >>
> >> Irucka
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://r.789695.n4.nabble.com/subset-between-data-table-list-and-single-data-table-object-tp4673202p4673308.html
> >> Sent from the datatable-help mailing list archive at Nabble.com.
> >>
> >
> >.
> >
>
>
