[datatable-help] Subsetting columns in data.table
Matthew Dowle
mdowle at mdowle.plus.com
Mon Nov 19 23:06:51 CET 2012
Hi Berto,
Thanks. I think I get it now. How about this :
DT5 = data.table(x=rep(c("a","b","c"), each=3), y=c(1,3,6), s=4:12,
t=0:8, u=2:10, v=1:9, w=3:11)
othercols = setdiff(names(DT5),c("x","y"))
DT5[y<=3 & Reduce(`&`,lapply(DT5[,othercols,with=FALSE],`<=`,7)),
max(y), by=x]
x V1
1: a 3
2: b 1
HTH
Matthew
On 17.11.2012 07:47, Berto wrote:
> Matthew Dowle wrote
>> Hi Berto,
>>
>> I think we may be experiencing a language barrier here. This is
>> datatable-help; i.e.,
>> not r-help. You *can* write in another language on this list, if
>> you'd
>> like to, in
>> case someone else here understands better. The rules are less strict
>> here. Nobody has yet
>> done so, but there is no rule against it. Why not?
>>
>> Proceeding in English for now ...
>
> Sorry for my loose English writing, I'll try to be more precise this
> time.
>
> If this does not work, I'll switch the language.
>
>
>
> Matthew Dowle wrote
>> I like the lack of spaces, but what do the * mean? In other words,
>> you've presented
>> a line of code :
>> DT[y>=3&*v<=7&w<=7*,sum(y), by=x]
>> but that doesn't actually evaluate to anything, does it? So that's
>> pseudo-code. I don't even need to copy and paste that into R to know
>> it's invalid. That cannot possible give the expected result,
>> because of
>> the "*" characters.
>
> I don't have these asteriks in my message, but this part of the code
> was
> highlighted in bold as *here*.
>
> I've run all code and copy-pasted in the post.
>
>
>
> Matthew Dowle wrote
>> Might you be looking for something like :
>>
>> sapply(.SD, `<`, 7)
>>
>> ? Dunno. Guessing.
>
> I've tried the sapply suggestion but the result is a logical matrix,
> not
> what I am looking for:
>
> DT3 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6), u =
> 2:10,
> v = 1:9, w = 3:11)
>
> DT3[y>=3, sapply(.SD, `<`, 7),]
>
> x y u v w
> [1,] FALSE TRUE TRUE TRUE TRUE
> [2,] FALSE TRUE TRUE TRUE TRUE
> [3,] FALSE TRUE TRUE TRUE FALSE
> [4,] FALSE TRUE FALSE TRUE FALSE
> [5,] FALSE TRUE FALSE FALSE FALSE
> [6,] FALSE TRUE FALSE FALSE FALSE
>
>
> Matthew Dowle wrote
>> But, focussing on this part of your email :
>>
>>> But if the number of columns grows, I can't specify all columns
>>> anymore,
>>> maybe should I use column names?
>>
>> You actually do, really, honestly, need to show us, physically, in
>> email, what you mean. Columns
>> of what? Growing how? Show us 2,3,4,5 columns. Show us the manual
>> way.
>> Show us the input and
>> show us the output.
>>
>> Your email can be very long. It can contain very little English.
>> But
>> you
>> actually need to show what the output is you would like, for me (at
>> least)
>> to understand.
>>
>> What I am certain of is that whatever you want to do is possible.
>> And
>> if it isn't, then
>> we will likely enhance data.table to do it.
>
> I see the verb "grow" is misleading here, I meant "if the number of
> columns
> is high".
>
> I want this code for a data.table with 1000s of rows and 10s - 100s
> of
> "other numeric columns".
>
> Find below 2 examples: DT3 and DT5 (input, subset, output).
>
>
> #Example data table (DT3) with 1 character column (x), 1 numeric
> column of
> interest (y) and 3 other numeric columns (u, v, w) - input
>
> DT3 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6), u =
> 2:10,
> v = 1:9, w = 3:11)
>
> DT3
>
> x y u v w
> 1: a 1 2 1 3
> 2: a 3 3 2 4
> 3: a 6 4 3 5
> 4: b 1 5 4 6
> 5: b 3 6 5 7
> 6: b 6 7 6 8
> 7: c 1 8 7 9
> 8: c 3 9 8 10
> 9: c 6 10 9 11
>
> # What I want to subset from DT3:
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, ] # but this way, I need to write
> all
> column names
>
> x y u v w
> 1: a 3 3 2 4
> 2: a 6 4 3 5
> 3: b 3 6 5 7
>
> # Sum(y) calculation in this subset from DT3 - output
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, sum(y), by=x]
>
> x V1
> 1: a 9
> 2: b 3
>
> # Mean(y) calculation in this subset from DT3 - output
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, mean(y), by=x]
>
> x V1
> 1: a 4.5
> 2: b 3.0
>
> # Max(y) in this subset from DT3 - output
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, max(y), by=x]
>
> x V1
> 1: a 6
> 2: b 3
>
>
> #Another example data table (DT5) with 1 character column (x), 1
> numeric
> column of interest (y) and 5 other numeric columns (s, t, u, v, w) -
> input
>
> DT5 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6), s =
> 4:12, t
> = 0:8, u = 2:10, v = 1:9, w = 3:11)
>
> DT5
>
> x y s t u v w
> 1: a 1 4 0 2 1 3
> 2: a 3 5 1 3 2 4
> 3: a 6 6 2 4 3 5
> 4: b 1 7 3 5 4 6
> 5: b 3 8 4 6 5 7
> 6: b 6 9 5 7 6 8
> 7: c 1 10 6 8 7 9
> 8: c 3 11 7 9 8 10
> 9: c 6 12 8 10 9 11
>
> # Same subset from DT5 as used in DT3:
>
> DT5[y>=3 & u <=7 & v <=7 & w <=7,]
>
> x y s t u v w
> 1: a 3 5 1 3 2 4
> 2: a 6 6 2 4 3 5
> 3: b 3 8 4 6 5 7
>
> # Now filtering by s and t columns as well, excludes "3: b 3 8 4 6 5
> 7" as
> s=8.
>
> DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7,] ## it works, but
> inconvenient for 10s-100s of columns
>
> x y s t u v w
> 1: a 3 5 1 3 2 4
> 2: a 6 6 2 4 3 5
>
> # Doing a sum(y) calculation over this subset from DT5 - output
>
> DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7, sum(y), by=x]
>
> x V1
> 1: a 9
>
> # Doing a mean(y) calculation over this subset from DT5 - output
>
> DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7, mean(y), by=x]
>
> x V1
> 1: a 4.5
>
> All the outputs shown here match the expectations, but it's not
> useful for
> 10s-100s of columns.
>
> I hope the desired output is clear now; ignore my past attempts to
> avoid
> confusion.
>
> Thanks again!
>
> B
>
>
>
> --
> View this message in context:
>
> http://r.789695.n4.nabble.com/Subsetting-columns-in-data-table-tp4649736p4649841.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
More information about the datatable-help
mailing list