[datatable-help] Subsetting columns in data.table

Matthew Dowle mdowle at mdowle.plus.com
Mon Nov 19 23:06:51 CET 2012


Hi Berto,

Thanks. I think I get it now. How about this :

DT5 = data.table(x=rep(c("a","b","c"), each=3), y=c(1,3,6), s=4:12, 
t=0:8, u=2:10, v=1:9, w=3:11)
othercols = setdiff(names(DT5),c("x","y"))
DT5[y<=3 & Reduce(`&`,lapply(DT5[,othercols,with=FALSE],`<=`,7)), 
max(y), by=x]
    x V1
1: a  3
2: b  1

HTH
Matthew


On 17.11.2012 07:47, Berto wrote:
> Matthew Dowle wrote
>> Hi Berto,
>>
>> I think we may be experiencing a language barrier here. This is
>> datatable-help; i.e.,
>> not r-help. You *can* write in another language on this list, if 
>> you'd
>> like to, in
>> case someone else here understands better. The rules are less strict
>> here. Nobody has yet
>> done so, but there is no rule against it. Why not?
>>
>> Proceeding in English for now ...
>
> Sorry for my loose English writing, I'll try to be more precise this 
> time.
>
> If this does not work, I'll switch the language.
>
>
>
> Matthew Dowle wrote
>> I like the lack of spaces, but what do the * mean?  In other words,
>> you've presented
>> a line of code :
>>      DT[y>=3&*v<=7&w<=7*,sum(y), by=x]
>> but that doesn't actually evaluate to anything, does it? So that's
>> pseudo-code. I don't even need to copy and paste that into R to know
>> it's invalid.  That cannot possible give the expected result, 
>> because of
>> the "*" characters.
>
> I don't have these asteriks in my message, but this part of the code 
> was
> highlighted in bold as *here*.
>
> I've run all code and copy-pasted in the post.
>
>
>
> Matthew Dowle wrote
>> Might you be looking for something like :
>>
>>      sapply(.SD, `<`, 7)
>>
>> ?   Dunno. Guessing.
>
> I've tried the sapply suggestion but the result is a logical matrix, 
> not
> what I am looking for:
>
> DT3 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6),  u = 
> 2:10,
> v = 1:9, w = 3:11)
>
> DT3[y>=3, sapply(.SD, `<`, 7),]
>
>          x    y     u     v     w
> [1,] FALSE TRUE  TRUE  TRUE  TRUE
> [2,] FALSE TRUE  TRUE  TRUE  TRUE
> [3,] FALSE TRUE  TRUE  TRUE FALSE
> [4,] FALSE TRUE FALSE  TRUE FALSE
> [5,] FALSE TRUE FALSE FALSE FALSE
> [6,] FALSE TRUE FALSE FALSE FALSE
>
>
> Matthew Dowle wrote
>> But, focussing on this part of your email :
>>
>>> But if the number of columns grows, I can't specify all columns
>>> anymore,
>>> maybe should I use column names?
>>
>> You actually do, really, honestly, need to show us, physically, in
>> email, what you mean. Columns
>> of what? Growing how? Show us 2,3,4,5 columns. Show us the manual 
>> way.
>> Show us the input and
>> show us the output.
>>
>> Your email can be very long. It can contain very little English.  
>> But
>> you
>> actually need to show what the output is you would like, for me (at
>> least)
>> to understand.
>>
>> What I am certain of is that whatever you want to do is possible. 
>> And
>> if it isn't, then
>> we will likely enhance data.table to do it.
>
> I see the verb "grow" is misleading here, I meant "if the number of 
> columns
> is high".
>
> I want this code for a data.table with 1000s of rows and 10s - 100s 
> of
> "other numeric columns".
>
> Find below 2 examples: DT3 and DT5 (input, subset, output).
>
>
> #Example data table (DT3) with 1 character column (x), 1 numeric 
> column of
> interest (y) and 3 other numeric columns (u, v, w) - input
>
> DT3 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6),  u = 
> 2:10,
> v = 1:9, w = 3:11)
>
> DT3
>
>    x y  u v  w
> 1: a 1  2 1  3
> 2: a 3  3 2  4
> 3: a 6  4 3  5
> 4: b 1  5 4  6
> 5: b 3  6 5  7
> 6: b 6  7 6  8
> 7: c 1  8 7  9
> 8: c 3  9 8 10
> 9: c 6 10 9 11
>
> # What I want to subset from DT3:
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, ]  # but this way, I need to write 
> all
> column names
>
> x y u v w
> 1: a 3 3 2 4
> 2: a 6 4 3 5
> 3: b 3 6 5 7
>
> # Sum(y) calculation in this subset from DT3 - output
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, sum(y), by=x]
>
>    x V1
> 1: a  9
> 2: b  3
>
> # Mean(y) calculation in this subset from DT3 - output
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, mean(y), by=x]
>
>    x  V1
> 1: a 4.5
> 2: b 3.0
>
> # Max(y) in this subset from DT3 - output
>
> DT3[y>=3 & u <=7 & v <=7 & w <=7, max(y), by=x]
>
>    x V1
> 1: a  6
> 2: b  3
>
>
> #Another example data table (DT5) with 1 character column (x), 1 
> numeric
> column of interest (y) and 5 other numeric columns (s, t, u, v, w) - 
> input
>
> DT5 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6), s = 
> 4:12, t
> = 0:8, u = 2:10, v = 1:9, w = 3:11)
>
> DT5
>
>    x y  s t  u v  w
> 1: a 1  4 0  2 1  3
> 2: a 3  5 1  3 2  4
> 3: a 6  6 2  4 3  5
> 4: b 1  7 3  5 4  6
> 5: b 3  8 4  6 5  7
> 6: b 6  9 5  7 6  8
> 7: c 1 10 6  8 7  9
> 8: c 3 11 7  9 8 10
> 9: c 6 12 8 10 9 11
>
> # Same subset from DT5 as used in DT3:
>
> DT5[y>=3 & u <=7 & v <=7 & w <=7,]
>
>    x y s t u v w
> 1: a 3 5 1 3 2 4
> 2: a 6 6 2 4 3 5
> 3: b 3 8 4 6 5 7
>
> # Now filtering by s and t columns as well, excludes "3: b 3 8 4 6 5 
> 7" as
> s=8.
>
> DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7,]  ## it works, but
> inconvenient for 10s-100s of columns
>
>   x y s t u v w
> 1: a 3 5 1 3 2 4
> 2: a 6 6 2 4 3 5
>
> # Doing a sum(y) calculation over this subset from DT5 - output
>
> DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7, sum(y), by=x]
>
>    x V1
> 1: a  9
>
> # Doing a mean(y) calculation over this subset from DT5 - output
>
> DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7, mean(y), by=x]
>
>    x  V1
> 1: a 4.5
>
> All the outputs shown here match the expectations, but it's not 
> useful for
> 10s-100s of columns.
>
> I hope the desired output is clear now; ignore my past attempts to 
> avoid
> confusion.
>
> Thanks again!
>
> B
>
>
>
> --
> View this message in context:
> 
> http://r.789695.n4.nabble.com/Subsetting-columns-in-data-table-tp4649736p4649841.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help



More information about the datatable-help mailing list