[datatable-help] Subsetting columns in data.table

Berto cubalibro at gmail.com
Sat Nov 17 08:47:33 CET 2012


Matthew Dowle wrote
> Hi Berto,
> 
> I think we may be experiencing a language barrier here. This is 
> datatable-help; i.e.,
> not r-help. You *can* write in another language on this list, if you'd 
> like to, in
> case someone else here understands better. The rules are less strict 
> here. Nobody has yet
> done so, but there is no rule against it. Why not?
> 
> Proceeding in English for now ...

Sorry for my loose English writing, I'll try to be more precise this time.

If this does not work, I'll switch the language.



Matthew Dowle wrote
> I like the lack of spaces, but what do the * mean?  In other words, 
> you've presented
> a line of code :
>      DT[y>=3&*v<=7&w<=7*,sum(y), by=x]
> but that doesn't actually evaluate to anything, does it? So that's 
> pseudo-code. I don't even need to copy and paste that into R to know 
> it's invalid.  That cannot possible give the expected result, because of 
> the "*" characters.

I don't have these asteriks in my message, but this part of the code was
highlighted in bold as *here*.

I've run all code and copy-pasted in the post.



Matthew Dowle wrote
> Might you be looking for something like :
> 
>      sapply(.SD, `<`, 7)
> 
> ?   Dunno. Guessing.

I've tried the sapply suggestion but the result is a logical matrix, not
what I am looking for:

DT3 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6),  u = 2:10,
v = 1:9, w = 3:11)

DT3[y>=3, sapply(.SD, `<`, 7),] 

         x    y     u     v     w
[1,] FALSE TRUE  TRUE  TRUE  TRUE
[2,] FALSE TRUE  TRUE  TRUE  TRUE
[3,] FALSE TRUE  TRUE  TRUE FALSE
[4,] FALSE TRUE FALSE  TRUE FALSE
[5,] FALSE TRUE FALSE FALSE FALSE
[6,] FALSE TRUE FALSE FALSE FALSE


Matthew Dowle wrote
> But, focussing on this part of your email :
> 
>> But if the number of columns grows, I can't specify all columns 
>> anymore,
>> maybe should I use column names?
> 
> You actually do, really, honestly, need to show us, physically, in 
> email, what you mean. Columns
> of what? Growing how? Show us 2,3,4,5 columns. Show us the manual way. 
> Show us the input and
> show us the output.
> 
> Your email can be very long. It can contain very little English.  But 
> you
> actually need to show what the output is you would like, for me (at 
> least)
> to understand.
> 
> What I am certain of is that whatever you want to do is possible. And 
> if it isn't, then
> we will likely enhance data.table to do it.

I see the verb "grow" is misleading here, I meant "if the number of columns
is high".

I want this code for a data.table with 1000s of rows and 10s - 100s of
"other numeric columns".

Find below 2 examples: DT3 and DT5 (input, subset, output).


#Example data table (DT3) with 1 character column (x), 1 numeric column of
interest (y) and 3 other numeric columns (u, v, w) - input

DT3 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6),  u = 2:10,
v = 1:9, w = 3:11)

DT3

   x y  u v  w
1: a 1  2 1  3
2: a 3  3 2  4
3: a 6  4 3  5
4: b 1  5 4  6
5: b 3  6 5  7
6: b 6  7 6  8
7: c 1  8 7  9
8: c 3  9 8 10
9: c 6 10 9 11

# What I want to subset from DT3:

DT3[y>=3 & u <=7 & v <=7 & w <=7, ]  # but this way, I need to write all
column names 

x y u v w
1: a 3 3 2 4
2: a 6 4 3 5
3: b 3 6 5 7

# Sum(y) calculation in this subset from DT3 - output

DT3[y>=3 & u <=7 & v <=7 & w <=7, sum(y), by=x]

   x V1
1: a  9
2: b  3

# Mean(y) calculation in this subset from DT3 - output

DT3[y>=3 & u <=7 & v <=7 & w <=7, mean(y), by=x]

   x  V1
1: a 4.5
2: b 3.0

# Max(y) in this subset from DT3 - output

DT3[y>=3 & u <=7 & v <=7 & w <=7, max(y), by=x]

   x V1
1: a  6
2: b  3


#Another example data table (DT5) with 1 character column (x), 1 numeric
column of interest (y) and 5 other numeric columns (s, t, u, v, w) - input

DT5 = data.table( x=rep ( c("a","b","c"), each=3), y = c(1,3,6), s = 4:12, t
= 0:8, u = 2:10, v = 1:9, w = 3:11)

DT5

   x y  s t  u v  w
1: a 1  4 0  2 1  3
2: a 3  5 1  3 2  4
3: a 6  6 2  4 3  5
4: b 1  7 3  5 4  6
5: b 3  8 4  6 5  7
6: b 6  9 5  7 6  8
7: c 1 10 6  8 7  9
8: c 3 11 7  9 8 10
9: c 6 12 8 10 9 11

# Same subset from DT5 as used in DT3:

DT5[y>=3 & u <=7 & v <=7 & w <=7,]

   x y s t u v w
1: a 3 5 1 3 2 4
2: a 6 6 2 4 3 5
3: b 3 8 4 6 5 7

# Now filtering by s and t columns as well, excludes "3: b 3 8 4 6 5 7" as
s=8.

DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7,]  ## it works, but
inconvenient for 10s-100s of columns
 
  x y s t u v w
1: a 3 5 1 3 2 4
2: a 6 6 2 4 3 5 

# Doing a sum(y) calculation over this subset from DT5 - output

DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7, sum(y), by=x]

   x V1
1: a  9

# Doing a mean(y) calculation over this subset from DT5 - output

DT5[y>=3 & s <=7 & t <=7 & u <=7 & v <=7 & w <=7, mean(y), by=x]

   x  V1
1: a 4.5

All the outputs shown here match the expectations, but it's not useful for
10s-100s of columns.

I hope the desired output is clear now; ignore my past attempts to avoid
confusion.

Thanks again!

B



--
View this message in context: http://r.789695.n4.nabble.com/Subsetting-columns-in-data-table-tp4649736p4649841.html
Sent from the datatable-help mailing list archive at Nabble.com.


More information about the datatable-help mailing list