[datatable-help] Data table syntax

Mon Sep 6 00:58:06 CEST 2010

David,

Thanks!!! Your previous example, which I'm just playing with now seems to work. Given that the file I'm working with has, usually, over a million cases, I want to make sure it works on something smaller before setting it to work on the bigger table. I'll try optimizing later.

Thanks a TON,

Damian

Damian Betebenner
Center for Assessment
PO Box 351
Dover, NH   03821-0351

Phone (office): (603) 516-7900
Phone (cell): (857) 234-2474
Fax: (603) 516-7910

dbetebenner at nciea.org
www.nciea.org

-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net] 
Sent: Sunday, September 05, 2010 6:56 PM
To: datatable-help at lists.r-forge.r-project.org
Cc: Damian Betebenner; David Winsemius
Subject: Re: [datatable-help] Data table syntax

I went back and reviewed the merge.data.frame method and then  
(re-)reviewed the data.table documents. I was disappointed to find the  
there were relatively few worked examples of data.table merges, but  
finally got something that seems to make sense:

dftxt <- as.data.frame(dtxt)
# The dataframe method
 > merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009,  
c(1,2,4)], by="STUDENT_ID")
    STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y
1           1             100      39             200      48
2           2             100      64             200      73
3           3             100      35             200      35
4           4             100      52             200      61
5           5             100      51             200      58
6           6             300      45             400      55
7           7             300      69             400      77
8           8             300      47             400      47
9           9             300      57             400      58
10         10             300      47             400      53

#The datatable method:

 > setkey(dtxt) <- "STUDENT_ID"   #as far as I can tell this is needed  
to repalce the by= argument in merge.data.frame()

#( I tried several failed efforts at incorporating the key= statement  
within the data.table calls.)

 > merge(dtxt[YEAR==2008,  ], dtxt[YEAR==2009,  ])
       STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1  
SCORE.1
  [1,]          1           100 2008    39             200   2009       
48
  [2,]          2           100 2008    64             200   2009       
73
  [3,]          3           100 2008    35             200   2009       
35
  [4,]          4           100 2008    52             200   2009       
61
  [5,]          5           100 2008    51             200   2009       
58
  [6,]          6           300 2008    45             400   2009       
55
  [7,]          7           300 2008    69             400   2009       
77
  [8,]          8           300 2008    47             400   2009       
47
  [9,]          9           300 2008    57             400   2009       
58
[10,]         10           300 2008    47             400   2009      53

 > mtxt <- merge(dtxt[YEAR==2008,  ], dtxt[YEAR==2009,  ])

 > mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)]
      X2009_SCH   V1
[1,]       200 48.2
[2,]       400 53.0

On Sep 5, 2010, at 4:43 PM, David Winsemius wrote:

>
> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote:
>
>> Hi David,
>>
>> Thanks for the quick and thoughtful reply.
>>
>> Sorry for not being clearer.
>>
>> There are two schools that the students attended in 2009 (200 and  
>> 400). I'd like to break on those, and calculate the mean for all  
>> the students in those two schools but for their 2008 scores.
>>
>> Thus, the output would have 2 rows:
>>
>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
>>
>> 2009  200    54.4
>> 2009  400    53
>
> I dont see what sort of linkage you have between the 2008 and 2009  
> school-numbers but see if this satisfies:
>
> > dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID],  
> mean(SCORE), by=SCHOOL_NUMBER]
>     SCHOOL_NUMBER   V1
> [1,]           100 48.2
> [2,]           300 53.0
>
>>
>> Thanks for considering this,
>>
>> Damian
>>
>>
>> Damian Betebenner
>> Center for Assessment
>> PO Box 351
>> Dover, NH   03821-0351
>>
>> Phone (office): (603) 516-7900
>> Phone (cell): (857) 234-2474
>> Fax: (603) 516-7910
>>
>> dbetebenner at nciea.org
>> www.nciea.org
>>
>>
>>
>>
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Sunday, September 05, 2010 1:03 PM
>> To: David Winsemius
>> Cc: Damian Betebenner; datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Data table syntax
>>
>>
>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote:
>>
>>>
>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote:
>>>
>>>> Thanks for the invaluable help on my previous questions. The speed
>>>> up in create summary tables has been immense and I'm enthused about
>>>> all the possibilities going forward.
>>>>
>>>> I'm currently stuck in trying to put together syntax for a "long"
>>>> for table. In the example below, each case is a unique Student by
>>>> Year combination. What I'm trying to do is take
>>>> such a table, aggregate on the student's  current year (i.e., 2009
>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the
>>>> previous year (i.e., 2008 in this data).
>>>>
>>>> If the file were "wide", with each case representing a unique
>>>> student with separate variables for the year, then it would be easy
>>>> to break on the 2009 SCHOOL_NUMBER and take the
>>>> mean of the 2008 SCORE.
>>
>> But there is only one 2008 SCORE for each student???
>>
>>>>
>>>> Is conversion of long to wide necessary to do this?
>>>
>>> Probably not. Are you familiar with the "ave" function in base R?
>>
>> I am having some difficulty understanding the structure of the  
>> desired
>> output. I initially thought it might be something like:
>> rd.txt <-
>> function(txt, header=TRUE, ...) {
>>     rd <- read.table(textConnection(txt), header=header, ...)
>>       closeAllConnections()
>>     rd }
>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>>         1           100 2008    39
>>         1           200 2009    48
>>         2           100 2008    64
>>         2           200 2009    73
>>         3           100 2008    35
>>         3           200 2009    35
>>         4           100 2008    52
>>         4           200 2009    61
>>         5           100 2008    51
>>         5           200 2009    58
>>         6           300 2008    45
>>         6           400 2009    55
>>         7           300 2008    69
>>         7           400 2009    77
>>         8           300 2008    47
>>         8           400 2009    47
>>         9           300 2008    57
>>         9           400 2009    58
>>        10           300 2008    47
>>        10           400 2009    53")
>> dtxt <- data.table(txt)
>>
>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))]  # returns a
>> vector as long as its input
>>> dtxt
>>
>> But now I am wondering if you wanted:
>>
>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)]  # returns vector
>> only as long as product of category levels.
>>   1    2    3    4    5    6    7    8    9   10
>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0
>>
>>>
>>>>
>>>>
>>>>    STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>>>> [1,]          1           100 2008    39
>>>> [2,]          1           200 2009    48
>>>> [3,]          2           100 2008    64
>>>> [4,]          2           200 2009    73
>>>> [5,]          3           100 2008    35
>>>> [6,]          3           200 2009    35
>>>> [7,]          4           100 2008    52
>>>> [8,]          4           200 2009    61
>>>> [9,]          5           100 2008    51
>>>> [10,]          5           200 2009    58
>>>> [11,]          6           300 2008    45
>>>> [12,]          6           400 2009    55
>>>> [13,]          7           300 2008    69
>>>> [14,]          7           400 2009    77
>>>> [15,]          8           300 2008    47
>>>> [16,]          8           400 2009    47
>>>> [17,]          9           300 2008    57
>>>> [18,]          9           400 2009    58
>>>> [19,]         10           300 2008    47
>>>> [20,]         10           400 2009    53
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Damian
>>>
>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>
> David Winsemius, MD
> West Hartford, CT
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

David Winsemius, MD
West Hartford, CT