[datatable-help] Data table syntax
David Winsemius
dwinsemius at comcast.net
Mon Sep 6 01:12:02 CEST 2010
On Sep 5, 2010, at 6:58 PM, Damian Betebenner wrote:
> David,
>
> Thanks!!! Your previous example, which I'm just playing with now
> seems to work. Given that the file I'm working with has, usually,
> over a million cases, I want to make sure it works on something
> smaller before setting it to work on the bigger table. I'll try
> optimizing later.
The problem with my first solution is that there is no method to
associate or test for the prior or next school attended. The second
method should allow looking at the cross-tabulation of 2008 and 2009
schools if you constructed a more realistic test table and also used a
by= argument that had both years in the list().
--
David.
>
> Thanks a TON,
>
> Damian
>
> Damian Betebenner
> Center for Assessment
> PO Box 351
> Dover, NH 03821-0351
>
> Phone (office): (603) 516-7900
> Phone (cell): (857) 234-2474
> Fax: (603) 516-7910
>
> dbetebenner at nciea.org
> www.nciea.org
>
>
>
>
> -----Original Message-----
> From: David Winsemius [mailto:dwinsemius at comcast.net]
> Sent: Sunday, September 05, 2010 6:56 PM
> To: datatable-help at lists.r-forge.r-project.org
> Cc: Damian Betebenner; David Winsemius
> Subject: Re: [datatable-help] Data table syntax
>
> I went back and reviewed the merge.data.frame method and then
> (re-)reviewed the data.table documents. I was disappointed to find the
> there were relatively few worked examples of data.table merges, but
> finally got something that seems to make sense:
>
> dftxt <- as.data.frame(dtxt)
> # The dataframe method
>> merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009,
> c(1,2,4)], by="STUDENT_ID")
> STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y
> 1 1 100 39 200 48
> 2 2 100 64 200 73
> 3 3 100 35 200 35
> 4 4 100 52 200 61
> 5 5 100 51 200 58
> 6 6 300 45 400 55
> 7 7 300 69 400 77
> 8 8 300 47 400 47
> 9 9 300 57 400 58
> 10 10 300 47 400 53
>
> #The datatable method:
>
>> setkey(dtxt) <- "STUDENT_ID" #as far as I can tell this is needed
> to repalce the by= argument in merge.data.frame()
>
> #( I tried several failed efforts at incorporating the key= statement
> within the data.table calls.)
>
>> merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ])
> STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1
> SCORE.1
> [1,] 1 100 2008 39 200 2009
> 48
> [2,] 2 100 2008 64 200 2009
> 73
> [3,] 3 100 2008 35 200 2009
> 35
> [4,] 4 100 2008 52 200 2009
> 61
> [5,] 5 100 2008 51 200 2009
> 58
> [6,] 6 300 2008 45 400 2009
> 55
> [7,] 7 300 2008 69 400 2009
> 77
> [8,] 8 300 2008 47 400 2009
> 47
> [9,] 9 300 2008 57 400 2009
> 58
> [10,] 10 300 2008 47 400
> 2009 53
>
>
>> mtxt <- merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ])
>
>> mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)]
> X2009_SCH V1
> [1,] 200 48.2
> [2,] 400 53.0
>
>
> On Sep 5, 2010, at 4:43 PM, David Winsemius wrote:
>
>>
>> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote:
>>
>>> Hi David,
>>>
>>> Thanks for the quick and thoughtful reply.
>>>
>>> Sorry for not being clearer.
>>>
>>> There are two schools that the students attended in 2009 (200 and
>>> 400). I'd like to break on those, and calculate the mean for all
>>> the students in those two schools but for their 2008 scores.
>>>
>>> Thus, the output would have 2 rows:
>>>
>>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
>>>
>>> 2009 200 54.4
>>> 2009 400 53
>>
>> I dont see what sort of linkage you have between the 2008 and 2009
>> school-numbers but see if this satisfies:
>>
>>> dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID],
>> mean(SCORE), by=SCHOOL_NUMBER]
>> SCHOOL_NUMBER V1
>> [1,] 100 48.2
>> [2,] 300 53.0
>>
>>>
>>> Thanks for considering this,
>>>
>>> Damian
>>>
>>>
>>> Damian Betebenner
>>> Center for Assessment
>>> PO Box 351
>>> Dover, NH 03821-0351
>>>
>>> Phone (office): (603) 516-7900
>>> Phone (cell): (857) 234-2474
>>> Fax: (603) 516-7910
>>>
>>> dbetebenner at nciea.org
>>> www.nciea.org
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>>> Sent: Sunday, September 05, 2010 1:03 PM
>>> To: David Winsemius
>>> Cc: Damian Betebenner; datatable-help at lists.r-forge.r-project.org
>>> Subject: Re: [datatable-help] Data table syntax
>>>
>>>
>>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote:
>>>
>>>>
>>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote:
>>>>
>>>>> Thanks for the invaluable help on my previous questions. The speed
>>>>> up in create summary tables has been immense and I'm enthused
>>>>> about
>>>>> all the possibilities going forward.
>>>>>
>>>>> I'm currently stuck in trying to put together syntax for a "long"
>>>>> for table. In the example below, each case is a unique Student by
>>>>> Year combination. What I'm trying to do is take
>>>>> such a table, aggregate on the student's current year (i.e., 2009
>>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the
>>>>> previous year (i.e., 2008 in this data).
>>>>>
>>>>> If the file were "wide", with each case representing a unique
>>>>> student with separate variables for the year, then it would be
>>>>> easy
>>>>> to break on the 2009 SCHOOL_NUMBER and take the
>>>>> mean of the 2008 SCORE.
>>>
>>> But there is only one 2008 SCORE for each student???
>>>
>>>>>
>>>>> Is conversion of long to wide necessary to do this?
>>>>
>>>> Probably not. Are you familiar with the "ave" function in base R?
>>>
>>> I am having some difficulty understanding the structure of the
>>> desired
>>> output. I initially thought it might be something like:
>>> rd.txt <-
>>> function(txt, header=TRUE, ...) {
>>> rd <- read.table(textConnection(txt), header=header, ...)
>>> closeAllConnections()
>>> rd }
>>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>>> 1 100 2008 39
>>> 1 200 2009 48
>>> 2 100 2008 64
>>> 2 200 2009 73
>>> 3 100 2008 35
>>> 3 200 2009 35
>>> 4 100 2008 52
>>> 4 200 2009 61
>>> 5 100 2008 51
>>> 5 200 2009 58
>>> 6 300 2008 45
>>> 6 400 2009 55
>>> 7 300 2008 69
>>> 7 400 2009 77
>>> 8 300 2008 47
>>> 8 400 2009 47
>>> 9 300 2008 57
>>> 9 400 2009 58
>>> 10 300 2008 47
>>> 10 400 2009 53")
>>> dtxt <- data.table(txt)
>>>
>>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))] # returns a
>>> vector as long as its input
>>>> dtxt
>>>
>>> But now I am wondering if you wanted:
>>>
>>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)] # returns vector
>>> only as long as product of category levels.
>>> 1 2 3 4 5 6 7 8 9 10
>>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0
>>>
>>>>
>>>>>
>>>>>
>>>>> STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>>>>> [1,] 1 100 2008 39
>>>>> [2,] 1 200 2009 48
>>>>> [3,] 2 100 2008 64
>>>>> [4,] 2 200 2009 73
>>>>> [5,] 3 100 2008 35
>>>>> [6,] 3 200 2009 35
>>>>> [7,] 4 100 2008 52
>>>>> [8,] 4 200 2009 61
>>>>> [9,] 5 100 2008 51
>>>>> [10,] 5 200 2009 58
>>>>> [11,] 6 300 2008 45
>>>>> [12,] 6 400 2009 55
>>>>> [13,] 7 300 2008 69
>>>>> [14,] 7 400 2009 77
>>>>> [15,] 8 300 2008 47
>>>>> [16,] 8 400 2009 47
>>>>> [17,] 9 300 2008 57
>>>>> [18,] 9 400 2009 58
>>>>> [19,] 10 300 2008 47
>>>>> [20,] 10 400 2009 53
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Damian
>>>>
>>>
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
> David Winsemius, MD
> West Hartford, CT
>
David Winsemius, MD
West Hartford, CT
More information about the datatable-help
mailing list