[datatable-help] Data table syntax
Damian Betebenner
dbetebenner at nciea.org
Mon Sep 6 00:58:06 CEST 2010
David,
Thanks!!! Your previous example, which I'm just playing with now seems to work. Given that the file I'm working with has, usually, over a million cases, I want to make sure it works on something smaller before setting it to work on the bigger table. I'll try optimizing later.
Thanks a TON,
Damian
Damian Betebenner
Center for Assessment
PO Box 351
Dover, NH 03821-0351
Phone (office): (603) 516-7900
Phone (cell): (857) 234-2474
Fax: (603) 516-7910
dbetebenner at nciea.org
www.nciea.org
-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net]
Sent: Sunday, September 05, 2010 6:56 PM
To: datatable-help at lists.r-forge.r-project.org
Cc: Damian Betebenner; David Winsemius
Subject: Re: [datatable-help] Data table syntax
I went back and reviewed the merge.data.frame method and then
(re-)reviewed the data.table documents. I was disappointed to find the
there were relatively few worked examples of data.table merges, but
finally got something that seems to make sense:
dftxt <- as.data.frame(dtxt)
# The dataframe method
> merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009,
c(1,2,4)], by="STUDENT_ID")
STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y
1 1 100 39 200 48
2 2 100 64 200 73
3 3 100 35 200 35
4 4 100 52 200 61
5 5 100 51 200 58
6 6 300 45 400 55
7 7 300 69 400 77
8 8 300 47 400 47
9 9 300 57 400 58
10 10 300 47 400 53
#The datatable method:
> setkey(dtxt) <- "STUDENT_ID" #as far as I can tell this is needed
to repalce the by= argument in merge.data.frame()
#( I tried several failed efforts at incorporating the key= statement
within the data.table calls.)
> merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ])
STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1
SCORE.1
[1,] 1 100 2008 39 200 2009
48
[2,] 2 100 2008 64 200 2009
73
[3,] 3 100 2008 35 200 2009
35
[4,] 4 100 2008 52 200 2009
61
[5,] 5 100 2008 51 200 2009
58
[6,] 6 300 2008 45 400 2009
55
[7,] 7 300 2008 69 400 2009
77
[8,] 8 300 2008 47 400 2009
47
[9,] 9 300 2008 57 400 2009
58
[10,] 10 300 2008 47 400 2009 53
> mtxt <- merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ])
> mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)]
X2009_SCH V1
[1,] 200 48.2
[2,] 400 53.0
On Sep 5, 2010, at 4:43 PM, David Winsemius wrote:
>
> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote:
>
>> Hi David,
>>
>> Thanks for the quick and thoughtful reply.
>>
>> Sorry for not being clearer.
>>
>> There are two schools that the students attended in 2009 (200 and
>> 400). I'd like to break on those, and calculate the mean for all
>> the students in those two schools but for their 2008 scores.
>>
>> Thus, the output would have 2 rows:
>>
>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
>>
>> 2009 200 54.4
>> 2009 400 53
>
> I dont see what sort of linkage you have between the 2008 and 2009
> school-numbers but see if this satisfies:
>
> > dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID],
> mean(SCORE), by=SCHOOL_NUMBER]
> SCHOOL_NUMBER V1
> [1,] 100 48.2
> [2,] 300 53.0
>
>>
>> Thanks for considering this,
>>
>> Damian
>>
>>
>> Damian Betebenner
>> Center for Assessment
>> PO Box 351
>> Dover, NH 03821-0351
>>
>> Phone (office): (603) 516-7900
>> Phone (cell): (857) 234-2474
>> Fax: (603) 516-7910
>>
>> dbetebenner at nciea.org
>> www.nciea.org
>>
>>
>>
>>
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Sunday, September 05, 2010 1:03 PM
>> To: David Winsemius
>> Cc: Damian Betebenner; datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Data table syntax
>>
>>
>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote:
>>
>>>
>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote:
>>>
>>>> Thanks for the invaluable help on my previous questions. The speed
>>>> up in create summary tables has been immense and I'm enthused about
>>>> all the possibilities going forward.
>>>>
>>>> I'm currently stuck in trying to put together syntax for a "long"
>>>> for table. In the example below, each case is a unique Student by
>>>> Year combination. What I'm trying to do is take
>>>> such a table, aggregate on the student's current year (i.e., 2009
>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the
>>>> previous year (i.e., 2008 in this data).
>>>>
>>>> If the file were "wide", with each case representing a unique
>>>> student with separate variables for the year, then it would be easy
>>>> to break on the 2009 SCHOOL_NUMBER and take the
>>>> mean of the 2008 SCORE.
>>
>> But there is only one 2008 SCORE for each student???
>>
>>>>
>>>> Is conversion of long to wide necessary to do this?
>>>
>>> Probably not. Are you familiar with the "ave" function in base R?
>>
>> I am having some difficulty understanding the structure of the
>> desired
>> output. I initially thought it might be something like:
>> rd.txt <-
>> function(txt, header=TRUE, ...) {
>> rd <- read.table(textConnection(txt), header=header, ...)
>> closeAllConnections()
>> rd }
>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>> 1 100 2008 39
>> 1 200 2009 48
>> 2 100 2008 64
>> 2 200 2009 73
>> 3 100 2008 35
>> 3 200 2009 35
>> 4 100 2008 52
>> 4 200 2009 61
>> 5 100 2008 51
>> 5 200 2009 58
>> 6 300 2008 45
>> 6 400 2009 55
>> 7 300 2008 69
>> 7 400 2009 77
>> 8 300 2008 47
>> 8 400 2009 47
>> 9 300 2008 57
>> 9 400 2009 58
>> 10 300 2008 47
>> 10 400 2009 53")
>> dtxt <- data.table(txt)
>>
>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))] # returns a
>> vector as long as its input
>>> dtxt
>>
>> But now I am wondering if you wanted:
>>
>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)] # returns vector
>> only as long as product of category levels.
>> 1 2 3 4 5 6 7 8 9 10
>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0
>>
>>>
>>>>
>>>>
>>>> STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>>>> [1,] 1 100 2008 39
>>>> [2,] 1 200 2009 48
>>>> [3,] 2 100 2008 64
>>>> [4,] 2 200 2009 73
>>>> [5,] 3 100 2008 35
>>>> [6,] 3 200 2009 35
>>>> [7,] 4 100 2008 52
>>>> [8,] 4 200 2009 61
>>>> [9,] 5 100 2008 51
>>>> [10,] 5 200 2009 58
>>>> [11,] 6 300 2008 45
>>>> [12,] 6 400 2009 55
>>>> [13,] 7 300 2008 69
>>>> [14,] 7 400 2009 77
>>>> [15,] 8 300 2008 47
>>>> [16,] 8 400 2009 47
>>>> [17,] 9 300 2008 57
>>>> [18,] 9 400 2009 58
>>>> [19,] 10 300 2008 47
>>>> [20,] 10 400 2009 53
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Damian
>>>
>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>
> David Winsemius, MD
> West Hartford, CT
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
David Winsemius, MD
West Hartford, CT
More information about the datatable-help
mailing list