[datatable-help] Data table syntax
Damian Betebenner
dbetebenner at nciea.org
Tue Sep 7 11:15:01 CEST 2010
Thanks Matthew and David,
The joining to the previous score, in the method you suggested, is really nice and, I think, efficient.
I had tried what David initially proposed as well as "flattening" the file in a manner you suggested to an earlier problem I had:
dtxt[,as.list(SCORE[match(2008, YEAR)]), by=SCHOOL_NUMBER]
I like your solution because it keeps the original "long" file intact and just adds on the relevant data right where it belongs.
The file I'm working with has about 6 million records so getting something that works, at least, relatively efficiently was a priority.
Best regards,
Damian
Damian Betebenner
Center for Assessment
PO Box 351
Dover, NH 03821-0351
Phone (office): (603) 516-7900
Phone (cell): (857) 234-2474
Fax: (603) 516-7910
dbetebenner at nciea.org
www.nciea.org
-----Original Message-----
From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
Sent: Monday, September 06, 2010 7:09 PM
To: Damian Betebenner
Cc: dwinsemius at comcast.net; datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Data table syntax
Damian,
I was also struggling with the example data but I think
I get it now. All these children moved from one school
to another then; none stayed at the same school. Also,
in the desired result :
YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
2009 200 54.4
2009 400 53
it seems the 54.4 should be 48.2 as David obtained :
SCHOOL_NUMBER V1
100 48.2
300 53.0
If I understand that correctly so far then ...
It's usually good to include time as the last column
of the key. I pasted David's code (thanks) to create
the data then did :
> setkey(dtxt, STUDENT_ID, YEAR)
> dtxt
STUDENT_ID SCHOOL_NUMBER YEAR SCORE
[1,] 1 100 2008 39
[2,] 1 200 2009 48
[3,] 2 100 2008 64
[4,] 2 200 2009 73
[5,] 3 100 2008 35
[6,] 3 200 2009 35
[7,] 4 100 2008 52
[8,] 4 200 2009 61
[9,] 5 100 2008 51
[10,] 5 200 2009 58
[11,] 6 300 2008 45
[12,] 6 400 2009 55
[13,] 7 300 2008 69
[14,] 7 400 2009 77
[15,] 8 300 2008 47
[16,] 8 400 2009 47
[17,] 9 300 2008 57
[18,] 9 400 2009 58
[19,] 10 300 2008 47
[20,] 10 400 2009 53
Self join each student to their previous year :
> dtxt[J(STUDENT_ID,YEAR-1)]
STUDENT_ID SCHOOL_NUMBER YEAR SCORE
[1,] NA NA NA NA
[2,] 1 100 2008 39
[3,] NA NA NA NA
[4,] 2 100 2008 64
[5,] NA NA NA NA
[6,] 3 100 2008 35
[7,] NA NA NA NA
[8,] 4 100 2008 52
[9,] NA NA NA NA
[10,] 5 100 2008 51
[11,] NA NA NA NA
[12,] 6 300 2008 45
[13,] NA NA NA NA
[14,] 7 300 2008 69
[15,] NA NA NA NA
[16,] 8 300 2008 47
[17,] NA NA NA NA
[18,] 9 300 2008 57
[19,] NA NA NA NA
[20,] 10 300 2008 47
Pick out the prior year's score and add it as column :
> dtxt$PRIOR_SCORE = dtxt[J(STUDENT_ID,YEAR-1),SCORE]
> dtxt
STUDENT_ID SCHOOL_NUMBER YEAR SCORE PRIOR_SCORE
[1,] 1 100 2008 39 NA
[2,] 1 200 2009 48 39
[3,] 2 100 2008 64 NA
[4,] 2 200 2009 73 64
[5,] 3 100 2008 35 NA
[6,] 3 200 2009 35 35
[7,] 4 100 2008 52 NA
[8,] 4 200 2009 61 52
[9,] 5 100 2008 51 NA
[10,] 5 200 2009 58 51
[11,] 6 300 2008 45 NA
[12,] 6 400 2009 55 45
[13,] 7 300 2008 69 NA
[14,] 7 400 2009 77 69
[15,] 8 300 2008 47 NA
[16,] 8 400 2009 47 47
[17,] 9 300 2008 57 NA
[18,] 9 400 2009 58 57
[19,] 10 300 2008 47 NA
[20,] 10 400 2009 53 47
> dtxt[,mean(PRIOR_SCORE),by=list(SCHOOL_NUMBER,YEAR)]
SCHOOL_NUMBER YEAR V1
[1,] 100 2008 NA
[2,] 200 2009 48.2
[3,] 300 2008 NA
[4,] 400 2009 53.0
or the 'by without by' direct way :
> setkey(dtxt,SCHOOL_NUMBER,YEAR)
> dtxt[J(c(200,400),2009),mean(PRIOR_SCORE),mult="all"]
SCHOOL_NUMBER YEAR V1
[1,] 200 2009 48.2
[2,] 400 2009 53.0
Hope that helps?
Matthew
On Sun, 2010-09-05 at 19:12 -0400, David Winsemius wrote:
> On Sep 5, 2010, at 6:58 PM, Damian Betebenner wrote:
>
> > David,
> >
> > Thanks!!! Your previous example, which I'm just playing with now
> > seems to work. Given that the file I'm working with has, usually,
> > over a million cases, I want to make sure it works on something
> > smaller before setting it to work on the bigger table. I'll try
> > optimizing later.
>
> The problem with my first solution is that there is no method to
> associate or test for the prior or next school attended. The second
> method should allow looking at the cross-tabulation of 2008 and 2009
> schools if you constructed a more realistic test table and also used a
> by= argument that had both years in the list().
>
> --
> David.
>
>
> >
> > Thanks a TON,
> >
> > Damian
> >
> > Damian Betebenner
> > Center for Assessment
> > PO Box 351
> > Dover, NH 03821-0351
> >
> > Phone (office): (603) 516-7900
> > Phone (cell): (857) 234-2474
> > Fax: (603) 516-7910
> >
> > dbetebenner at nciea.org
> > www.nciea.org
> >
> >
> >
> >
> > -----Original Message-----
> > From: David Winsemius [mailto:dwinsemius at comcast.net]
> > Sent: Sunday, September 05, 2010 6:56 PM
> > To: datatable-help at lists.r-forge.r-project.org
> > Cc: Damian Betebenner; David Winsemius
> > Subject: Re: [datatable-help] Data table syntax
> >
> > I went back and reviewed the merge.data.frame method and then
> > (re-)reviewed the data.table documents. I was disappointed to find the
> > there were relatively few worked examples of data.table merges, but
> > finally got something that seems to make sense:
> >
> > dftxt <- as.data.frame(dtxt)
> > # The dataframe method
> >> merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009,
> > c(1,2,4)], by="STUDENT_ID")
> > STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y
> > 1 1 100 39 200 48
> > 2 2 100 64 200 73
> > 3 3 100 35 200 35
> > 4 4 100 52 200 61
> > 5 5 100 51 200 58
> > 6 6 300 45 400 55
> > 7 7 300 69 400 77
> > 8 8 300 47 400 47
> > 9 9 300 57 400 58
> > 10 10 300 47 400 53
> >
> > #The datatable method:
> >
> >> setkey(dtxt) <- "STUDENT_ID" #as far as I can tell this is needed
> > to repalce the by= argument in merge.data.frame()
> >
> > #( I tried several failed efforts at incorporating the key= statement
> > within the data.table calls.)
> >
> >> merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ])
> > STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1
> > SCORE.1
> > [1,] 1 100 2008 39 200 2009
> > 48
> > [2,] 2 100 2008 64 200 2009
> > 73
> > [3,] 3 100 2008 35 200 2009
> > 35
> > [4,] 4 100 2008 52 200 2009
> > 61
> > [5,] 5 100 2008 51 200 2009
> > 58
> > [6,] 6 300 2008 45 400 2009
> > 55
> > [7,] 7 300 2008 69 400 2009
> > 77
> > [8,] 8 300 2008 47 400 2009
> > 47
> > [9,] 9 300 2008 57 400 2009
> > 58
> > [10,] 10 300 2008 47 400
> > 2009 53
> >
> >
> >> mtxt <- merge(dtxt[YEAR==2008, ], dtxt[YEAR==2009, ])
> >
> >> mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)]
> > X2009_SCH V1
> > [1,] 200 48.2
> > [2,] 400 53.0
> >
> >
> > On Sep 5, 2010, at 4:43 PM, David Winsemius wrote:
> >
> >>
> >> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote:
> >>
> >>> Hi David,
> >>>
> >>> Thanks for the quick and thoughtful reply.
> >>>
> >>> Sorry for not being clearer.
> >>>
> >>> There are two schools that the students attended in 2009 (200 and
> >>> 400). I'd like to break on those, and calculate the mean for all
> >>> the students in those two schools but for their 2008 scores.
> >>>
> >>> Thus, the output would have 2 rows:
> >>>
> >>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
> >>>
> >>> 2009 200 54.4
> >>> 2009 400 53
> >>
> >> I dont see what sort of linkage you have between the 2008 and 2009
> >> school-numbers but see if this satisfies:
> >>
> >>> dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID],
> >> mean(SCORE), by=SCHOOL_NUMBER]
> >> SCHOOL_NUMBER V1
> >> [1,] 100 48.2
> >> [2,] 300 53.0
> >>
> >>>
> >>> Thanks for considering this,
> >>>
> >>> Damian
> >>>
> >>>
> >>> Damian Betebenner
> >>> Center for Assessment
> >>> PO Box 351
> >>> Dover, NH 03821-0351
> >>>
> >>> Phone (office): (603) 516-7900
> >>> Phone (cell): (857) 234-2474
> >>> Fax: (603) 516-7910
> >>>
> >>> dbetebenner at nciea.org
> >>> www.nciea.org
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: David Winsemius [mailto:dwinsemius at comcast.net]
> >>> Sent: Sunday, September 05, 2010 1:03 PM
> >>> To: David Winsemius
> >>> Cc: Damian Betebenner; datatable-help at lists.r-forge.r-project.org
> >>> Subject: Re: [datatable-help] Data table syntax
> >>>
> >>>
> >>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote:
> >>>
> >>>>
> >>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote:
> >>>>
> >>>>> Thanks for the invaluable help on my previous questions. The speed
> >>>>> up in create summary tables has been immense and I'm enthused
> >>>>> about
> >>>>> all the possibilities going forward.
> >>>>>
> >>>>> I'm currently stuck in trying to put together syntax for a "long"
> >>>>> for table. In the example below, each case is a unique Student by
> >>>>> Year combination. What I'm trying to do is take
> >>>>> such a table, aggregate on the student's current year (i.e., 2009
> >>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the
> >>>>> previous year (i.e., 2008 in this data).
> >>>>>
> >>>>> If the file were "wide", with each case representing a unique
> >>>>> student with separate variables for the year, then it would be
> >>>>> easy
> >>>>> to break on the 2009 SCHOOL_NUMBER and take the
> >>>>> mean of the 2008 SCORE.
> >>>
> >>> But there is only one 2008 SCORE for each student???
> >>>
> >>>>>
> >>>>> Is conversion of long to wide necessary to do this?
> >>>>
> >>>> Probably not. Are you familiar with the "ave" function in base R?
> >>>
> >>> I am having some difficulty understanding the structure of the
> >>> desired
> >>> output. I initially thought it might be something like:
> >>> rd.txt <-
> >>> function(txt, header=TRUE, ...) {
> >>> rd <- read.table(textConnection(txt), header=header, ...)
> >>> closeAllConnections()
> >>> rd }
> >>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE
> >>> 1 100 2008 39
> >>> 1 200 2009 48
> >>> 2 100 2008 64
> >>> 2 200 2009 73
> >>> 3 100 2008 35
> >>> 3 200 2009 35
> >>> 4 100 2008 52
> >>> 4 200 2009 61
> >>> 5 100 2008 51
> >>> 5 200 2009 58
> >>> 6 300 2008 45
> >>> 6 400 2009 55
> >>> 7 300 2008 69
> >>> 7 400 2009 77
> >>> 8 300 2008 47
> >>> 8 400 2009 47
> >>> 9 300 2008 57
> >>> 9 400 2009 58
> >>> 10 300 2008 47
> >>> 10 400 2009 53")
> >>> dtxt <- data.table(txt)
> >>>
> >>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))] # returns a
> >>> vector as long as its input
> >>>> dtxt
> >>>
> >>> But now I am wondering if you wanted:
> >>>
> >>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)] # returns vector
> >>> only as long as product of category levels.
> >>> 1 2 3 4 5 6 7 8 9 10
> >>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>> STUDENT_ID SCHOOL_NUMBER YEAR SCORE
> >>>>> [1,] 1 100 2008 39
> >>>>> [2,] 1 200 2009 48
> >>>>> [3,] 2 100 2008 64
> >>>>> [4,] 2 200 2009 73
> >>>>> [5,] 3 100 2008 35
> >>>>> [6,] 3 200 2009 35
> >>>>> [7,] 4 100 2008 52
> >>>>> [8,] 4 200 2009 61
> >>>>> [9,] 5 100 2008 51
> >>>>> [10,] 5 200 2009 58
> >>>>> [11,] 6 300 2008 45
> >>>>> [12,] 6 400 2009 55
> >>>>> [13,] 7 300 2008 69
> >>>>> [14,] 7 400 2009 77
> >>>>> [15,] 8 300 2008 47
> >>>>> [16,] 8 400 2009 47
> >>>>> [17,] 9 300 2008 57
> >>>>> [18,] 9 400 2009 58
> >>>>> [19,] 10 300 2008 47
> >>>>> [20,] 10 400 2009 53
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Damian
> >>>>
> >>>
> >>>
> >>> David Winsemius, MD
> >>> West Hartford, CT
> >>>
> >>
> >> David Winsemius, MD
> >> West Hartford, CT
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
>
> David Winsemius, MD
> West Hartford, CT
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
More information about the datatable-help
mailing list