[datatable-help] Data table syntax

Tue Sep 7 11:15:01 CEST 2010

Thanks Matthew and David,

The joining to the previous score, in the method you suggested, is really nice and, I think, efficient.

I had tried what David initially proposed as well as "flattening" the file in a manner you suggested to an earlier problem I had:

dtxt[,as.list(SCORE[match(2008, YEAR)]), by=SCHOOL_NUMBER]

I like your solution because it keeps the original "long" file intact and just adds on the relevant data right where it belongs.
The file I'm working with has about 6 million records so getting something that works, at least, relatively efficiently was a priority.

Best regards,

Damian

Damian Betebenner
Center for Assessment
PO Box 351
Dover, NH   03821-0351

Phone (office): (603) 516-7900
Phone (cell): (857) 234-2474
Fax: (603) 516-7910

dbetebenner at nciea.org
www.nciea.org

-----Original Message-----
From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
Sent: Monday, September 06, 2010 7:09 PM
To: Damian Betebenner
Cc: dwinsemius at comcast.net; datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Data table syntax

Damian,

I was also struggling with the example data but I think
I get it now. All these children moved from one school
to another then; none stayed at the same school. Also,
in the desired result :

   YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
   2009  200    54.4
   2009  400    53

it seems the 54.4 should be 48.2 as David obtained :

   SCHOOL_NUMBER   V1
             100 48.2
             300 53.0

If I understand that correctly so far then ...

It's usually good to include time as the last column
of the key. I pasted David's code (thanks) to create
the data then did :

> setkey(dtxt, STUDENT_ID, YEAR)
> dtxt
      STUDENT_ID SCHOOL_NUMBER YEAR SCORE
 [1,]          1           100 2008    39
 [2,]          1           200 2009    48
 [3,]          2           100 2008    64
 [4,]          2           200 2009    73
 [5,]          3           100 2008    35
 [6,]          3           200 2009    35
 [7,]          4           100 2008    52
 [8,]          4           200 2009    61
 [9,]          5           100 2008    51
[10,]          5           200 2009    58
[11,]          6           300 2008    45
[12,]          6           400 2009    55
[13,]          7           300 2008    69
[14,]          7           400 2009    77
[15,]          8           300 2008    47
[16,]          8           400 2009    47
[17,]          9           300 2008    57
[18,]          9           400 2009    58
[19,]         10           300 2008    47
[20,]         10           400 2009    53

Self join each student to their previous year :

> dtxt[J(STUDENT_ID,YEAR-1)]
      STUDENT_ID SCHOOL_NUMBER YEAR SCORE
 [1,]         NA            NA   NA    NA
 [2,]          1           100 2008    39
 [3,]         NA            NA   NA    NA
 [4,]          2           100 2008    64
 [5,]         NA            NA   NA    NA
 [6,]          3           100 2008    35
 [7,]         NA            NA   NA    NA
 [8,]          4           100 2008    52
 [9,]         NA            NA   NA    NA
[10,]          5           100 2008    51
[11,]         NA            NA   NA    NA
[12,]          6           300 2008    45
[13,]         NA            NA   NA    NA
[14,]          7           300 2008    69
[15,]         NA            NA   NA    NA
[16,]          8           300 2008    47
[17,]         NA            NA   NA    NA
[18,]          9           300 2008    57
[19,]         NA            NA   NA    NA
[20,]         10           300 2008    47

Pick out the prior year's score and add it as column :

> dtxt$PRIOR_SCORE = dtxt[J(STUDENT_ID,YEAR-1),SCORE]
> dtxt
      STUDENT_ID SCHOOL_NUMBER YEAR SCORE PRIOR_SCORE
 [1,]          1           100 2008    39          NA
 [2,]          1           200 2009    48          39
 [3,]          2           100 2008    64          NA
 [4,]          2           200 2009    73          64
 [5,]          3           100 2008    35          NA
 [6,]          3           200 2009    35          35
 [7,]          4           100 2008    52          NA
 [8,]          4           200 2009    61          52
 [9,]          5           100 2008    51          NA
[10,]          5           200 2009    58          51
[11,]          6           300 2008    45          NA
[12,]          6           400 2009    55          45
[13,]          7           300 2008    69          NA
[14,]          7           400 2009    77          69
[15,]          8           300 2008    47          NA
[16,]          8           400 2009    47          47
[17,]          9           300 2008    57          NA
[18,]          9           400 2009    58          57
[19,]         10           300 2008    47          NA
[20,]         10           400 2009    53          47

> dtxt[,mean(PRIOR_SCORE),by=list(SCHOOL_NUMBER,YEAR)]
     SCHOOL_NUMBER YEAR   V1
[1,]           100 2008   NA
[2,]           200 2009 48.2
[3,]           300 2008   NA
[4,]           400 2009 53.0

or the 'by without by' direct way :

> setkey(dtxt,SCHOOL_NUMBER,YEAR)
> dtxt[J(c(200,400),2009),mean(PRIOR_SCORE),mult="all"]
     SCHOOL_NUMBER YEAR   V1
[1,]           200 2009 48.2
[2,]           400 2009 53.0

Hope that helps?
Matthew

On Sun, 2010-09-05 at 19:12 -0400, David Winsemius wrote:
> On Sep 5, 2010, at 6:58 PM, Damian Betebenner wrote:
>
> > David,
> >
> > Thanks!!! Your previous example, which I'm just playing with now
> > seems to work. Given that the file I'm working with has, usually,
> > over a million cases, I want to make sure it works on something
> > smaller before setting it to work on the bigger table. I'll try
> > optimizing later.
>
> The problem with my first solution is that there is no method to
> associate or test for the prior or next school attended. The second
> method should allow looking at the cross-tabulation of 2008 and 2009
> schools if you constructed a more realistic test table and also used a
> by= argument that had both years in the list().
>
> --
> David.
>
>
> >
> > Thanks a TON,
> >
> > Damian
> >
> > Damian Betebenner
> > Center for Assessment
> > PO Box 351
> > Dover, NH   03821-0351
> >
> > Phone (office): (603) 516-7900
> > Phone (cell): (857) 234-2474
> > Fax: (603) 516-7910
> >
> > dbetebenner at nciea.org
> > www.nciea.org
> >
> >
> >
> >
> > -----Original Message-----
> > From: David Winsemius [mailto:dwinsemius at comcast.net]
> > Sent: Sunday, September 05, 2010 6:56 PM
> > To: datatable-help at lists.r-forge.r-project.org
> > Cc: Damian Betebenner; David Winsemius
> > Subject: Re: [datatable-help] Data table syntax
> >
> > I went back and reviewed the merge.data.frame method and then
> > (re-)reviewed the data.table documents. I was disappointed to find the
> > there were relatively few worked examples of data.table merges, but
> > finally got something that seems to make sense:
> >
> > dftxt <- as.data.frame(dtxt)
> > # The dataframe method
> >> merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009,
> > c(1,2,4)], by="STUDENT_ID")
> >    STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y
> > 1           1             100      39             200      48
> > 2           2             100      64             200      73
> > 3           3             100      35             200      35
> > 4           4             100      52             200      61
> > 5           5             100      51             200      58
> > 6           6             300      45             400      55
> > 7           7             300      69             400      77
> > 8           8             300      47             400      47
> > 9           9             300      57             400      58
> > 10         10             300      47             400      53
> >
> > #The datatable method:
> >
> >> setkey(dtxt) <- "STUDENT_ID"   #as far as I can tell this is needed
> > to repalce the by= argument in merge.data.frame()
> >
> > #( I tried several failed efforts at incorporating the key= statement
> > within the data.table calls.)
> >
> >> merge(dtxt[YEAR==2008,  ], dtxt[YEAR==2009,  ])
> >       STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1
> > SCORE.1
> >  [1,]          1           100 2008    39             200   2009
> > 48
> >  [2,]          2           100 2008    64             200   2009
> > 73
> >  [3,]          3           100 2008    35             200   2009
> > 35
> >  [4,]          4           100 2008    52             200   2009
> > 61
> >  [5,]          5           100 2008    51             200   2009
> > 58
> >  [6,]          6           300 2008    45             400   2009
> > 55
> >  [7,]          7           300 2008    69             400   2009
> > 77
> >  [8,]          8           300 2008    47             400   2009
> > 47
> >  [9,]          9           300 2008    57             400   2009
> > 58
> > [10,]         10           300 2008    47             400
> > 2009      53
> >
> >
> >> mtxt <- merge(dtxt[YEAR==2008,  ], dtxt[YEAR==2009,  ])
> >
> >> mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)]
> >      X2009_SCH   V1
> > [1,]       200 48.2
> > [2,]       400 53.0
> >
> >
> > On Sep 5, 2010, at 4:43 PM, David Winsemius wrote:
> >
> >>
> >> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote:
> >>
> >>> Hi David,
> >>>
> >>> Thanks for the quick and thoughtful reply.
> >>>
> >>> Sorry for not being clearer.
> >>>
> >>> There are two schools that the students attended in 2009 (200 and
> >>> 400). I'd like to break on those, and calculate the mean for all
> >>> the students in those two schools but for their 2008 scores.
> >>>
> >>> Thus, the output would have 2 rows:
> >>>
> >>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
> >>>
> >>> 2009  200    54.4
> >>> 2009  400    53
> >>
> >> I dont see what sort of linkage you have between the 2008 and 2009
> >> school-numbers but see if this satisfies:
> >>
> >>> dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID],
> >> mean(SCORE), by=SCHOOL_NUMBER]
> >>    SCHOOL_NUMBER   V1
> >> [1,]           100 48.2
> >> [2,]           300 53.0
> >>
> >>>
> >>> Thanks for considering this,
> >>>
> >>> Damian
> >>>
> >>>
> >>> Damian Betebenner
> >>> Center for Assessment
> >>> PO Box 351
> >>> Dover, NH   03821-0351
> >>>
> >>> Phone (office): (603) 516-7900
> >>> Phone (cell): (857) 234-2474
> >>> Fax: (603) 516-7910
> >>>
> >>> dbetebenner at nciea.org
> >>> www.nciea.org
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: David Winsemius [mailto:dwinsemius at comcast.net]
> >>> Sent: Sunday, September 05, 2010 1:03 PM
> >>> To: David Winsemius
> >>> Cc: Damian Betebenner; datatable-help at lists.r-forge.r-project.org
> >>> Subject: Re: [datatable-help] Data table syntax
> >>>
> >>>
> >>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote:
> >>>
> >>>>
> >>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote:
> >>>>
> >>>>> Thanks for the invaluable help on my previous questions. The speed
> >>>>> up in create summary tables has been immense and I'm enthused
> >>>>> about
> >>>>> all the possibilities going forward.
> >>>>>
> >>>>> I'm currently stuck in trying to put together syntax for a "long"
> >>>>> for table. In the example below, each case is a unique Student by
> >>>>> Year combination. What I'm trying to do is take
> >>>>> such a table, aggregate on the student's  current year (i.e., 2009
> >>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the
> >>>>> previous year (i.e., 2008 in this data).
> >>>>>
> >>>>> If the file were "wide", with each case representing a unique
> >>>>> student with separate variables for the year, then it would be
> >>>>> easy
> >>>>> to break on the 2009 SCHOOL_NUMBER and take the
> >>>>> mean of the 2008 SCORE.
> >>>
> >>> But there is only one 2008 SCORE for each student???
> >>>
> >>>>>
> >>>>> Is conversion of long to wide necessary to do this?
> >>>>
> >>>> Probably not. Are you familiar with the "ave" function in base R?
> >>>
> >>> I am having some difficulty understanding the structure of the
> >>> desired
> >>> output. I initially thought it might be something like:
> >>> rd.txt <-
> >>> function(txt, header=TRUE, ...) {
> >>>    rd <- read.table(textConnection(txt), header=header, ...)
> >>>      closeAllConnections()
> >>>    rd }
> >>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE
> >>>        1           100 2008    39
> >>>        1           200 2009    48
> >>>        2           100 2008    64
> >>>        2           200 2009    73
> >>>        3           100 2008    35
> >>>        3           200 2009    35
> >>>        4           100 2008    52
> >>>        4           200 2009    61
> >>>        5           100 2008    51
> >>>        5           200 2009    58
> >>>        6           300 2008    45
> >>>        6           400 2009    55
> >>>        7           300 2008    69
> >>>        7           400 2009    77
> >>>        8           300 2008    47
> >>>        8           400 2009    47
> >>>        9           300 2008    57
> >>>        9           400 2009    58
> >>>       10           300 2008    47
> >>>       10           400 2009    53")
> >>> dtxt <- data.table(txt)
> >>>
> >>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))]  # returns a
> >>> vector as long as its input
> >>>> dtxt
> >>>
> >>> But now I am wondering if you wanted:
> >>>
> >>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)]  # returns vector
> >>> only as long as product of category levels.
> >>>  1    2    3    4    5    6    7    8    9   10
> >>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>>   STUDENT_ID SCHOOL_NUMBER YEAR SCORE
> >>>>> [1,]          1           100 2008    39
> >>>>> [2,]          1           200 2009    48
> >>>>> [3,]          2           100 2008    64
> >>>>> [4,]          2           200 2009    73
> >>>>> [5,]          3           100 2008    35
> >>>>> [6,]          3           200 2009    35
> >>>>> [7,]          4           100 2008    52
> >>>>> [8,]          4           200 2009    61
> >>>>> [9,]          5           100 2008    51
> >>>>> [10,]          5           200 2009    58
> >>>>> [11,]          6           300 2008    45
> >>>>> [12,]          6           400 2009    55
> >>>>> [13,]          7           300 2008    69
> >>>>> [14,]          7           400 2009    77
> >>>>> [15,]          8           300 2008    47
> >>>>> [16,]          8           400 2009    47
> >>>>> [17,]          9           300 2008    57
> >>>>> [18,]          9           400 2009    58
> >>>>> [19,]         10           300 2008    47
> >>>>> [20,]         10           400 2009    53
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Damian
> >>>>
> >>>
> >>>
> >>> David Winsemius, MD
> >>> West Hartford, CT
> >>>
> >>
> >> David Winsemius, MD
> >> West Hartford, CT
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
>
> David Winsemius, MD
> West Hartford, CT
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help