[datatable-help] Data table syntax

Matthew Dowle mdowle at mdowle.plus.com
Wed Sep 8 00:01:16 CEST 2010


In terms of efficiency I should add that

  dtxt[SJ(STUDENT_ID,YEAR-1)]

should be faster than what I used :

  dtxt[J(STUDENT_ID,YEAR-1)]

The SJ() makes the i data.table key'd. When 2 key'd data.tables are
joined the binary search doesn't start from scratch for each row of i.
It sweeps through both tables side by side. If you have a long list of
1,000 words to look up in the English dictionary you would sort them
first and look through the dictionary starting at the beginning looking
forward for the next word you know comes afterwards. That's what joining
2 key'd data.tables is doing. You might be lucky and several words could
be on the same page so you can find a few words at a time before turning
the page. When the i data.table isn't key'd the rows could be any order
so it has to start a new binary search for each and every row of i.

This difference is probably most noticeable in self joins, where the
entire table is joined to itself, as we have here.

Matthew


On Tue, 2010-09-07 at 04:15 -0500, Damian Betebenner wrote:
> Thanks Matthew and David,
> 
> The joining to the previous score, in the method you suggested, is really nice and, I think, efficient.
> 
> I had tried what David initially proposed as well as "flattening" the file in a manner you suggested to an earlier problem I had:
> 
> dtxt[,as.list(SCORE[match(2008, YEAR)]), by=SCHOOL_NUMBER]
> 
> I like your solution because it keeps the original "long" file intact and just adds on the relevant data right where it belongs.
> The file I'm working with has about 6 million records so getting something that works, at least, relatively efficiently was a priority.
> 
> Best regards,
> 
> Damian
> 
> Damian Betebenner
> Center for Assessment
> PO Box 351
> Dover, NH   03821-0351
> 
> Phone (office): (603) 516-7900
> Phone (cell): (857) 234-2474
> Fax: (603) 516-7910
> 
> dbetebenner at nciea.org
> www.nciea.org
> 
> 
> 
> 
> -----Original Message-----
> From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
> Sent: Monday, September 06, 2010 7:09 PM
> To: Damian Betebenner
> Cc: dwinsemius at comcast.net; datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Data table syntax
> 
> Damian,
> 
> I was also struggling with the example data but I think
> I get it now. All these children moved from one school
> to another then; none stayed at the same school. Also,
> in the desired result :
> 
>    YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
>    2009  200    54.4
>    2009  400    53
> 
> it seems the 54.4 should be 48.2 as David obtained :
> 
>    SCHOOL_NUMBER   V1
>              100 48.2
>              300 53.0
> 
> If I understand that correctly so far then ...
> 
> It's usually good to include time as the last column
> of the key. I pasted David's code (thanks) to create
> the data then did :
> 
> > setkey(dtxt, STUDENT_ID, YEAR)
> > dtxt
>       STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>  [1,]          1           100 2008    39
>  [2,]          1           200 2009    48
>  [3,]          2           100 2008    64
>  [4,]          2           200 2009    73
>  [5,]          3           100 2008    35
>  [6,]          3           200 2009    35
>  [7,]          4           100 2008    52
>  [8,]          4           200 2009    61
>  [9,]          5           100 2008    51
> [10,]          5           200 2009    58
> [11,]          6           300 2008    45
> [12,]          6           400 2009    55
> [13,]          7           300 2008    69
> [14,]          7           400 2009    77
> [15,]          8           300 2008    47
> [16,]          8           400 2009    47
> [17,]          9           300 2008    57
> [18,]          9           400 2009    58
> [19,]         10           300 2008    47
> [20,]         10           400 2009    53
> 
> Self join each student to their previous year :
> 
> > dtxt[J(STUDENT_ID,YEAR-1)]
>       STUDENT_ID SCHOOL_NUMBER YEAR SCORE
>  [1,]         NA            NA   NA    NA
>  [2,]          1           100 2008    39
>  [3,]         NA            NA   NA    NA
>  [4,]          2           100 2008    64
>  [5,]         NA            NA   NA    NA
>  [6,]          3           100 2008    35
>  [7,]         NA            NA   NA    NA
>  [8,]          4           100 2008    52
>  [9,]         NA            NA   NA    NA
> [10,]          5           100 2008    51
> [11,]         NA            NA   NA    NA
> [12,]          6           300 2008    45
> [13,]         NA            NA   NA    NA
> [14,]          7           300 2008    69
> [15,]         NA            NA   NA    NA
> [16,]          8           300 2008    47
> [17,]         NA            NA   NA    NA
> [18,]          9           300 2008    57
> [19,]         NA            NA   NA    NA
> [20,]         10           300 2008    47
> 
> Pick out the prior year's score and add it as column :
> 
> > dtxt$PRIOR_SCORE = dtxt[J(STUDENT_ID,YEAR-1),SCORE]
> > dtxt
>       STUDENT_ID SCHOOL_NUMBER YEAR SCORE PRIOR_SCORE
>  [1,]          1           100 2008    39          NA
>  [2,]          1           200 2009    48          39
>  [3,]          2           100 2008    64          NA
>  [4,]          2           200 2009    73          64
>  [5,]          3           100 2008    35          NA
>  [6,]          3           200 2009    35          35
>  [7,]          4           100 2008    52          NA
>  [8,]          4           200 2009    61          52
>  [9,]          5           100 2008    51          NA
> [10,]          5           200 2009    58          51
> [11,]          6           300 2008    45          NA
> [12,]          6           400 2009    55          45
> [13,]          7           300 2008    69          NA
> [14,]          7           400 2009    77          69
> [15,]          8           300 2008    47          NA
> [16,]          8           400 2009    47          47
> [17,]          9           300 2008    57          NA
> [18,]          9           400 2009    58          57
> [19,]         10           300 2008    47          NA
> [20,]         10           400 2009    53          47
> 
> > dtxt[,mean(PRIOR_SCORE),by=list(SCHOOL_NUMBER,YEAR)]
>      SCHOOL_NUMBER YEAR   V1
> [1,]           100 2008   NA
> [2,]           200 2009 48.2
> [3,]           300 2008   NA
> [4,]           400 2009 53.0
> 
> or the 'by without by' direct way :
> 
> > setkey(dtxt,SCHOOL_NUMBER,YEAR)
> > dtxt[J(c(200,400),2009),mean(PRIOR_SCORE),mult="all"]
>      SCHOOL_NUMBER YEAR   V1
> [1,]           200 2009 48.2
> [2,]           400 2009 53.0
> 
> Hope that helps?
> Matthew
> 
> 
> On Sun, 2010-09-05 at 19:12 -0400, David Winsemius wrote:
> > On Sep 5, 2010, at 6:58 PM, Damian Betebenner wrote:
> >
> > > David,
> > >
> > > Thanks!!! Your previous example, which I'm just playing with now
> > > seems to work. Given that the file I'm working with has, usually,
> > > over a million cases, I want to make sure it works on something
> > > smaller before setting it to work on the bigger table. I'll try
> > > optimizing later.
> >
> > The problem with my first solution is that there is no method to
> > associate or test for the prior or next school attended. The second
> > method should allow looking at the cross-tabulation of 2008 and 2009
> > schools if you constructed a more realistic test table and also used a
> > by= argument that had both years in the list().
> >
> > --
> > David.
> >
> >
> > >
> > > Thanks a TON,
> > >
> > > Damian
> > >
> > > Damian Betebenner
> > > Center for Assessment
> > > PO Box 351
> > > Dover, NH   03821-0351
> > >
> > > Phone (office): (603) 516-7900
> > > Phone (cell): (857) 234-2474
> > > Fax: (603) 516-7910
> > >
> > > dbetebenner at nciea.org
> > > www.nciea.org
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: David Winsemius [mailto:dwinsemius at comcast.net]
> > > Sent: Sunday, September 05, 2010 6:56 PM
> > > To: datatable-help at lists.r-forge.r-project.org
> > > Cc: Damian Betebenner; David Winsemius
> > > Subject: Re: [datatable-help] Data table syntax
> > >
> > > I went back and reviewed the merge.data.frame method and then
> > > (re-)reviewed the data.table documents. I was disappointed to find the
> > > there were relatively few worked examples of data.table merges, but
> > > finally got something that seems to make sense:
> > >
> > > dftxt <- as.data.frame(dtxt)
> > > # The dataframe method
> > >> merge(dftxt[dftxt$YEAR==2008, c(1,2,4)], dftxt[dftxt$YEAR==2009,
> > > c(1,2,4)], by="STUDENT_ID")
> > >    STUDENT_ID SCHOOL_NUMBER.x SCORE.x SCHOOL_NUMBER.y SCORE.y
> > > 1           1             100      39             200      48
> > > 2           2             100      64             200      73
> > > 3           3             100      35             200      35
> > > 4           4             100      52             200      61
> > > 5           5             100      51             200      58
> > > 6           6             300      45             400      55
> > > 7           7             300      69             400      77
> > > 8           8             300      47             400      47
> > > 9           9             300      57             400      58
> > > 10         10             300      47             400      53
> > >
> > > #The datatable method:
> > >
> > >> setkey(dtxt) <- "STUDENT_ID"   #as far as I can tell this is needed
> > > to repalce the by= argument in merge.data.frame()
> > >
> > > #( I tried several failed efforts at incorporating the key= statement
> > > within the data.table calls.)
> > >
> > >> merge(dtxt[YEAR==2008,  ], dtxt[YEAR==2009,  ])
> > >       STUDENT_ID SCHOOL_NUMBER YEAR SCORE SCHOOL_NUMBER.1 YEAR.1
> > > SCORE.1
> > >  [1,]          1           100 2008    39             200   2009
> > > 48
> > >  [2,]          2           100 2008    64             200   2009
> > > 73
> > >  [3,]          3           100 2008    35             200   2009
> > > 35
> > >  [4,]          4           100 2008    52             200   2009
> > > 61
> > >  [5,]          5           100 2008    51             200   2009
> > > 58
> > >  [6,]          6           300 2008    45             400   2009
> > > 55
> > >  [7,]          7           300 2008    69             400   2009
> > > 77
> > >  [8,]          8           300 2008    47             400   2009
> > > 47
> > >  [9,]          9           300 2008    57             400   2009
> > > 58
> > > [10,]         10           300 2008    47             400
> > > 2009      53
> > >
> > >
> > >> mtxt <- merge(dtxt[YEAR==2008,  ], dtxt[YEAR==2009,  ])
> > >
> > >> mtxt[, mean(SCORE) , by=list("2009_SCH"=SCHOOL_NUMBER.1)]
> > >      X2009_SCH   V1
> > > [1,]       200 48.2
> > > [2,]       400 53.0
> > >
> > >
> > > On Sep 5, 2010, at 4:43 PM, David Winsemius wrote:
> > >
> > >>
> > >> On Sep 5, 2010, at 3:02 PM, Damian Betebenner wrote:
> > >>
> > >>> Hi David,
> > >>>
> > >>> Thanks for the quick and thoughtful reply.
> > >>>
> > >>> Sorry for not being clearer.
> > >>>
> > >>> There are two schools that the students attended in 2009 (200 and
> > >>> 400). I'd like to break on those, and calculate the mean for all
> > >>> the students in those two schools but for their 2008 scores.
> > >>>
> > >>> Thus, the output would have 2 rows:
> > >>>
> > >>> YEAR, 2009_SCHOOL_NUMBER, 2008_SCORE_MEAN
> > >>>
> > >>> 2009  200    54.4
> > >>> 2009  400    53
> > >>
> > >> I dont see what sort of linkage you have between the 2008 and 2009
> > >> school-numbers but see if this satisfies:
> > >>
> > >>> dtxt[YEAR==2008 & STUDENT_ID %in% dtxt[YEAR==2009, STUDENT_ID],
> > >> mean(SCORE), by=SCHOOL_NUMBER]
> > >>    SCHOOL_NUMBER   V1
> > >> [1,]           100 48.2
> > >> [2,]           300 53.0
> > >>
> > >>>
> > >>> Thanks for considering this,
> > >>>
> > >>> Damian
> > >>>
> > >>>
> > >>> Damian Betebenner
> > >>> Center for Assessment
> > >>> PO Box 351
> > >>> Dover, NH   03821-0351
> > >>>
> > >>> Phone (office): (603) 516-7900
> > >>> Phone (cell): (857) 234-2474
> > >>> Fax: (603) 516-7910
> > >>>
> > >>> dbetebenner at nciea.org
> > >>> www.nciea.org
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> -----Original Message-----
> > >>> From: David Winsemius [mailto:dwinsemius at comcast.net]
> > >>> Sent: Sunday, September 05, 2010 1:03 PM
> > >>> To: David Winsemius
> > >>> Cc: Damian Betebenner; datatable-help at lists.r-forge.r-project.org
> > >>> Subject: Re: [datatable-help] Data table syntax
> > >>>
> > >>>
> > >>> On Sep 5, 2010, at 12:43 PM, David Winsemius wrote:
> > >>>
> > >>>>
> > >>>> On Sep 5, 2010, at 11:38 AM, Damian Betebenner wrote:
> > >>>>
> > >>>>> Thanks for the invaluable help on my previous questions. The speed
> > >>>>> up in create summary tables has been immense and I'm enthused
> > >>>>> about
> > >>>>> all the possibilities going forward.
> > >>>>>
> > >>>>> I'm currently stuck in trying to put together syntax for a "long"
> > >>>>> for table. In the example below, each case is a unique Student by
> > >>>>> Year combination. What I'm trying to do is take
> > >>>>> such a table, aggregate on the student's  current year (i.e., 2009
> > >>>>> in this data) SCHOOL_NUMBER, and calculate their mean score in the
> > >>>>> previous year (i.e., 2008 in this data).
> > >>>>>
> > >>>>> If the file were "wide", with each case representing a unique
> > >>>>> student with separate variables for the year, then it would be
> > >>>>> easy
> > >>>>> to break on the 2009 SCHOOL_NUMBER and take the
> > >>>>> mean of the 2008 SCORE.
> > >>>
> > >>> But there is only one 2008 SCORE for each student???
> > >>>
> > >>>>>
> > >>>>> Is conversion of long to wide necessary to do this?
> > >>>>
> > >>>> Probably not. Are you familiar with the "ave" function in base R?
> > >>>
> > >>> I am having some difficulty understanding the structure of the
> > >>> desired
> > >>> output. I initially thought it might be something like:
> > >>> rd.txt <-
> > >>> function(txt, header=TRUE, ...) {
> > >>>    rd <- read.table(textConnection(txt), header=header, ...)
> > >>>      closeAllConnections()
> > >>>    rd }
> > >>> txt <- rd.txt("STUDENT_ID SCHOOL_NUMBER YEAR SCORE
> > >>>        1           100 2008    39
> > >>>        1           200 2009    48
> > >>>        2           100 2008    64
> > >>>        2           200 2009    73
> > >>>        3           100 2008    35
> > >>>        3           200 2009    35
> > >>>        4           100 2008    52
> > >>>        4           200 2009    61
> > >>>        5           100 2008    51
> > >>>        5           200 2009    58
> > >>>        6           300 2008    45
> > >>>        6           400 2009    55
> > >>>        7           300 2008    69
> > >>>        7           400 2009    77
> > >>>        8           300 2008    47
> > >>>        8           400 2009    47
> > >>>        9           300 2008    57
> > >>>        9           400 2009    58
> > >>>       10           300 2008    47
> > >>>       10           400 2009    53")
> > >>> dtxt <- data.table(txt)
> > >>>
> > >>>> dtxt$avScr <- dtxt[ , ave(SCORE, list(STUDENT_ID))]  # returns a
> > >>> vector as long as its input
> > >>>> dtxt
> > >>>
> > >>> But now I am wondering if you wanted:
> > >>>
> > >>>> dtxt[ , tapply(SCORE, list(STUDENT_ID), mean)]  # returns vector
> > >>> only as long as product of category levels.
> > >>>  1    2    3    4    5    6    7    8    9   10
> > >>> 43.5 68.5 35.0 56.5 54.5 50.0 73.0 47.0 57.5 50.0
> > >>>
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>>   STUDENT_ID SCHOOL_NUMBER YEAR SCORE
> > >>>>> [1,]          1           100 2008    39
> > >>>>> [2,]          1           200 2009    48
> > >>>>> [3,]          2           100 2008    64
> > >>>>> [4,]          2           200 2009    73
> > >>>>> [5,]          3           100 2008    35
> > >>>>> [6,]          3           200 2009    35
> > >>>>> [7,]          4           100 2008    52
> > >>>>> [8,]          4           200 2009    61
> > >>>>> [9,]          5           100 2008    51
> > >>>>> [10,]          5           200 2009    58
> > >>>>> [11,]          6           300 2008    45
> > >>>>> [12,]          6           400 2009    55
> > >>>>> [13,]          7           300 2008    69
> > >>>>> [14,]          7           400 2009    77
> > >>>>> [15,]          8           300 2008    47
> > >>>>> [16,]          8           400 2009    47
> > >>>>> [17,]          9           300 2008    57
> > >>>>> [18,]          9           400 2009    58
> > >>>>> [19,]         10           300 2008    47
> > >>>>> [20,]         10           400 2009    53
> > >>>>>
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Damian
> > >>>>
> > >>>
> > >>>
> > >>> David Winsemius, MD
> > >>> West Hartford, CT
> > >>>
> > >>
> > >> David Winsemius, MD
> > >> West Hartford, CT
> > >>
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > > David Winsemius, MD
> > > West Hartford, CT
> > >
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 
> 
> 




More information about the datatable-help mailing list