[datatable-help] Memory issue

Gene Leynes gleynes+r at gmail.com
Tue Oct 23 22:57:07 CEST 2012


Hello Matt,

Unfortunately I can't share the data.  And, as I found out today, even if I
could get permission to share part of the data, that would not reproduce
the problem.

Here's why I think that a sample of the data won't prove the point.

I started with two sets of data:
dat     : A data table with 3,103,314 rows and 43 columns
dat_1000: Same table, but with a new column to define 1000 equally sized
groups

*Whole File Results:*
If I set the key of dat to be column 1  and save it, the resulting file
size is  77.731 MB
If I set the key of dat to be column 9  and save it, the resulting file
size is 206.874 MB

In other words, the second file grows to 266% of the original size when
the key is set to column 9.  The 9th column is an ISO date-time value.
However, if I use column 10, which is just an integer, the results are
similar.  For column 10 the file is 196% of the original size.
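
As a quick check of the arithmetic, the percentage is just the ratio of the
two saved sizes.  This is only a sketch using the two figures quoted above;
the variable names are made up for illustration.

```r
# Sizes in MB, copied from the whole-file results above
size_key1 <- 77.731   # key set to column 1
size_key9 <- 206.874  # key set to column 9

# Size of the column-9 file as a percentage of the column-1 file
pct_of_original <- 100 * size_key9 / size_key1
round(pct_of_original)
```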

However, this inflation disappears when I save the file in chunks.

*Split Results:*
Then I wanted to see how much the file size changed for a small chunk of
the data.  So, I used the grouping column to split dat_1000 into 1,000
temporary data tables (one at a time).  Then I set the key for each copy,
and saved each copy, and recorded the size of that file.  I did this for
column 1 and column 9 as the key.
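
Roughly, the split test looks like the sketch below.  It is not the exact
code I ran: the toy table, its column names (id, stamp, grp) and the helper
saved_size() are stand-ins invented for illustration, since the real data
can't be shared.

```r
library(data.table)

# Hypothetical helper: key a copy of `dt` by `key`, save it to a temporary
# file, and return the size of that file in bytes.
saved_size <- function(dt, key) {
  tmp <- copy(dt)                    # copy, because setkeyv() reorders by reference
  setkeyv(tmp, key)
  f <- tempfile(fileext = ".RData")
  on.exit(unlink(f))                 # remove the temporary file afterwards
  save(tmp, file = f)
  file.info(f)$size
}

# Toy stand-in for dat_1000: 10 equally sized groups instead of 1000.
set.seed(1)
n <- 10000
toy <- data.table(
  id    = 1:n,
  stamp = format(as.POSIXct("2012-05-09", tz = "UTC") +
                   sample(86400, n, replace = TRUE)),
  grp   = rep(1:10, each = n / 10)
)

# Process one chunk at a time and record the saved size under each key.
chunk_ids <- unique(toy$grp)
res <- data.frame(
  chunk     = chunk_ids,
  key_id    = vapply(chunk_ids, function(g) saved_size(toy[grp == g], "id"),
                     numeric(1)),
  key_stamp = vapply(chunk_ids, function(g) saved_size(toy[grp == g], "stamp"),
                     numeric(1))
)
res$inflation <- res$key_stamp / res$key_id
```

Summing key_id and key_stamp over all chunks gives the totals, and the
inflation column gives the per-chunk ratio.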

When using column 1 as the key the average file size is 88 KB and the files
add up to 88,281 KB.
When using column 9 as the key the average file size is 128 KB and the
files add up to 128,395 KB.

So the 1000 files only grow to 145% of the original size.

*Analysis on the best and worst chunks of data:*
The chunk with the least amount of inflation is chunk 743, which is 96% of
the original.
The chunk with the most inflation is chunk 666 (go figure), which is 184%
of the original.

                     Chunk  Size(Key1) Size(Key9) Inflation
Smallest Inflation:   743      44,741     42,785      96%
Largest Inflation:    666     243,612    448,621     184%

Some information about these chunks:

Both of these chunks have 3,107 rows.
Column 9 has  347 unique values for chunk 743
Column 9 has 2,432 unique values for chunk 666

Generally speaking chunk 666 has a lot more unique values across all the
columns.
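
The per-column counts of unique values can be obtained along these lines.
This is a minimal sketch on a made-up three-row chunk; the column names and
values are placeholders, not the real data.

```r
# Toy stand-in for one chunk; names and values are invented for illustration.
chunk <- data.frame(
  char4 = c("2012-05-09 20:00:00", "2012-05-09 20:00:00", "2012-05-09 20:00:01"),
  int2  = c(133L, 133L, 340L),
  char7 = c("Paris", "Paris", "Paris"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_unique <- vapply(chunk, function(x) length(unique(x)), integer(1))
n_unique
```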

At this point, I'm just going to use a workaround.

Thank you,

- Gene

On Tue, Oct 23, 2012 at 11:50 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> Hi Gene,
>
> Thanks for all this. Sorry for the delay. Have looked through. It does
> seem likely to do with those very long character strings. Could you save
> head() of the data, before setting the key, and either email it or save
> online somewhere please?
>
> Matthew
>
>
> > Ok, here is my very lengthy reply with lots of diagnostics.
> >
> >
> >>
> >> ## Clear the workspace
> >> rm(list=ls())
> >>
> >> ## I use a function called "loader" to load single data objects
> >> if(!require('geneorama')){
> > +   source('https://raw.github.com/geneorama/geneorama/master/R/loader.R')
> > +   cat('loading function \"loader\"')
> > + }
> >>
> >> ## Load the data
> >> Small = loader('test0')
> >> Large = loader('test1')
> >>
> >> ## The two files will be different because their order is different
> >> str(Small)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of  42 variables:
> >  $ index : int  1 2 3 4 5 6 7 8 9 10 ...
> >  $ char1 : chr  "http://conradhotels3.hilton.com"
> > "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com"
> > "http://conradhotels3.hilton.com" ...
> >  $ char2 : chr  "/en/index.html" "/en/index.html" "/en/index.html"
> > "/en/index.html" ...
> >  $ char3 : chr  "" "" "" "" ...
> >  $ int1  : int  44903 44903 44903 44903 44903 44903 44903 44903 44903
> > 44903
> > ...
> >  $ int2  : int  411 411 254 254 336 336 118 118 386 386 ...
> >  $ char4 : chr  "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> > "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> >  $ int3  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int4  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int5  : int  69 69 69 69 69 69 69 68 68 68 ...
> >  $ int6  : int  68 68 68 68 68 68 68 67 67 67 ...
> >  $ int7  : int  35 35 37 35 35 35 33 38 38 40 ...
> >  $ int8  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int11 : int  1 1 1 1 1 1 1 1 1 1 ...
> >  $ int12 : int  334830 334847 335102 334838 334836 342687 334521 318626
> > 318578 326800 ...
> >  $ int13 : int  36 36 37 36 36 36 35 38 37 39 ...
> >  $ int14 : int  44 44 49 47 45 45 45 46 45 48 ...
> >  $ char5 : chr  "" "" "" "" ...
> >  $ int15 : int  NA NA NA NA NA NA NA NA NA NA ...
> >  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int18 : int  2 2 2 2 2 2 2 2 2 2 ...
> >  $ int19 : int  1381 1152 424 3728 1772 921 385 725 401 314 ...
> >  $ int20 : int  36 36 37 36 36 36 35 38 37 39 ...
> >  $ int21 : int  2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> >  $ int22 : int  44 44 49 47 45 45 45 46 45 48 ...
> >  $ int23 : int  2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> >  $ int24 : int  13 13 14 13 13 13 12 13 13 14 ...
> >  $ int25 : int  5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> >  $ int26 : int  103 103 105 103 103 103 101 105 105 107 ...
> >  $ int27 : int  70 183 159 197 217 165 153 232 92 102 ...
> >  $ int28 : int  103 103 105 103 103 103 101 105 105 107 ...
> >  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int30 : int  161 146 200 158 150 160 190 161 163 169 ...
> >  $ char6 : chr  "Limelight" "Limelight" "Fusepoint/Savvis"
> > "Fusepoint/Savvis" ...
> >  $ char7 : chr  "Paris" "Paris" "Toronto" "Toronto" ...
> >  $ char8 : chr  "-1" "-1" "-1" "-1" ...
> >  $ char9 : chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> >  $ char10: chr  "FR" "FR" "CA" "CA" ...
> >  $ char11: chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> >  - attr(*, ".internal.selfref")=<externalptr>
> >> str(Large)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of  42 variables:
> >  $ index : int  716234 716235 1007651 2679944 1550732 1932010 2879445
> > 1007670 1736006 666363 ...
> >  $ char1 : chr  "http://go.compuware.com" "http://go.compuware.com"
> > "http://www.achmeacollectief.nl" "https://db3.notify.windows.com" ...
> >  $ char2 : chr  "/default.aspx" "/dynaTraceMonitor" "/unilever/" "/ping"
> > ...
> >  $ char3 : chr  "?rurl=http://frontline.compuware.com//products/BU/default.aspx"
> > "?url=http%3A%2F%2Fgo.compuware.com%2Fdefault.aspx%3Frurl%3Dhttp%3A%2F%2Ffrontline.compuware.com%2F%2Fproducts%2FBU%2Fdefault.as"| __truncated__
> > "" "" ...
> >  $ int1  : int  2812881 2812881 3149757 4286896 3618836 3861870 4315803
> > 3149760 3779387 2754629 ...
> >  $ int2  : int  133 133 133 133 340 340 326 133 133 340 ...
> >  $ char4 : chr  "2012-05-09 20:00:00.000" "2012-05-09 20:00:00.000"
> > "2012-05-09 20:00:00.000" "2012-05-09 20:00:00.000" ...
> >  $ int3  : int  0 1 0 0 0 0 0 0 0 0 ...
> >  $ int4  : int  2264 2496 1782 461 1953 1418 641 1207 167 278 ...
> >  $ int5  : int  26 20 6 1 71 64 1 6 1 15 ...
> >  $ int6  : int  26 20 6 1 69 64 1 6 1 15 ...
> >  $ int7  : int  2 2 4 0 2 12 0 2 0 0 ...
> >  $ int8  : int  0 0 0 0 2 0 0 0 0 0 ...
> >  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int11 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int12 : int  392752 417195 43107 0 1419015 1031349 187344 62969 43
> > 428189 ...
> >  $ int13 : int  4 4 5 1 8 22 1 3 1 1 ...
> >  $ int14 : int  9 11 8 1 17 38 1 6 1 15 ...
> >  $ char5 : chr  "" "" "" "" ...
> >  $ int15 : int  NA NA NA NA 0 NA NA NA NA 0 ...
> >  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int18 : int  2 28 3 0 0 1 0 1 0 0 ...
> >  $ int19 : int  137 0 136 298 277 255 147 141 137 209 ...
> >  $ int20 : int  4 0 5 1 8 22 1 3 1 1 ...
> >  $ int21 : int  945 612 59 22 689 1153 54 29 13 59 ...
> >  $ int22 : int  9 5 8 1 17 38 1 6 1 15 ...
> >  $ int23 : int  0 0 0 118 0 0 0 0 0 0 ...
> >  $ int24 : int  0 0 0 1 0 0 0 0 0 0 ...
> >  $ int25 : int  3243 2653 1585 22 3292 3076 64 1043 13 81 ...
> >  $ int26 : int  28 22 10 1 73 76 1 8 1 15 ...
> >  $ int27 : int  2060 3365 257 1 3304 1038 376 258 4 80 ...
> >  $ int28 : int  28 22 10 1 73 76 1 8 1 15 ...
> >  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int30 : int  921 750 203 578 609 1078 234 187 31 140 ...
> >  $ char6 : chr  "Interoute" "Interoute" "Interoute" "Interoute" ...
> >  $ char7 : chr  "Amsterdam" "Amsterdam" "Amsterdam" "Amsterdam" ...
> >  $ char8 : chr  "-1" "-1" "-1" "-1" ...
> >  $ char9 : chr  "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" "NETHERLANDS"
> > ...
> >  $ char10: chr  "NL" "NL" "NL" "NL" ...
> >  $ char11: chr  "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" "NETHERLANDS"
> > ...
> >  - attr(*, ".internal.selfref")=<externalptr>
> >  - attr(*, "sorted")= chr "char4"
> >>
> >> ## The difference is shown here
> >> mapply(identical, Small, Large)
> >  index  char1  char2  char3   int1   int2  char4   int3   int4   int5
> > int6   int7   int8   int9
> >  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
> >  FALSE  FALSE  FALSE  FALSE
> >  int10  int11  int12  int13  int14  char5  int15  int16  int17  int18
> >  int19  int20  int21  int22
> >  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
> >  FALSE  FALSE  FALSE  FALSE
> >  int23  int24  int25  int26  int27  int28  int29  int30  char6  char7
> >  char8  char9 char10 char11
> >  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
> >  FALSE  FALSE  FALSE  FALSE
> >> mapply(all.equal, Small, Large)
> >                                                          index
> >                          "Mean relative difference: 0.6660698"
> >                                                          char1
> >                                    "3100674 string mismatches"
> >                                                          char2
> >                                    "2961621 string mismatches"
> >                                                          char3
> >                                    "1753352 string mismatches"
> >                                                           int1
> >                          "Mean relative difference: 0.2945024"
> >                                                           int2
> >                          "Mean relative difference: 0.4866954"
> >                                                          char4
> >                                    "3103308 string mismatches"
> >                                                           int3
> >                           "Mean relative difference: 1.759713"
> >                                                           int4
> >                           "Mean relative difference: 1.408616"
> >                                                           int5
> >                           "Mean relative difference: 1.411817"
> >                                                           int6
> >                           "Mean relative difference: 1.415648"
> >                                                           int7
> >                           "Mean relative difference: 1.705137"
> >                                                           int8
> >                           "Mean relative difference: 1.954795"
> >                                                           int9
> >                            "Mean relative difference: 1.99701"
> >                                                          int10
> >                           "Mean relative difference: 1.995529"
> >                                                          int11
> >                                  "Mean relative difference: 2"
> >                                                          int12
> >                           "Mean relative difference: 1.479043"
> >                                                          int13
> >                           "Mean relative difference: 1.323619"
> >                                                          int14
> >                           "Mean relative difference: 1.360022"
> >                                                          char5
> >                                    "1454309 string mismatches"
> >                                                          int15
> > "'is.NA' value mismatch: 2260789 in current 2260789 in target"
> >                                                          int16
> >                           "Mean relative difference: 1.997195"
> >                                                          int17
> >                                  "Mean relative difference: 2"
> >                                                          int18
> >                           "Mean relative difference: 1.799441"
> >                                                          int19
> >                           "Mean relative difference: 1.571321"
> >                                                          int20
> >                           "Mean relative difference: 1.474492"
> >                                                          int21
> >                           "Mean relative difference: 1.669488"
> >                                                          int22
> >                           "Mean relative difference: 1.465307"
> >                                                          int23
> >                           "Mean relative difference: 1.842191"
> >                                                          int24
> >                            "Mean relative difference: 1.76578"
> >                                                          int25
> >                           "Mean relative difference: 1.481612"
> >                                                          int26
> >                           "Mean relative difference: 1.403655"
> >                                                          int27
> >                           "Mean relative difference: 1.722723"
> >                                                          int28
> >                           "Mean relative difference: 1.403655"
> >                                                          int29
> >                                  "Mean relative difference: 2"
> >                                                          int30
> >                           "Mean relative difference: 1.535987"
> >                                                          char6
> >                                    "2899128 string mismatches"
> >                                                          char7
> >                                    "3008489 string mismatches"
> >                                                          char8
> >                                    "2503189 string mismatches"
> >                                                          char9
> >                                    "2957002 string mismatches"
> >                                                         char10
> >                                    "1933196 string mismatches"
> >                                                         char11
> >                                    "1933196 string mismatches"
> >>
> >> ## I re-ran the steps to create the files (almost the same as the last
> >> ## email), but added an "index" equal to 1:nrow(datMod)
> >> ## This index is used to reorder the files to be consistent
> >> LargeOrd = Large[order(Large$index), ]
> >> str(LargeOrd)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of  42 variables:
> >  $ index : int  1 2 3 4 5 6 7 8 9 10 ...
> >  $ char1 : chr  "http://conradhotels3.hilton.com"
> > "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com"
> > "http://conradhotels3.hilton.com" ...
> >  $ char2 : chr  "/en/index.html" "/en/index.html" "/en/index.html"
> > "/en/index.html" ...
> >  $ char3 : chr  "" "" "" "" ...
> >  $ int1  : int  44903 44903 44903 44903 44903 44903 44903 44903 44903
> > 44903
> > ...
> >  $ int2  : int  411 411 254 254 336 336 118 118 386 386 ...
> >  $ char4 : chr  "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> > "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> >  $ int3  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int4  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int5  : int  69 69 69 69 69 69 69 68 68 68 ...
> >  $ int6  : int  68 68 68 68 68 68 68 67 67 67 ...
> >  $ int7  : int  35 35 37 35 35 35 33 38 38 40 ...
> >  $ int8  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int11 : int  1 1 1 1 1 1 1 1 1 1 ...
> >  $ int12 : int  334830 334847 335102 334838 334836 342687 334521 318626
> > 318578 326800 ...
> >  $ int13 : int  36 36 37 36 36 36 35 38 37 39 ...
> >  $ int14 : int  44 44 49 47 45 45 45 46 45 48 ...
> >  $ char5 : chr  "" "" "" "" ...
> >  $ int15 : int  NA NA NA NA NA NA NA NA NA NA ...
> >  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int18 : int  2 2 2 2 2 2 2 2 2 2 ...
> >  $ int19 : int  1381 1152 424 3728 1772 921 385 725 401 314 ...
> >  $ int20 : int  36 36 37 36 36 36 35 38 37 39 ...
> >  $ int21 : int  2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> >  $ int22 : int  44 44 49 47 45 45 45 46 45 48 ...
> >  $ int23 : int  2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> >  $ int24 : int  13 13 14 13 13 13 12 13 13 14 ...
> >  $ int25 : int  5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> >  $ int26 : int  103 103 105 103 103 103 101 105 105 107 ...
> >  $ int27 : int  70 183 159 197 217 165 153 232 92 102 ...
> >  $ int28 : int  103 103 105 103 103 103 101 105 105 107 ...
> >  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int30 : int  161 146 200 158 150 160 190 161 163 169 ...
> >  $ char6 : chr  "Limelight" "Limelight" "Fusepoint/Savvis"
> > "Fusepoint/Savvis" ...
> >  $ char7 : chr  "Paris" "Paris" "Toronto" "Toronto" ...
> >  $ char8 : chr  "-1" "-1" "-1" "-1" ...
> >  $ char9 : chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> >  $ char10: chr  "FR" "FR" "CA" "CA" ...
> >  $ char11: chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> >  - attr(*, ".internal.selfref")=<externalptr>
> >>
> >> ## Here the ordered files come out to be equivalent
> >> mapply(identical, Small, LargeOrd)
> >  index  char1  char2  char3   int1   int2  char4   int3   int4   int5
> > int6   int7   int8   int9
> >   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
> > TRUE   TRUE   TRUE   TRUE
> >  int10  int11  int12  int13  int14  char5  int15  int16  int17  int18
> >  int19  int20  int21  int22
> >   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
> > TRUE   TRUE   TRUE   TRUE
> >  int23  int24  int25  int26  int27  int28  int29  int30  char6  char7
> >  char8  char9 char10 char11
> >   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
> > TRUE   TRUE   TRUE   TRUE
> >> mapply(all.equal, Small, LargeOrd)
> >  index  char1  char2  char3   int1   int2  char4   int3   int4   int5
> > int6   int7   int8   int9
> >   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
> > TRUE   TRUE   TRUE   TRUE
> >  int10  int11  int12  int13  int14  char5  int15  int16  int17  int18
> >  int19  int20  int21  int22
> >   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
> > TRUE   TRUE   TRUE   TRUE
> >  int23  int24  int25  int26  int27  int28  int29  int30  char6  char7
> >  char8  char9 char10 char11
> >   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
> > TRUE   TRUE   TRUE   TRUE
> >>
> >> ## The inspection results
> >> .Internal(inspect(Small))
> > @0x00000000128068e8 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=42, tl=0)
> >   @0x000007ff8a3e0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 1,2,3,4,5,...
> >   @0x000007ff4fb30010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     ...
> >   @0x000007ff4e380010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     ...
> >   @0x000007ff4cbd0010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] ""
> >     ...
> >   @0x000007ff88c20010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 44903,44903,44903,44903,44903,...
> >   ...
> > ATTRIB:
> >   @0x0000000012cab8e0 02 LISTSXP g1c0 [MARK]
> >     TAG: @0x0000000000120088 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "names" (has value)
> >     @0x0000000016868d68 16 STRSXP g1c7 [MARK,NAM(2)] (len=42, tl=0)
> >       @0x0000000010112b98 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "index"
> >       @0x0000000016b28fd0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char1"
> >       @0x0000000016b291e0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII]
> > [cached] "char2"
> >       @0x0000000016b293c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char3"
> >       @0x0000000016b29600 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "int1"
> >       ...
> >     TAG: @0x0000000000120558 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "class" (has value)
> >     @0x00000000138b5318 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
> >       @0x000000000b42c760 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.table"
> >       @0x000000000027d230 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.frame"
> >     TAG: @0x0000000000121d98 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000]
> > "row.names" (has value)
> >     @0x0000000012c38050 13 INTSXP g1c1 [MARK,NAM(2)] (len=2, tl=0)
> > -2147483648,-3103314
> >     TAG: @0x000000001497ac10 01 SYMSXP g1c0 [MARK] ".internal.selfref"
> >     @0x0000000012caaa60 22 EXTPTRSXP g1c0 [MARK,NAM(2)]
> >> .Internal(inspect(Large))
> > @0x0000000012c24c68 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=42, tl=0)
> >   @0x000007ff314d0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 716234,716235,1007651,2679944,1550732,...
> >   @0x000007ff2fd20010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x000000001253d8e0 09 CHARSXP g1c3 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "http://go.compuware.com"
> >     @0x000000001253d8e0 09 CHARSXP g1c3 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "http://go.compuware.com"
> >     @0x000000001e6d7ab0 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://www.achmeacollectief.nl"
> >     @0x000000001e4a59b8 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > https://db3.notify.windows.com"
> >     @0x000000001e63ee70 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://www.christushealth.org"
> >     ...
> >   @0x000007ff2e570010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x00000000200aa218 09 CHARSXP g1c2 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "/default.aspx"
> >     @0x000000001e444d78 09 CHARSXP g1c3 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "/dynaTraceMonitor"
> >     @0x000000001eb64790 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/unilever/"
> >     @0x000000000feb4e98 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> > "/ping"
> >     @0x0000000000124950 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "/"
> >     ...
> >   @0x000007ff2cdc0010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x0000000017a39430 09 CHARSXP g1c5 [MARK,gp=0x60] [ASCII] [cached]
> > "?rurl=http://frontline.compuware.com//products/BU/default.aspx"
> >     @0x000000001b721a50 09 CHARSXP g1c7 [MARK,gp=0x60] [ASCII] [cached]
> > "?url=http%3A%2F%2Fgo.compuware.com%2Fdefault.aspx%3Frurl%3Dhttp%3A%2F%2Ffrontline.compuware.com%2F%2Fproducts%2FBU%2Fdefault.aspx$title=$frames=0$pId=G_1336593601673$fId=G_1336593601673$pFId=$rId=RID_73295254$rpId=1059475658$actions=1%7C_load_%7C-%7C_load_%7C1336593601673%7C1336593602736%7C375%2C2%7C_onload_%7C-%7C_load_%7C1336593602626%7C1336593602704%7C375$domR=1336593602642$dtV=410$3p=www.google-analytics.com%7C0%7C0%7C0%7C%7C0%7C0%7C0%7C1%7C828_859%7C31%7C31%7C31%7C0%7C%7C0%7C0%7C0%2Cs%7C828%7C859%7C_load_%7Chttp%253A%252F%252Fwww.google-analytics.com%252Fga.js%3B2264ff.r.axf8.net%7C0%7C0%7C0%7C%7C0%7C0%7C0%7C1%7C953_1078%7C125%7C125%7C125%7C0%7C%7C0%7C0%7C0%2Cs%7C953%7C1078%7C_load_%7Chttp%253A%252F%252F2264FF.r.axf8.net%252Fmr%252Fe.gif%253Finfo%253D%25257Bn%25253Ac%25257Cc%25253A38695455749817%25257Cd%25253A1%25257Ca%25253A2264FF%25257Ch%25253A1%25257Ce%25253A%25257Cb%25253A%25257Cl%25253Ahttp%252524%252A%252524%25252F%25252Fgo.compuware.com%25252Fdefault.aspx%25257Cm%25253A1024%25257Co%25253A768%25257Cp%25253AWin32%25257Cq%25253Ax86%25257Ck%25253Alan%25257Cg%25253AMSIE%25257Cf%25253A8.0%25257D%25257Bn%25253Au%25257Ce%25253A1%25257D%2526a%253D2264FF%2526r%253D1%2526s%253D1$time=1336593603689$"
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     ...
> >   @0x000007ff2c1e0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 2812881,2812881,3149757,4286896,3618836,...
> >   ...
> > ATTRIB:
> >   @0x000000001f163298 02 LISTSXP g1c0 [MARK]
> >     TAG: @0x0000000000120088 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "names" (has value)
> >     @0x000000001283e0e0 16 STRSXP g1c7 [MARK,NAM(2)] (len=42, tl=0)
> >       @0x0000000010112b98 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "index"
> >       @0x0000000016b28fd0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char1"
> >       @0x0000000016b291e0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII]
> > [cached] "char2"
> >       @0x0000000016b293c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char3"
> >       @0x0000000016b29600 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "int1"
> >       ...
> >     TAG: @0x0000000000120558 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "class" (has value)
> >     @0x000000001368f078 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
> >       @0x000000000b42c760 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.table"
> >       @0x000000000027d230 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.frame"
> >     TAG: @0x0000000000121d98 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000]
> > "row.names" (has value)
> >     @0x000000000fb9a988 13 INTSXP g1c1 [MARK,NAM(2)] (len=2, tl=0)
> > -2147483648,-3103314
> >     TAG: @0x000000001497ac10 01 SYMSXP g1c0 [MARK] ".internal.selfref"
> >     @0x000000001f163110 22 EXTPTRSXP g1c0 [MARK,NAM(2)]
> >     TAG: @0x0000000016c8d648 01 SYMSXP g1c0 [MARK] "sorted"
> >     @0x000000001ece5f88 16 STRSXP g1c1 [MARK,NAM(2)] (len=1, tl=0)
> >       @0x0000000016b27b78 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char4"
> >> .Internal(inspect(LargeOrd))
> > @0x0000000012b69468 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=42, tl=100)
> >   @0x000007ffc4fb0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 1,2,3,4,5,...
> >   @0x000007ffc2c20010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> >     ...
> >   @0x000007ffc0890010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> >     ...
> >   @0x000007ffbe500010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> >     ...
> >   @0x000007ffbcd40010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 44903,44903,44903,44903,44903,...
> >   ...
> > ATTRIB:
> >   @0x0000000012cec058 02 LISTSXP g1c0 [MARK]
> >     TAG: @0x0000000000120088 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "names" (has value)
> >     @0x0000000012b60620 16 STRSXP g1c7 [MARK,NAM(2)] (len=42, tl=100)
> >       @0x0000000010112b98 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "index"
> >       @0x0000000016b28fd0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char1"
> >       @0x0000000016b291e0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII]
> > [cached] "char2"
> >       @0x0000000016b293c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char3"
> >       @0x0000000016b29600 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "int1"
> >       ...
> >     TAG: @0x0000000000120558 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "class" (has value)
> >     @0x0000000013a16be0 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
> >       @0x000000000b42c760 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.table"
> >       @0x000000000027d230 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.frame"
> >     TAG: @0x0000000000121d98 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000]
> > "row.names" (has value)
> >     @0x0000000012c2f2f0 13 INTSXP g1c1 [MARK,NAM(2)] (len=2, tl=0)
> > -2147483648,-3103314
> >     TAG: @0x000000001497ac10 01 SYMSXP g1c0 [MARK] ".internal.selfref"
> >     @0x0000000012cec170 22 EXTPTRSXP g1c0 [MARK,NAM(2)]
> >>
> >>
> >> ## A little size tester function
> >> ## This will set a key, save the result, print the result's size
> >> keytest = function(dt, key){
> > +   setkeyv(dt, key)
> > +   save(dt, file='dt_temp.Rdata')
> > +   tempfilesize = file.info('dt_temp.Rdata')$size
> > +   tempfilesize = formatC(tempfilesize, big.mark=',', format='f', digits=0)
> > +   cat(key, tempfilesize, '\n\n')
> > +   unlink('dt_temp.Rdata')
> > +   invisible(NULL)
> > + }
> >>
> >> str(Small)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of  42 variables:
> >  $ index : int  1 2 3 4 5 6 7 8 9 10 ...
> >  $ char1 : chr  "http://conradhotels3.hilton.com"
> > "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com"
> > "http://conradhotels3.hilton.com" ...
> >  $ char2 : chr  "/en/index.html" "/en/index.html" "/en/index.html"
> > "/en/index.html" ...
> >  $ char3 : chr  "" "" "" "" ...
> >  $ int1  : int  44903 44903 44903 44903 44903 44903 44903 44903 44903 44903 ...
> >  $ int2  : int  411 411 254 254 336 336 118 118 386 386 ...
> >  $ char4 : chr  "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> > "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> >  $ int3  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int4  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int5  : int  69 69 69 69 69 69 69 68 68 68 ...
> >  $ int6  : int  68 68 68 68 68 68 68 67 67 67 ...
> >  $ int7  : int  35 35 37 35 35 35 33 38 38 40 ...
> >  $ int8  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int11 : int  1 1 1 1 1 1 1 1 1 1 ...
> >  $ int12 : int  334830 334847 335102 334838 334836 342687 334521 318626
> > 318578 326800 ...
> >  $ int13 : int  36 36 37 36 36 36 35 38 37 39 ...
> >  $ int14 : int  44 44 49 47 45 45 45 46 45 48 ...
> >  $ char5 : chr  "" "" "" "" ...
> >  $ int15 : int  NA NA NA NA NA NA NA NA NA NA ...
> >  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int18 : int  2 2 2 2 2 2 2 2 2 2 ...
> >  $ int19 : int  1381 1152 424 3728 1772 921 385 725 401 314 ...
> >  $ int20 : int  36 36 37 36 36 36 35 38 37 39 ...
> >  $ int21 : int  2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> >  $ int22 : int  44 44 49 47 45 45 45 46 45 48 ...
> >  $ int23 : int  2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> >  $ int24 : int  13 13 14 13 13 13 12 13 13 14 ...
> >  $ int25 : int  5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> >  $ int26 : int  103 103 105 103 103 103 101 105 105 107 ...
> >  $ int27 : int  70 183 159 197 217 165 153 232 92 102 ...
> >  $ int28 : int  103 103 105 103 103 103 101 105 105 107 ...
> >  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
> >  $ int30 : int  161 146 200 158 150 160 190 161 163 169 ...
> >  $ char6 : chr  "Limelight" "Limelight" "Fusepoint/Savvis"
> > "Fusepoint/Savvis" ...
> >  $ char7 : chr  "Paris" "Paris" "Toronto" "Toronto" ...
> >  $ char8 : chr  "-1" "-1" "-1" "-1" ...
> >  $ char9 : chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> >  $ char10: chr  "FR" "FR" "CA" "CA" ...
> >  $ char11: chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> >  - attr(*, ".internal.selfref")=<externalptr>
> >> keytest(Small, colnames(Small)[1])
> > index 77,694,801
> >
> >> keytest(Small, colnames(Small)[2])
> > char1 75,876,250
> >
> >> keytest(Small, colnames(Small)[3])
> > char2 77,218,972
> >
> >> keytest(Small, colnames(Small)[4])
> > char3 80,585,449
> >
> >> keytest(Small, colnames(Small)[5])
> > int1 77,558,982
> >
> >> keytest(Small, colnames(Small)[6])
> > int2 95,185,248
> >
> >> keytest(Small, colnames(Small)[7])
> > char4 204,037,056
> >
> >> keytest(Small, colnames(Small)[8])
> > int3 206,450,705
> >
> >> keytest(Small, colnames(Small)[9])
> > int4 211,520,888
> >
> >> keytest(Small, colnames(Small)[10])
> > int5 156,095,150
> >
> >>
> >>
> >> keytest(Small, colnames(Small)[11])
> > int6 150,431,716
> >
> >> keytest(Small, colnames(Small)[12])
> > int7 136,077,306
> >
> >> keytest(Small, colnames(Small)[13])
> > int8 134,981,911
> >
> >> keytest(Small, colnames(Small)[14])
> > int9 134,871,952
> >
> >> keytest(Small, colnames(Small)[15])
> > int10 134,678,104
> >
> >> keytest(Small, colnames(Small)[16])
> > int11 134,682,904
> >
> >> keytest(Small, colnames(Small)[17])
> > int12 112,097,493
> >
> >> keytest(Small, colnames(Small)[18])
> > int13 101,734,541
> >
> >> keytest(Small, colnames(Small)[19])
> > int14 101,160,920
> >
> >>
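For comparison with the percentages quoted earlier in the thread, the sizes printed above can be turned into ratios relative to the file keyed on the first column. This is only arithmetic on the numbers already shown; the vector name `sizes` is made up for the illustration.

```r
## File sizes in bytes, copied from the keytest() output above
sizes = c(index = 77694801, char1 = 75876250, char2 = 77218972,
          char3 = 80585449, int1  = 77558982, int2  = 95185248,
          char4 = 204037056, int3 = 206450705, int4 = 211520888,
          int5  = 156095150, int6 = 150431716, int7 = 136077306,
          int8  = 134981911, int9 = 134871952, int10 = 134678104,
          int11 = 134682904, int12 = 112097493, int13 = 101734541,
          int14 = 101160920)

## Size of each file as a percentage of the file keyed on 'index'
ratio = round(100 * sizes / sizes[['index']])

## Largest inflation first
print(sort(ratio, decreasing = TRUE))
```

One thing to keep in mind when reading these ratios: setkeyv() sorts the table by reference, so each keytest() call in the session above starts from the row order left by the previous call rather than from the original order.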
> >
>
>
>
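Because keytest() calls setkeyv() on the table it is given, the table is re-sorted in place on every call. A variant that works on a copy would make each measurement independent of the ones before it. The following is a rough sketch of that idea, not something taken from the thread: the name `keytest_copy` and the toy table are hypothetical, and copying doubles the memory needed, which may matter for a table with 3 million rows.

```r
library(data.table)

## Sketch: same idea as keytest(), but key a copy so the caller's table
## keeps its row order, and write to a temporary file.
keytest_copy = function(dt, key){
  tmp = copy(dt)                        # setkeyv() sorts by reference
  setkeyv(tmp, key)
  f = tempfile(fileext = '.Rdata')
  on.exit(unlink(f))                    # remove the file when done
  save(tmp, file = f)
  size = file.info(f)$size
  cat(key, formatC(size, big.mark = ',', format = 'f', digits = 0), '\n')
  invisible(size)                       # return bytes for later comparison
}

## Toy data, made up for illustration (not the data from this thread)
set.seed(1)
toy = data.table(id  = 1:100000,
                 grp = sample(1:50, 100000, replace = TRUE),
                 val = sample(1:1000, 100000, replace = TRUE))

## One measurement per column, each starting from the same row order
sizes = sapply(names(toy), function(k) keytest_copy(toy, k))
```

Returning the size invisibly, instead of NULL, lets the results be collected and compared without parsing the printed output.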