[datatable-help] Memory issue
Gene Leynes
gleynes+r at gmail.com
Tue Oct 23 22:57:07 CEST 2012
Hello Matt,
Unfortunately I can't share the data. And, as I found out today, even if I
could get permission to share part of the data, that would not reproduce
the problem.
Here's why I think that a sample of the data won't prove the point.
I started with two sets of data:
dat : A data table with 3,103,314 rows and 43 columns
dat_1000: Same table, but with a new column to define 1000 equally sized
groups
*Whole File Results:*
If I set the key of dat to be column 1 and save it, the resulting file
size is 77.731 MB
If I set the key of dat to be column 9 and save it, the resulting file
size is 206.874 MB
In other words, with the key set to column 9 the file is 266% of the size
of the file keyed on column 1. The 9th column is an ISO date-time value.
However, if I use column 10, which is just an integer, the results are
similar. For column 10 the file is 196% of the original size.
However, this inflation disappears when I save the file in chunks.
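
For reference, the percentages here are just the ratio of the two file sizes. A quick check in R, using only the two sizes quoted above (the variable names are made up for illustration):

```r
## File sizes in MB, as reported above
size_key1 <- 77.731    # key set to column 1
size_key9 <- 206.874   # key set to column 9

## Size of the column-9 file as a percentage of the column-1 file
inflation <- 100 * size_key9 / size_key1
cat(sprintf("Key on column 9: %.0f%% of the column-1 file size\n", inflation))
```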
*Split Results:*
Then I wanted to see how much the file size changed for a small chunk of
the data. So, I used the grouping column to split dat_1000 into 1,000
temporary data tables (one at a time). Then I set the key for each copy,
and saved each copy, and recorded the size of that file. I did this for
column 1 and column 9 as the key.
When using column 1 as the key the average file size is 88 KB and the files
add up to 88,281 KB.
When using column 9 as the key the average file size is 128 KB and the
files add up to 128,395 KB.
So the 1,000 files only grow to 145% of the original size.
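
The split procedure described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the code that produced the numbers in this message: the toy table, its column names (id, stamp, grp), the use of 10 groups instead of 1,000, and the helper chunk_size are all assumptions.

```r
library(data.table)

## Toy stand-in for dat_1000 (hypothetical columns, 10 groups instead of 1,000)
set.seed(1)
n <- 10000
dat_1000 <- data.table(
  id    = 1:n,
  stamp = format(as.POSIXct("2012-05-09 20:00:00", tz = "UTC") +
                   sample(3600, n, replace = TRUE)),
  grp   = rep(1:10, each = n / 10)
)

## Set a key on a copy of one chunk, save it, and return the file size in bytes
chunk_size <- function(chunk, key) {
  chunk <- copy(chunk)          # setkeyv() reorders by reference, so work on a copy
  setkeyv(chunk, key)
  f <- tempfile(fileext = ".Rdata")
  on.exit(unlink(f))
  save(chunk, file = f)
  file.info(f)$size
}

## Process the chunks one at a time, recording the size under each key
groups <- unique(dat_1000$grp)
sizes <- matrix(NA_real_, nrow = length(groups), ncol = 2,
                dimnames = list(as.character(groups), c("key_id", "key_stamp")))
for (i in seq_along(groups)) {
  chunk <- dat_1000[grp == groups[i]]
  sizes[i, "key_id"]    <- chunk_size(chunk, "id")
  sizes[i, "key_stamp"] <- chunk_size(chunk, "stamp")
}

## Totals per key, and the overall ratio between the two
totals <- colSums(sizes)
totals["key_stamp"] / totals["key_id"]
```

On random toy data the ratio will not match the figures above; the point is only the shape of the loop.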
*Analysis on the best and worst chunks of data:*
The chunk with the least amount of inflation is chunk 743, which is 96% of
the original.
The chunk with the most inflation is chunk 666 (go figure), which is 184%
of the original.
                     Chunk   Size(Key1)   Size(Key9)   Inflation
Smallest Inflation:    743       44,741       42,785         96%
Largest Inflation:     666      243,612      448,621        184%
Some information about these chunks:
Both of these chunks have 3,107 rows.
Column 9 has 347 unique values for chunk 743
Column 9 has 2,432 unique values for chunk 666
Generally speaking chunk 666 has a lot more unique values across all the
columns.
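
One way to probe this observation is to count the distinct values per column in a chunk and to compare compressed sizes. The sketch below uses made-up strings, with the row count and number of distinct values borrowed from the two chunks above. The helpers n_unique and gz_size are hypothetical names. save() compresses with gzip by default, so the working assumption (not verified on the real data) is that more distinct values leave less redundancy to compress.

```r
## Number of distinct values in each column of a data.frame or data.table
n_unique <- function(d) sapply(d, function(x) length(unique(x)))

## Size in bytes of an object after serializing and gzip-compressing it in memory
gz_size <- function(x) length(memCompress(serialize(x, NULL), type = "gzip"))

## Two toy "chunks" with 3,107 rows each, drawn from 347 and 2,432 possible values
set.seed(42)
n <- 3107
few  <- data.frame(stamp = sample(sprintf("t%04d", 1:347),  n, replace = TRUE),
                   stringsAsFactors = FALSE)
many <- data.frame(stamp = sample(sprintf("t%04d", 1:2432), n, replace = TRUE),
                   stringsAsFactors = FALSE)

n_unique(few)
n_unique(many)
c(few = gz_size(few), many = gz_size(many))
```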
At this point, I'm just going to use a workaround.
Thank you,
- Gene
On Tue, Oct 23, 2012 at 11:50 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Hi Gene,
>
> Thanks for all this. Sorry for the delay. Have looked through. It does
> seem likely to be to do with those very long character strings. Could you save
> head() of the data, before setting the key, and either email it or save
> online somewhere please?
>
> Matthew
>
>
> > Ok, here is my very lengthy reply with lots of diagnostics.
> >
> >
> >>
> >> ## Clear the workspace
> >> rm(list=ls())
> >>
> >> ## I use a function called "loader" to load single data objects
> >> if(!require('geneorama')){
> > + source('https://raw.github.com/geneorama/geneorama/master/R/loader.R')
> > + cat('loading function \"loader\"')
> > + }
> >>
> >> ## Load the data
> >> Small = loader('test0')
> >> Large = loader('test1')
> >>
> >> ## The two files will be different because their order is different
> >> str(Small)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of 42 variables:
> > $ index : int 1 2 3 4 5 6 7 8 9 10 ...
> > $ char1 : chr "http://conradhotels3.hilton.com" "
> > http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "
> > http://conradhotels3.hilton.com" ...
> > $ char2 : chr "/en/index.html" "/en/index.html" "/en/index.html"
> > "/en/index.html" ...
> > $ char3 : chr "" "" "" "" ...
> > $ int1 : int 44903 44903 44903 44903 44903 44903 44903 44903 44903
> > 44903
> > ...
> > $ int2 : int 411 411 254 254 336 336 118 118 386 386 ...
> > $ char4 : chr "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> > "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> > $ int3 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int4 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int5 : int 69 69 69 69 69 69 69 68 68 68 ...
> > $ int6 : int 68 68 68 68 68 68 68 67 67 67 ...
> > $ int7 : int 35 35 37 35 35 35 33 38 38 40 ...
> > $ int8 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int9 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int10 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int11 : int 1 1 1 1 1 1 1 1 1 1 ...
> > $ int12 : int 334830 334847 335102 334838 334836 342687 334521 318626
> > 318578 326800 ...
> > $ int13 : int 36 36 37 36 36 36 35 38 37 39 ...
> > $ int14 : int 44 44 49 47 45 45 45 46 45 48 ...
> > $ char5 : chr "" "" "" "" ...
> > $ int15 : int NA NA NA NA NA NA NA NA NA NA ...
> > $ int16 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int17 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int18 : int 2 2 2 2 2 2 2 2 2 2 ...
> > $ int19 : int 1381 1152 424 3728 1772 921 385 725 401 314 ...
> > $ int20 : int 36 36 37 36 36 36 35 38 37 39 ...
> > $ int21 : int 2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> > $ int22 : int 44 44 49 47 45 45 45 46 45 48 ...
> > $ int23 : int 2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> > $ int24 : int 13 13 14 13 13 13 12 13 13 14 ...
> > $ int25 : int 5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> > $ int26 : int 103 103 105 103 103 103 101 105 105 107 ...
> > $ int27 : int 70 183 159 197 217 165 153 232 92 102 ...
> > $ int28 : int 103 103 105 103 103 103 101 105 105 107 ...
> > $ int29 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int30 : int 161 146 200 158 150 160 190 161 163 169 ...
> > $ char6 : chr "Limelight" "Limelight" "Fusepoint/Savvis"
> > "Fusepoint/Savvis" ...
> > $ char7 : chr "Paris" "Paris" "Toronto" "Toronto" ...
> > $ char8 : chr "-1" "-1" "-1" "-1" ...
> > $ char9 : chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> > $ char10: chr "FR" "FR" "CA" "CA" ...
> > $ char11: chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> > - attr(*, ".internal.selfref")=<externalptr>
> >> str(Large)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of 42 variables:
> > $ index : int 716234 716235 1007651 2679944 1550732 1932010 2879445
> > 1007670 1736006 666363 ...
> > $ char1 : chr "http://go.compuware.com" "http://go.compuware.com" "
> > http://www.achmeacollectief.nl" "https://db3.notify.windows.com" ...
> > $ char2 : chr "/default.aspx" "/dynaTraceMonitor" "/unilever/" "/ping"
> > ...
> > $ char3 : chr "?rurl=
> > http://frontline.compuware.com//products/BU/default.aspx"
> > "?url=http%3A%2F%
> > 2Fgo.compuware.com%2Fdefault.aspx%3Frurl%3Dhttp%3A%2F%
> > 2Ffrontline.compuware.com%2F%2Fproducts%2FBU%2Fdefault.as"|
> __truncated__
> > "" "" ...
> > $ int1 : int 2812881 2812881 3149757 4286896 3618836 3861870 4315803
> > 3149760 3779387 2754629 ...
> > $ int2 : int 133 133 133 133 340 340 326 133 133 340 ...
> > $ char4 : chr "2012-05-09 20:00:00.000" "2012-05-09 20:00:00.000"
> > "2012-05-09 20:00:00.000" "2012-05-09 20:00:00.000" ...
> > $ int3 : int 0 1 0 0 0 0 0 0 0 0 ...
> > $ int4 : int 2264 2496 1782 461 1953 1418 641 1207 167 278 ...
> > $ int5 : int 26 20 6 1 71 64 1 6 1 15 ...
> > $ int6 : int 26 20 6 1 69 64 1 6 1 15 ...
> > $ int7 : int 2 2 4 0 2 12 0 2 0 0 ...
> > $ int8 : int 0 0 0 0 2 0 0 0 0 0 ...
> > $ int9 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int10 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int11 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int12 : int 392752 417195 43107 0 1419015 1031349 187344 62969 43
> > 428189 ...
> > $ int13 : int 4 4 5 1 8 22 1 3 1 1 ...
> > $ int14 : int 9 11 8 1 17 38 1 6 1 15 ...
> > $ char5 : chr "" "" "" "" ...
> > $ int15 : int NA NA NA NA 0 NA NA NA NA 0 ...
> > $ int16 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int17 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int18 : int 2 28 3 0 0 1 0 1 0 0 ...
> > $ int19 : int 137 0 136 298 277 255 147 141 137 209 ...
> > $ int20 : int 4 0 5 1 8 22 1 3 1 1 ...
> > $ int21 : int 945 612 59 22 689 1153 54 29 13 59 ...
> > $ int22 : int 9 5 8 1 17 38 1 6 1 15 ...
> > $ int23 : int 0 0 0 118 0 0 0 0 0 0 ...
> > $ int24 : int 0 0 0 1 0 0 0 0 0 0 ...
> > $ int25 : int 3243 2653 1585 22 3292 3076 64 1043 13 81 ...
> > $ int26 : int 28 22 10 1 73 76 1 8 1 15 ...
> > $ int27 : int 2060 3365 257 1 3304 1038 376 258 4 80 ...
> > $ int28 : int 28 22 10 1 73 76 1 8 1 15 ...
> > $ int29 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int30 : int 921 750 203 578 609 1078 234 187 31 140 ...
> > $ char6 : chr "Interoute" "Interoute" "Interoute" "Interoute" ...
> > $ char7 : chr "Amsterdam" "Amsterdam" "Amsterdam" "Amsterdam" ...
> > $ char8 : chr "-1" "-1" "-1" "-1" ...
> > $ char9 : chr "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" "NETHERLANDS"
> > ...
> > $ char10: chr "NL" "NL" "NL" "NL" ...
> > $ char11: chr "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" "NETHERLANDS"
> > ...
> > - attr(*, ".internal.selfref")=<externalptr>
> > - attr(*, "sorted")= chr "char4"
> >>
> >> ## The difference is shown here
> >> mapply(identical, Small, Large)
> > index char1 char2 char3 int1 int2 char4 int3 int4 int5
> > int6 int7 int8 int9
> > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > FALSE FALSE FALSE FALSE
> > int10 int11 int12 int13 int14 char5 int15 int16 int17 int18
> > int19 int20 int21 int22
> > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > FALSE FALSE FALSE FALSE
> > int23 int24 int25 int26 int27 int28 int29 int30 char6 char7
> > char8 char9 char10 char11
> > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > FALSE FALSE FALSE FALSE
> >> mapply(all.equal, Small, Large)
> > index
> > "Mean relative difference: 0.6660698"
> > char1
> > "3100674 string mismatches"
> > char2
> > "2961621 string mismatches"
> > char3
> > "1753352 string mismatches"
> > int1
> > "Mean relative difference: 0.2945024"
> > int2
> > "Mean relative difference: 0.4866954"
> > char4
> > "3103308 string mismatches"
> > int3
> > "Mean relative difference: 1.759713"
> > int4
> > "Mean relative difference: 1.408616"
> > int5
> > "Mean relative difference: 1.411817"
> > int6
> > "Mean relative difference: 1.415648"
> > int7
> > "Mean relative difference: 1.705137"
> > int8
> > "Mean relative difference: 1.954795"
> > int9
> > "Mean relative difference: 1.99701"
> > int10
> > "Mean relative difference: 1.995529"
> > int11
> > "Mean relative difference: 2"
> > int12
> > "Mean relative difference: 1.479043"
> > int13
> > "Mean relative difference: 1.323619"
> > int14
> > "Mean relative difference: 1.360022"
> > char5
> > "1454309 string mismatches"
> > int15
> > "'is.NA' value mismatch: 2260789 in current 2260789 in target"
> > int16
> > "Mean relative difference: 1.997195"
> > int17
> > "Mean relative difference: 2"
> > int18
> > "Mean relative difference: 1.799441"
> > int19
> > "Mean relative difference: 1.571321"
> > int20
> > "Mean relative difference: 1.474492"
> > int21
> > "Mean relative difference: 1.669488"
> > int22
> > "Mean relative difference: 1.465307"
> > int23
> > "Mean relative difference: 1.842191"
> > int24
> > "Mean relative difference: 1.76578"
> > int25
> > "Mean relative difference: 1.481612"
> > int26
> > "Mean relative difference: 1.403655"
> > int27
> > "Mean relative difference: 1.722723"
> > int28
> > "Mean relative difference: 1.403655"
> > int29
> > "Mean relative difference: 2"
> > int30
> > "Mean relative difference: 1.535987"
> > char6
> > "2899128 string mismatches"
> > char7
> > "3008489 string mismatches"
> > char8
> > "2503189 string mismatches"
> > char9
> > "2957002 string mismatches"
> > char10
> > "1933196 string mismatches"
> > char11
> > "1933196 string mismatches"
> >>
> >> ## I re-ran the steps to create the files (almost the same as the last email),
> >> ## but added an "index" equal to 1:nrow(datMod)
> >> ## This index is used to reorder the files to be consistent
> >> LargeOrd = Large[order(Large$index), ]
> >> str(LargeOrd)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of 42 variables:
> > $ index : int 1 2 3 4 5 6 7 8 9 10 ...
> > $ char1 : chr "http://conradhotels3.hilton.com" "
> > http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "
> > http://conradhotels3.hilton.com" ...
> > $ char2 : chr "/en/index.html" "/en/index.html" "/en/index.html"
> > "/en/index.html" ...
> > $ char3 : chr "" "" "" "" ...
> > $ int1 : int 44903 44903 44903 44903 44903 44903 44903 44903 44903
> > 44903
> > ...
> > $ int2 : int 411 411 254 254 336 336 118 118 386 386 ...
> > $ char4 : chr "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> > "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> > $ int3 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int4 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int5 : int 69 69 69 69 69 69 69 68 68 68 ...
> > $ int6 : int 68 68 68 68 68 68 68 67 67 67 ...
> > $ int7 : int 35 35 37 35 35 35 33 38 38 40 ...
> > $ int8 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int9 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int10 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int11 : int 1 1 1 1 1 1 1 1 1 1 ...
> > $ int12 : int 334830 334847 335102 334838 334836 342687 334521 318626
> > 318578 326800 ...
> > $ int13 : int 36 36 37 36 36 36 35 38 37 39 ...
> > $ int14 : int 44 44 49 47 45 45 45 46 45 48 ...
> > $ char5 : chr "" "" "" "" ...
> > $ int15 : int NA NA NA NA NA NA NA NA NA NA ...
> > $ int16 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int17 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int18 : int 2 2 2 2 2 2 2 2 2 2 ...
> > $ int19 : int 1381 1152 424 3728 1772 921 385 725 401 314 ...
> > $ int20 : int 36 36 37 36 36 36 35 38 37 39 ...
> > $ int21 : int 2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> > $ int22 : int 44 44 49 47 45 45 45 46 45 48 ...
> > $ int23 : int 2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> > $ int24 : int 13 13 14 13 13 13 12 13 13 14 ...
> > $ int25 : int 5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> > $ int26 : int 103 103 105 103 103 103 101 105 105 107 ...
> > $ int27 : int 70 183 159 197 217 165 153 232 92 102 ...
> > $ int28 : int 103 103 105 103 103 103 101 105 105 107 ...
> > $ int29 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int30 : int 161 146 200 158 150 160 190 161 163 169 ...
> > $ char6 : chr "Limelight" "Limelight" "Fusepoint/Savvis"
> > "Fusepoint/Savvis" ...
> > $ char7 : chr "Paris" "Paris" "Toronto" "Toronto" ...
> > $ char8 : chr "-1" "-1" "-1" "-1" ...
> > $ char9 : chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> > $ char10: chr "FR" "FR" "CA" "CA" ...
> > $ char11: chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> > - attr(*, ".internal.selfref")=<externalptr>
> >>
> >> ## Here the ordered files come out to be equivalent
> >> mapply(identical, Small, LargeOrd)
> > index char1 char2 char3 int1 int2 char4 int3 int4 int5
> > int6 int7 int8 int9
> > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> > TRUE TRUE TRUE TRUE
> > int10 int11 int12 int13 int14 char5 int15 int16 int17 int18
> > int19 int20 int21 int22
> > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> > TRUE TRUE TRUE TRUE
> > int23 int24 int25 int26 int27 int28 int29 int30 char6 char7
> > char8 char9 char10 char11
> > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> > TRUE TRUE TRUE TRUE
> >> mapply(all.equal, Small, LargeOrd)
> > index char1 char2 char3 int1 int2 char4 int3 int4 int5
> > int6 int7 int8 int9
> > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> > TRUE TRUE TRUE TRUE
> > int10 int11 int12 int13 int14 char5 int15 int16 int17 int18
> > int19 int20 int21 int22
> > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> > TRUE TRUE TRUE TRUE
> > int23 int24 int25 int26 int27 int28 int29 int30 char6 char7
> > char8 char9 char10 char11
> > TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> > TRUE TRUE TRUE TRUE
> >>
> >> ## The inspection results
> >> .Internal(inspect(Small))
> > @0x00000000128068e8 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=42, tl=0)
> > @0x000007ff8a3e0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 1,2,3,4,5,...
> > @0x000007ff4fb30010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > ...
> > @0x000007ff4e380010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > ...
> > @0x000007ff4cbd0010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > ...
> > @0x000007ff88c20010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 44903,44903,44903,44903,44903,...
> > ...
> > ATTRIB:
> > @0x0000000012cab8e0 02 LISTSXP g1c0 [MARK]
> > TAG: @0x0000000000120088 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "names" (has value)
> > @0x0000000016868d68 16 STRSXP g1c7 [MARK,NAM(2)] (len=42, tl=0)
> > @0x0000000010112b98 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "index"
> > @0x0000000016b28fd0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char1"
> > @0x0000000016b291e0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII]
> > [cached] "char2"
> > @0x0000000016b293c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char3"
> > @0x0000000016b29600 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "int1"
> > ...
> > TAG: @0x0000000000120558 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "class" (has value)
> > @0x00000000138b5318 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
> > @0x000000000b42c760 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.table"
> > @0x000000000027d230 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.frame"
> > TAG: @0x0000000000121d98 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000]
> > "row.names" (has value)
> > @0x0000000012c38050 13 INTSXP g1c1 [MARK,NAM(2)] (len=2, tl=0)
> > -2147483648,-3103314
> > TAG: @0x000000001497ac10 01 SYMSXP g1c0 [MARK] ".internal.selfref"
> > @0x0000000012caaa60 22 EXTPTRSXP g1c0 [MARK,NAM(2)]
> >> .Internal(inspect(Large))
> > @0x0000000012c24c68 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=42, tl=0)
> > @0x000007ff314d0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 716234,716235,1007651,2679944,1550732,...
> > @0x000007ff2fd20010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x000000001253d8e0 09 CHARSXP g1c3 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "http://go.compuware.com"
> > @0x000000001253d8e0 09 CHARSXP g1c3 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "http://go.compuware.com"
> > @0x000000001e6d7ab0 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://www.achmeacollectief.nl"
> > @0x000000001e4a59b8 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > https://db3.notify.windows.com"
> > @0x000000001e63ee70 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://www.christushealth.org"
> > ...
> > @0x000007ff2e570010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x00000000200aa218 09 CHARSXP g1c2 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "/default.aspx"
> > @0x000000001e444d78 09 CHARSXP g1c3 [MARK,gp=0x60,ATT] [ASCII]
> > [cached]
> > "/dynaTraceMonitor"
> > @0x000000001eb64790 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/unilever/"
> > @0x000000000feb4e98 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> > "/ping"
> > @0x0000000000124950 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "/"
> > ...
> > @0x000007ff2cdc0010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x0000000017a39430 09 CHARSXP g1c5 [MARK,gp=0x60] [ASCII] [cached]
> > "?rurl=http://frontline.compuware.com//products/BU/default.aspx"
> > @0x000000001b721a50 09 CHARSXP g1c7 [MARK,gp=0x60] [ASCII] [cached]
> > "?url=http%3A%2F%2Fgo.compuware.com%2Fdefault.aspx%3Frurl%3Dhttp%3A%2F%
> > 2Ffrontline.compuware.com
> >
> %2F%2Fproducts%2FBU%2Fdefault.aspx$title=$frames=0$pId=G_1336593601673$fId=G_1336593601673$pFId=$rId=RID_73295254$rpId=1059475658$actions=1%7C_load_%7C-%7C_load_%7C1336593601673%7C1336593602736%7C375%2C2%7C_onload_%7C-%7C_load_%7C1336593602626%7C1336593602704%7C375$domR=1336593602642$dtV=410$3p=
> > www.google-analytics.com
> >
> %7C0%7C0%7C0%7C%7C0%7C0%7C0%7C1%7C828_859%7C31%7C31%7C31%7C0%7C%7C0%7C0%7C0%2Cs%7C828%7C859%7C_load_%7Chttp%253A%252F%
> > 252Fwww.google-analytics.com%252Fga.js%3B2264ff.r.axf8.net
> >
> %7C0%7C0%7C0%7C%7C0%7C0%7C0%7C1%7C953_1078%7C125%7C125%7C125%7C0%7C%7C0%7C0%7C0%2Cs%7C953%7C1078%7C_load_%7Chttp%253A%252F%
> > 252F2264FF.r.axf8.net
> >
> %252Fmr%252Fe.gif%253Finfo%253D%25257Bn%25253Ac%25257Cc%25253A38695455749817%25257Cd%25253A1%25257Ca%25253A2264FF%25257Ch%25253A1%25257Ce%25253A%25257Cb%25253A%25257Cl%25253Ahttp%252524%252A%252524%25252F%
> > 25252Fgo.compuware.com
> >
> %25252Fdefault.aspx%25257Cm%25253A1024%25257Co%25253A768%25257Cp%25253AWin32%25257Cq%25253Ax86%25257Ck%25253Alan%25257Cg%25253AMSIE%25257Cf%25253A8.0%25257D%25257Bn%25253Au%25257Ce%25253A1%25257D%2526a%253D2264FF%2526r%253D1%2526s%253D1$time=1336593603689$"
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > ...
> > @0x000007ff2c1e0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 2812881,2812881,3149757,4286896,3618836,...
> > ...
> > ATTRIB:
> > @0x000000001f163298 02 LISTSXP g1c0 [MARK]
> > TAG: @0x0000000000120088 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "names" (has value)
> > @0x000000001283e0e0 16 STRSXP g1c7 [MARK,NAM(2)] (len=42, tl=0)
> > @0x0000000010112b98 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "index"
> > @0x0000000016b28fd0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char1"
> > @0x0000000016b291e0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII]
> > [cached] "char2"
> > @0x0000000016b293c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char3"
> > @0x0000000016b29600 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "int1"
> > ...
> > TAG: @0x0000000000120558 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "class" (has value)
> > @0x000000001368f078 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
> > @0x000000000b42c760 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.table"
> > @0x000000000027d230 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.frame"
> > TAG: @0x0000000000121d98 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000]
> > "row.names" (has value)
> > @0x000000000fb9a988 13 INTSXP g1c1 [MARK,NAM(2)] (len=2, tl=0)
> > -2147483648,-3103314
> > TAG: @0x000000001497ac10 01 SYMSXP g1c0 [MARK] ".internal.selfref"
> > @0x000000001f163110 22 EXTPTRSXP g1c0 [MARK,NAM(2)]
> > TAG: @0x0000000016c8d648 01 SYMSXP g1c0 [MARK] "sorted"
> > @0x000000001ece5f88 16 STRSXP g1c1 [MARK,NAM(2)] (len=1, tl=0)
> > @0x0000000016b27b78 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char4"
> >> .Internal(inspect(LargeOrd))
> > @0x0000000012b69468 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=42, tl=100)
> > @0x000007ffc4fb0010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 1,2,3,4,5,...
> > @0x000007ffc2c20010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > @0x0000000012114550 09 CHARSXP g1c3 [MARK,gp=0x60] [ASCII] [cached] "
> > http://conradhotels3.hilton.com"
> > ...
> > @0x000007ffc0890010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > @0x00000000205bf0d8 09 CHARSXP g1c2 [MARK,gp=0x60] [ASCII] [cached]
> > "/en/index.html"
> > ...
> > @0x000007ffbe500010 16 STRSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > @0x0000000000120f20 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached]
> ""
> > ...
> > @0x000007ffbcd40010 13 INTSXP g1c7 [MARK,NAM(2)] (len=3103314, tl=0)
> > 44903,44903,44903,44903,44903,...
> > ...
> > ATTRIB:
> > @0x0000000012cec058 02 LISTSXP g1c0 [MARK]
> > TAG: @0x0000000000120088 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "names" (has value)
> > @0x0000000012b60620 16 STRSXP g1c7 [MARK,NAM(2)] (len=42, tl=100)
> > @0x0000000010112b98 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "index"
> > @0x0000000016b28fd0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char1"
> > @0x0000000016b291e0 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII]
> > [cached] "char2"
> > @0x0000000016b293c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "char3"
> > @0x0000000016b29600 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached]
> > "int1"
> > ...
> > TAG: @0x0000000000120558 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000]
> > "class" (has value)
> > @0x0000000013a16be0 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
> > @0x000000000b42c760 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.table"
> > @0x000000000027d230 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached]
> > "data.frame"
> > TAG: @0x0000000000121d98 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000]
> > "row.names" (has value)
> > @0x0000000012c2f2f0 13 INTSXP g1c1 [MARK,NAM(2)] (len=2, tl=0)
> > -2147483648,-3103314
> > TAG: @0x000000001497ac10 01 SYMSXP g1c0 [MARK] ".internal.selfref"
> > @0x0000000012cec170 22 EXTPTRSXP g1c0 [MARK,NAM(2)]
> >>
> >>
> >> ## A little size tester function
> >> ## This will set a key, save the result, print the result's size
> >> keytest = function(dt, key){
> > + setkeyv(dt, key)
> > + save(dt, file='dt_temp.Rdata')
> > + tempfilesize = file.info('dt_temp.Rdata')$size
> > + tempfilesize = formatC(tempfilesize, big.mark=',', format='f',
> > digits=0)
> > + cat(key, tempfilesize, '\n\n')
> > + unlink('dt_temp.Rdata')
> > + invisible(NULL)
> > + }
> >>
> >> str(Small)
> > Classes ‘data.table’ and 'data.frame': 3103314 obs. of 42 variables:
> > $ index : int 1 2 3 4 5 6 7 8 9 10 ...
> > $ char1 : chr "http://conradhotels3.hilton.com" "
> > http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "
> > http://conradhotels3.hilton.com" ...
> > $ char2 : chr "/en/index.html" "/en/index.html" "/en/index.html"
> > "/en/index.html" ...
> > $ char3 : chr "" "" "" "" ...
> > $ int1 : int 44903 44903 44903 44903 44903 44903 44903 44903 44903
> > 44903
> > ...
> > $ int2 : int 411 411 254 254 336 336 118 118 386 386 ...
> > $ char4 : chr "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> > "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> > $ int3 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int4 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int5 : int 69 69 69 69 69 69 69 68 68 68 ...
> > $ int6 : int 68 68 68 68 68 68 68 67 67 67 ...
> > $ int7 : int 35 35 37 35 35 35 33 38 38 40 ...
> > $ int8 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int9 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int10 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int11 : int 1 1 1 1 1 1 1 1 1 1 ...
> > $ int12 : int 334830 334847 335102 334838 334836 342687 334521 318626
> > 318578 326800 ...
> > $ int13 : int 36 36 37 36 36 36 35 38 37 39 ...
> > $ int14 : int 44 44 49 47 45 45 45 46 45 48 ...
> > $ char5 : chr "" "" "" "" ...
> > $ int15 : int NA NA NA NA NA NA NA NA NA NA ...
> > $ int16 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int17 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int18 : int 2 2 2 2 2 2 2 2 2 2 ...
> > $ int19 : int 1381 1152 424 3728 1772 921 385 725 401 314 ...
> > $ int20 : int 36 36 37 36 36 36 35 38 37 39 ...
> > $ int21 : int 2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> > $ int22 : int 44 44 49 47 45 45 45 46 45 48 ...
> > $ int23 : int 2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> > $ int24 : int 13 13 14 13 13 13 12 13 13 14 ...
> > $ int25 : int 5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> > $ int26 : int 103 103 105 103 103 103 101 105 105 107 ...
> > $ int27 : int 70 183 159 197 217 165 153 232 92 102 ...
> > $ int28 : int 103 103 105 103 103 103 101 105 105 107 ...
> > $ int29 : int 0 0 0 0 0 0 0 0 0 0 ...
> > $ int30 : int 161 146 200 158 150 160 190 161 163 169 ...
> > $ char6 : chr "Limelight" "Limelight" "Fusepoint/Savvis"
> > "Fusepoint/Savvis" ...
> > $ char7 : chr "Paris" "Paris" "Toronto" "Toronto" ...
> > $ char8 : chr "-1" "-1" "-1" "-1" ...
> > $ char9 : chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> > $ char10: chr "FR" "FR" "CA" "CA" ...
> > $ char11: chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> > - attr(*, ".internal.selfref")=<externalptr>
> >> keytest(Small, colnames(Small)[1])
> > index 77,694,801
> >
> >> keytest(Small, colnames(Small)[2])
> > char1 75,876,250
> >
> >> keytest(Small, colnames(Small)[3])
> > char2 77,218,972
> >
> >> keytest(Small, colnames(Small)[4])
> > char3 80,585,449
> >
> >> keytest(Small, colnames(Small)[5])
> > int1 77,558,982
> >
> >> keytest(Small, colnames(Small)[6])
> > int2 95,185,248
> >
> >> keytest(Small, colnames(Small)[7])
> > char4 204,037,056
> >
> >> keytest(Small, colnames(Small)[8])
> > int3 206,450,705
> >
> >> keytest(Small, colnames(Small)[9])
> > int4 211,520,888
> >
> >> keytest(Small, colnames(Small)[10])
> > int5 156,095,150
> >
> >>
> >>
> >> keytest(Small, colnames(Small)[11])
> > int6 150,431,716
> >
> >> keytest(Small, colnames(Small)[12])
> > int7 136,077,306
> >
> >> keytest(Small, colnames(Small)[13])
> > int8 134,981,911
> >
> >> keytest(Small, colnames(Small)[14])
> > int9 134,871,952
> >
> >> keytest(Small, colnames(Small)[15])
> > int10 134,678,104
> >
> >> keytest(Small, colnames(Small)[16])
> > int11 134,682,904
> >
> >> keytest(Small, colnames(Small)[17])
> > int12 112,097,493
> >
> >> keytest(Small, colnames(Small)[18])
> > int13 101,734,541
> >
> >> keytest(Small, colnames(Small)[19])
> > int14 101,160,920
> >
> >>
> >
>
>
>