[datatable-help] Memory issue

Gene Leynes gleynes+r at gmail.com
Tue Oct 23 16:56:19 CEST 2012


Judging from the unusual silence I'm guessing that this doesn't have an
obvious solution.  I can't provide the data, and in my simulated data I
don't get the same error.

I'll do some more testing today and see if I can isolate the problem.

I have a suspicion about the cause, and I'll test that. The problem seems
related to one particularly messy text field. I bet that some combination
of characters is causing the problem, or that some of the values are too
long.
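
A rough sketch of how that text field could be inspected, using only base R.
The vector `txt` is a made-up stand-in for the real column (which can't be
shared), and `has_non_ascii` is a hypothetical helper written for this
example, not something from data.table or geneorama.

```r
## Hypothetical stand-in for the messy text column
txt <- c("", "/en/index.html", "?rurl=http://example.com/a%2Fb", "caf\u00e9", NA)

## Length of each string in bytes; unusually long values stand out here.
## nchar() reports 2 for NA when type = "bytes", so reset those to NA.
n_bytes <- nchar(txt, type = "bytes")
n_bytes[is.na(txt)] <- NA
summary(n_bytes)

## Declared encoding of each string ("unknown", "latin1", "UTF-8" or "bytes")
table(Encoding(txt))

## Flag strings that contain at least one byte outside the ASCII range
has_non_ascii <- function(x) {
  vapply(x, function(s) {
    !is.na(s) && any(as.integer(charToRaw(s)) > 127L)
  }, logical(1), USE.NAMES = FALSE)
}
txt[has_non_ascii(txt)]
```

If the long or non-ASCII values cluster in a few rows, those rows would be
the first candidates to pull out and save separately.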

To isolate the problem, I could split the file and see whether any one part
blows up in size when I save it.
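
Something along these lines might work for the split-and-save check. It is
only a sketch in base R: `dat` is simulated placeholder data, and the chunk
count of 4 is arbitrary. Each chunk is written with `save()` to a temporary
file, and the size on disk is compared with the size in memory.

```r
## Simulated placeholder data; the real table has about 3.1 million rows
set.seed(1)
dat <- data.frame(
  index = 1:1000,
  char3 = sample(c("", "/ping", "?url=http%3A%2F%2Fexample.com"),
                 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

## Assign each row to one of n_chunks contiguous blocks
n_chunks <- 4
chunk_id <- cut(seq_len(nrow(dat)), breaks = n_chunks, labels = FALSE)
chunks <- split(dat, chunk_id)

## Save each chunk to a temporary file; record in-memory and on-disk sizes
sizes <- do.call(rbind, lapply(names(chunks), function(nm) {
  piece <- chunks[[nm]]
  f <- tempfile(fileext = ".RData")
  save(piece, file = f)
  out <- data.frame(chunk = nm,
                    rows = nrow(piece),
                    mem_bytes = as.numeric(object.size(piece)),
                    file_bytes = file.info(f)$size)
  unlink(f)
  out
}))

## A chunk whose ratio is far above the others is the one to split further
sizes$ratio <- sizes$file_bytes / sizes$mem_bytes
sizes
```

Repeating the split on whichever chunk looks worst should narrow things down
to a small set of rows fairly quickly.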

Thanks, and my apologies for being unable to send a good example.

On Thursday, October 18, 2012, Gene Leynes wrote:

> Ok, here is my very lengthy reply with lots of diagnostics.
>
>
> >
> > ## Clear the workspace
> > rm(list=ls())
> >
> > ## I use a function called "loader" to load single data objects
> > if(!require('geneorama')){
> +   source('https://raw.github.com/geneorama/geneorama/master/R/loader.R')
> +   cat('loading function \"loader\"')
> + }
> >
> > ## Load the data
> > Small = loader('test0')
> > Large = loader('test1')
> >
> > ## The two files will be different because their order is different
> > str(Small)
> Classes ‘data.table’ and 'data.frame': 3103314 obs. of  42 variables:
>  $ index : int  1 2 3 4 5 6 7 8 9 10 ...
>  $ char1 : chr  "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" ...
>  $ char2 : chr  "/en/index.html" "/en/index.html" "/en/index.html"
> "/en/index.html" ...
>  $ char3 : chr  "" "" "" "" ...
>  $ int1  : int  44903 44903 44903 44903 44903 44903 44903 44903 44903
> 44903 ...
>  $ int2  : int  411 411 254 254 336 336 118 118 386 386 ...
>  $ char4 : chr  "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427"
> "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
>  $ int3  : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int4  : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int5  : int  69 69 69 69 69 69 69 68 68 68 ...
>  $ int6  : int  68 68 68 68 68 68 68 67 67 67 ...
>  $ int7  : int  35 35 37 35 35 35 33 38 38 40 ...
>  $ int8  : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int11 : int  1 1 1 1 1 1 1 1 1 1 ...
>  $ int12 : int  334830 334847 335102 334838 334836 342687 334521 318626
> 318578 326800 ...
>  $ int13 : int  36 36 37 36 36 36 35 38 37 39 ...
>  $ int14 : int  44 44 49 47 45 45 45 46 45 48 ...
>  $ char5 : chr  "" "" "" "" ...
>  $ int15 : int  NA NA NA NA NA NA NA NA NA NA ...
>  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int18 : int  2 2 2 2 2 2 2 2 2 2 ...
>  $ int19 : int  1381 1152 424 3728 1772 921 385 725 401 314 ...
>  $ int20 : int  36 36 37 36 36 36 35 38 37 39 ...
>  $ int21 : int  2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
>  $ int22 : int  44 44 49 47 45 45 45 46 45 48 ...
>  $ int23 : int  2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
>  $ int24 : int  13 13 14 13 13 13 12 13 13 14 ...
>  $ int25 : int  5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
>  $ int26 : int  103 103 105 103 103 103 101 105 105 107 ...
>  $ int27 : int  70 183 159 197 217 165 153 232 92 102 ...
>  $ int28 : int  103 103 105 103 103 103 101 105 105 107 ...
>  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int30 : int  161 146 200 158 150 160 190 161 163 169 ...
>  $ char6 : chr  "Limelight" "Limelight" "Fusepoint/Savvis"
> "Fusepoint/Savvis" ...
>  $ char7 : chr  "Paris" "Paris" "Toronto" "Toronto" ...
>  $ char8 : chr  "-1" "-1" "-1" "-1" ...
>  $ char9 : chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
>  $ char10: chr  "FR" "FR" "CA" "CA" ...
> > str(Large)
> Classes ‘data.table’ and 'data.frame': 3103314 obs. of  42 variables:
>  $ index : int  716234 716235 1007651 2679944 1550732 1932010 2879445
> 1007670 1736006 666363 ...
>  $ char1 : chr  "http://go.compuware.com" "http://go.compuware.com" "http://www.achmeacollectief.nl" "https://db3.notify.windows.com" ...
>  $ char2 : chr  "/default.aspx" "/dynaTraceMonitor" "/unilever/" "/ping"
> ...
>  $ char3 : chr  "?rurl=http://frontline.compuware.com//products/BU/default.aspx" "?url=http%3A%2F%2Fgo.compuware.com%2Fdefault.aspx%3Frurl%3Dhttp%3A%2F%2Ffrontline.compuware.com%2F%2Fproducts%2FBU%2Fdefault.as"| __truncated__ "" "" ...
>  $ int1  : int  2812881 2812881 3149757 4286896 3618836 3861870 4315803
> 3149760 3779387 2754629 ...
>  $ int2  : int  133 133 133 133 340 340 326 133 133 340 ...
>  $ char4 : chr  "2012-05-09 20:00:00.000" "2012-05-09 20:00:00.000"
> "2012-05-09 20:00:00.000" "2012-05-09 20:00:00.000" ...
>  $ int3  : int  0 1 0 0 0 0 0 0 0 0 ...
>  $ int4  : int  2264 2496 1782 461 1953 1418 641 1207 167 278 ...
>  $ int5  : int  26 20 6 1 71 64 1 6 1 15 ...
>  $ int6  : int  26 20 6 1 69 64 1 6 1 15 ...
>  $ int7  : int  2 2 4 0 2 12 0 2 0 0 ...
>  $ int8  : int  0 0 0 0 2 0 0 0 0 0 ...
>  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int11 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int12 : int  392752 417195 43107 0 1419015 1031349 187344 62969 43
> 428189 ...
>  $ int13 : int  4 4 5 1 8 22 1 3 1 1 ...
>  $ int14 : int  9 11 8 1 17 38 1 6 1 15 ...
>  $ char5 : chr  "" "" "" "" ...
>  $ int15 : int  NA NA NA NA 0 NA NA NA NA 0 ...
>  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int18 : int  2 28 3 0 0 1 0 1 0 0 ...
>  $ int19 : int  137 0 136 298 277 255 147 141 137 209 ...
>  $ int20 : int  4 0 5 1 8 22 1 3 1 1 ...
>  $ int21 : int  945 612 59 22 689 1153 54 29 13 59 ...
>  $ int22 : int  9 5 8 1 17 38 1 6 1 15 ...
>  $ int23 : int  0 0 0 118 0 0 0 0 0 0 ...
>  $ int24 : int  0 0 0 1 0 0 0 0 0 0 ...
>  $ int25 : int  3243 2653 1585 22 3292 3076 64 1043 13 81 ...
>  $ int26 : int  28 22 10 1 73 76 1 8 1 15 ...
>  $ int27 : int  2060 3365 257 1 3304 1038 376 258 4 80 ...
>  $ int28 : int  28 22 10 1 73 76 1 8 1 15 ...
>  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ int30 : int  921 750 203 578 609 1078 234 187 31 140 ...
>  $ char6 : chr  "Interoute" "Interoute" "Interoute" "Interoute" ...
>  $ char7 : chr  "Amsterdam" "Amsterdam" "Amsterdam" "Amsterdam" ...
>  $ char8 : chr  "-1" "-1" "-1" "-1" ...
>  $ char9 : chr  "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" ...
>  $ char10: chr  "NL" "NL" "NL" "NL" ...
>  $ char11: chr  "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" "NETHERLANDS" ...
>  - attr(*, ".internal.selfref")=<externalptr>
>  - attr(*, "sorted")= chr "char4"
> >
> > ## The difference is shown here
> > mapply(identical, Small, Large)
>  index  char1  char2  char3   int1   int2  char4   int3   int4   int5
> int6   int7   int8   int9
>  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
>  FALSE  FALSE  FALSE  FALSE
>  int10  int11  int12  int13  int14  char5  int15  int16  int17  int18
>  int19  int20  int21  int22
>  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
>  FALSE  FALSE  FALSE  FALSE
>  int23  int24  int25  int26  int27  int28  int29  int30  char6  char7
>  char8  char9 char10 char11
>  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
>  FALSE  FALSE  FALSE  FALSE
> > mapply(all.equal, Small, Large)
>                                                          index
>                          "Mean relative difference: 0.6660698"
>                                                          char1
>                                    "3100674 string mismatches"
>                                                          char2
>                                    "2961621 string mismatches"
>                                                          char3
>                                    "1753352 string mismatches"
>                                                           int1
>                          "Mean relative difference: 0.2945024"
>                                                           int2
>                          "Mean relative difference: 0.4866954"
>                                                          char4
>                                    "3103308 string mismatches"
>                                                           int3
>                           "Mean relative difference: 1.759713"
>                                                           int4
>                           "Mean relative difference: 1.408616"
>                                                           int5
>                           "Mean relative difference: 1.411817"
>                                                           int6
>                           "Mean relative difference: 1.415648"
>                                                           int7
>                           "Mean relative difference: 1.705137"
>                                                           int8
>                           "Mean relative difference: 1.954795"
>                                                           int9
>                            "Mean relative difference: 1.99701"
>                                                          int10
>                           "Mean relative difference: 1.995529"
>                                                          int11
>                                  "Mean relative difference: 2"
>                                                          int12
>                           "Mean relative difference: 1.479043"
>                                                          int13
>                           "Mean relative difference: 1.323619"
>                                                          int14
>                           "Mean relative difference: 1.360022"
>                                                          char5
>                                    "1454309 string mismatches"
>                                                          int15
> "'is.NA' value mismatch: 2260789 in current 2260789 in target"
>                                                          int16
>                           "Mean relative difference: 1.997195"
>                                                          int17
>                                  "Mean relative difference: 2"
>                                                          int18
>                           "Mean relative difference: 1.799441"
>                                                          int19
>
>