[datatable-help] Memory issue

Matthew Dowle mdowle at mdowle.plus.com
Wed Oct 17 11:57:26 CEST 2012


Also, could you check whether the file size increase still shows up with
head(datMod,100)?  If so, the smaller object could be attached to a bug
report so that I can reproduce it.
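
Something along these lines might do for that check. This is only a sketch: the stand-in table, its values and the helper name saved_size are made up for illustration, and with the real data the first step would simply be small <- head(datMod, 100).

```r
library(data.table)

# Hypothetical helper: size in bytes of an object once written with save()
saved_size <- function(obj) {
  f <- tempfile(fileext = ".RData")
  on.exit(unlink(f))
  save(obj, file = f)
  file.info(f)$size
}

# Stand-in data (assumption); with the real data use: small <- head(datMod, 100)
set.seed(1)
small <- data.table(
  char4 = format(as.POSIXct("2012-05-09", tz = "UTC") + sample(1e5, 100)),
  int1  = sample(100L)
)

before <- saved_size(small)

keyed <- copy(small)      # copy so the unkeyed table stays untouched
setkey(keyed, char4)      # key on the character column, as in the report
after <- saved_size(keyed)

# Compare the two sizes; a jump here would make a compact bug report
c(before = before, after = after)
```

If the two numbers differ noticeably on the real subset, that small object is all that is needed to reproduce.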

>
> Very interesting, thanks. I've not seen anything like this before. Perhaps
> some kind of UTF-8/ASCII conversion somewhere?
>
> Next step, please run and send output of :
>
> load('test0.Rdata')
> Small = copy(datMod)
> load('test1.Rdata')
> Large = copy(datMod)
> mapply(identical,Small,Large)
> mapply(all.equal,Small,Large)
> .Internal(inspect(Small))
> .Internal(inspect(Large))
>
> Also if you do the setkey on any int column (rather than chr), does that
> also increase the file size?
>
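
To narrow down which column is responsible, a per-column size comparison might also help alongside the checks above. The sketch below is an assumption on my part rather than an established data.table diagnostic: col_sizes is a hypothetical helper, the two tables are made-up stand-ins for the loaded Small and Large, and the key is set on an int column only to mirror the question about int versus chr keys.

```r
library(data.table)

# Hypothetical helper: serialized size of each column in bytes,
# both raw and after gzip compression in memory
col_sizes <- function(DT) {
  t(sapply(DT, function(col) {
    raw_bytes <- serialize(col, NULL)
    c(raw  = length(raw_bytes),
      gzip = length(memCompress(raw_bytes, "gzip")))
  }))
}

# Stand-in data (assumption); in practice Small and Large come from
# load('test0.Rdata') and load('test1.Rdata') as described above
set.seed(42)
Small <- data.table(
  chr = sample(c("alpha", "beta", "gamma"), 1000, replace = TRUE),
  int = sample(1000L)
)
Large <- copy(Small)
setkey(Large, int)        # key on an int column rather than a chr column

same_order  <- mapply(identical, Small, Large)  # column-by-column equality
sizes_small <- col_sizes(Small)
sizes_large <- col_sizes(Large)

same_order
sizes_small
sizes_large
```

Comparing the raw and gzip columns of the two matrices row by row shows whether any growth sits in the data itself or only appears after compression.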
>> Matt
>>
>> I made a much simpler example that only involves the first data.table
>>
>> Also, although I had a POSIX date before, this example just has the text
>> for the date.
>>
>> It appears that the longer text columns are causing a problem.
>>
>> I'm saving as an RData file, and I also tried using Rds at the end, but
>> it made no difference.
>>
>> Now I'm more convinced that the problem is in data.table, but I'm not
>> ruling out user error.
>>
>>> ## I was able to reproduce a simpler example
>>> ## without the second data.table
>>>
>>> ## Here is the data (with generic column names)
>>> str(datMod)
>> Classes ‘data.table’ and 'data.frame': 3103314 obs. of  41 variables:
>>  $ char1 : chr  "http://conradhotels3.hilton.com" "
>> http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "
>> http://conradhotels3.hilton.com" ...
>>  $ char2 : chr  "/en/index.html" "/en/index.html" "/en/index.html" "/en/index.html" ...
>>  $ char3 : chr  "" "" "" "" ...
>>  $ int1  : int  44903 44903 44903 44903 44903 44903 44903 44903 44903 44903 ...
>>  $ int2  : int  411 411 254 254 336 336 118 118 386 386 ...
>>  $ char4 : chr  "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427" "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
>>  $ int3  : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int4  : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int5  : int  69 69 69 69 69 69 69 68 68 68 ...
>>  $ int6  : int  68 68 68 68 68 68 68 67 67 67 ...
>>  $ int7  : int  35 35 37 35 35 35 33 38 38 40 ...
>>  $ int8  : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int9  : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int10 : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int11 : int  1 1 1 1 1 1 1 1 1 1 ...
>>  $ int12 : int  334830 334847 335102 334838 334836 342687 334521 318626 318578 326800 ...
>>  $ int13 : int  36 36 37 36 36 36 35 38 37 39 ...
>>  $ int14 : int  44 44 49 47 45 45 45 46 45 48 ...
>>  $ char5 : chr  "" "" "" "" ...
>>  $ int15 : int  NA NA NA NA NA NA NA NA NA NA ...
>>  $ int16 : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int17 : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int18 : int  2 2 2 2 2 2 2 2 2 2 ...
>>  $ int19 : int  1381 1152 424 3728 1772 921 385 725 401 314 ...
>>  $ int20 : int  36 36 37 36 36 36 35 38 37 39 ...
>>  $ int21 : int  2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
>>  $ int22 : int  44 44 49 47 45 45 45 46 45 48 ...
>>  $ int23 : int  2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
>>  $ int24 : int  13 13 14 13 13 13 12 13 13 14 ...
>>  $ int25 : int  5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
>>  $ int26 : int  103 103 105 103 103 103 101 105 105 107 ...
>>  $ int27 : int  70 183 159 197 217 165 153 232 92 102 ...
>>  $ int28 : int  103 103 105 103 103 103 101 105 105 107 ...
>>  $ int29 : int  0 0 0 0 0 0 0 0 0 0 ...
>>  $ int30 : int  161 146 200 158 150 160 190 161 163 169 ...
>>  $ char6 : chr  "Limelight" "Limelight" "Fusepoint/Savvis" "Fusepoint/Savvis" ...
>>  $ char7 : chr  "Paris" "Paris" "Toronto" "Toronto" ...
>>  $ char8 : chr  "-1" "-1" "-1" "-1" ...
>>  $ char9 : chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
>>  $ char10: chr  "FR" "FR" "CA" "CA" ...
>>  $ char11: chr  "FRANCE" "FRANCE" "CANADA" "CANADA" ...
>>  - attr(*, ".internal.selfref")=<externalptr>
>>>
>>> ## Here is the size when you save that file
>>> save(datMod, file='test0.Rdata')
>>> originalfilesize = file.info('test0.Rdata')$size
>>> formatC(originalfilesize, big.mark=',', format='f', digits=0)
>> [1] "71,085,933"
>>>
>>> ## Here is the size after you set the key
>>> setkey(datMod, char4)
>>> save(datMod, file='test1.Rdata')
>>> newfilesize = file.info('test1.Rdata')$size
>>> formatC(newfilesize, big.mark=',', format='f', digits=0)
>> [1] "195,406,633"
>>>
>>> ## Some of the character columns contain very long strings
>>> datMod[,range(nchar(char2))]
>> [1]    1 1606
>>> datMod[,range(nchar(char3))]
>> [1]    0 2048
>>>
>>> ## If I remove the long columns it helps reduce the file size
>>> datMod$char2 = NULL
>>> datMod$char3 = NULL
>>>
>>> save(datMod, file='test2.Rdata')
>>> secondfilesize = file.info('test2.Rdata')$size
>>> formatC(secondfilesize, big.mark=',', format='f', digits=0)
>> [1] "121,237,355"
>>>
>>> ## Using RDS doesn't matter
>>> saveRDS(datMod, file='test2.Rds')
>>> secondfilesizeRDS = file.info('test2.Rds')$size
>>> formatC(secondfilesizeRDS, big.mark=',', format='f', digits=0)
>> [1] "121,237,288"
>>
>
>