[datatable-help] Memory issue
Matthew Dowle
mdowle at mdowle.plus.com
Wed Oct 17 11:53:50 CEST 2012
Very interesting, thanks. I've not seen anything like this before. Perhaps
some kind of UTF-8/ASCII conversion somewhere?
As a next step, please run the following and send the output:
load('test0.Rdata')
Small = copy(datMod)
load('test1.Rdata')
Large = copy(datMod)
mapply(identical,Small,Large)
mapply(all.equal,Small,Large)
.Internal(inspect(Small))
.Internal(inspect(Large))
Also, if you setkey on an int column (rather than a chr column), does that
also increase the file size?
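A minimal sketch of that int-versus-chr comparison on synthetic data. The table `DT`, its column names and the helper `saved_size` are made up for illustration and are not taken from the thread; the real test is the same three saves on datMod.

```r
library(data.table)

# Hypothetical helper: save an object to a temporary .Rdata file
# and return the resulting file size in bytes.
saved_size <- function(obj) {
  f <- tempfile(fileext = ".Rdata")
  on.exit(unlink(f))
  save(obj, file = f)
  file.info(f)$size
}

set.seed(1)
n <- 1e5
# Synthetic stand-in for datMod: a chr column, an int column,
# and a chr column whose values arrive in long runs.
DT <- data.table(
  chr_col = sprintf("2012-05-09 %05d", sample.int(n)),
  int_col = sample.int(n),
  site    = rep(c("http://a.example", "http://b.example"), each = n / 2)
)

unkeyed <- saved_size(DT)

by_int <- copy(DT)
setkey(by_int, int_col)   # key on an int column
by_chr <- copy(DT)
setkey(by_chr, chr_col)   # key on a chr column

sizes <- c(unkeyed = unkeyed,
           int_key = saved_size(by_int),
           chr_key = saved_size(by_chr))
print(sizes)
```

If both keyed sizes grow by a similar amount, the row reordering itself is the more likely factor rather than anything specific to chr keys.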
> Matt
>
> I made a much simpler example that only involves the first data.table
>
> Also, although I had a POSIX date before, this example just has the text
> for the date.
>
> It appears that the longer text columns are causing a problem.
>
> I'm saving as an RData file, and I also tried saveRDS at the end, but it
> made no difference.
>
> Now I'm more convinced that the problem is in data.table, but I'm not
> ruling out user error.
>
>> ## I was able to reproduce a simpler example
>> ## without the second data.table
>>
>> ## Here is the data (with generic column names)
>> str(datMod)
> Classes data.table and 'data.frame': 3103314 obs. of 41 variables:
> $ char1 : chr "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" "http://conradhotels3.hilton.com" ...
> $ char2 : chr "/en/index.html" "/en/index.html" "/en/index.html" "/en/index.html" ...
> $ char3 : chr "" "" "" "" ...
> $ int1 : int 44903 44903 44903 44903 44903 44903 44903 44903 44903 44903 ...
> $ int2 : int 411 411 254 254 336 336 118 118 386 386 ...
> $ char4 : chr "2012-05-09 20:17:40.587" "2012-05-09 21:17:54.427" "2012-05-09 20:10:49.560" "2012-05-09 21:11:05.107" ...
> $ int3 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int4 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int5 : int 69 69 69 69 69 69 69 68 68 68 ...
> $ int6 : int 68 68 68 68 68 68 68 67 67 67 ...
> $ int7 : int 35 35 37 35 35 35 33 38 38 40 ...
> $ int8 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int9 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int10 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int11 : int 1 1 1 1 1 1 1 1 1 1 ...
> $ int12 : int 334830 334847 335102 334838 334836 342687 334521 318626 318578 326800 ...
> $ int13 : int 36 36 37 36 36 36 35 38 37 39 ...
> $ int14 : int 44 44 49 47 45 45 45 46 45 48 ...
> $ char5 : chr "" "" "" "" ...
> $ int15 : int NA NA NA NA NA NA NA NA NA NA ...
> $ int16 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int17 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int18 : int 2 2 2 2 2 2 2 2 2 2 ...
> $ int19 : int 1381 1152 424 3728 1772 921 385 725 401 314 ...
> $ int20 : int 36 36 37 36 36 36 35 38 37 39 ...
> $ int21 : int 2199 2201 1492 1448 2559 2529 1084 1432 1876 1984 ...
> $ int22 : int 44 44 49 47 45 45 45 46 45 48 ...
> $ int23 : int 2203 2188 1199 1162 2324 2346 821 897 1386 1189 ...
> $ int24 : int 13 13 14 13 13 13 12 13 13 14 ...
> $ int25 : int 5166 5761 3755 3794 5614 7779 2830 3971 4637 5871 ...
> $ int26 : int 103 103 105 103 103 103 101 105 105 107 ...
> $ int27 : int 70 183 159 197 217 165 153 232 92 102 ...
> $ int28 : int 103 103 105 103 103 103 101 105 105 107 ...
> $ int29 : int 0 0 0 0 0 0 0 0 0 0 ...
> $ int30 : int 161 146 200 158 150 160 190 161 163 169 ...
> $ char6 : chr "Limelight" "Limelight" "Fusepoint/Savvis" "Fusepoint/Savvis" ...
> $ char7 : chr "Paris" "Paris" "Toronto" "Toronto" ...
> $ char8 : chr "-1" "-1" "-1" "-1" ...
> $ char9 : chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> $ char10: chr "FR" "FR" "CA" "CA" ...
> $ char11: chr "FRANCE" "FRANCE" "CANADA" "CANADA" ...
> - attr(*, ".internal.selfref")=<externalptr>
>>
>> ## Here is the size when you save that file
>> save(datMod, file='test0.Rdata')
>> originalfilesize = file.info('test0.Rdata')$size
>> formatC(originalfilesize, big.mark=',', format='f', digits=0)
> [1] "71,085,933"
>>
>> ## Here is the size after you set the key
>> setkey(datMod, char4)
>> save(datMod, file='test1.Rdata')
>> newfilesize = file.info('test1.Rdata')$size
>> formatC(newfilesize, big.mark=',', format='f', digits=0)
> [1] "195,406,633"
>>
>> ## Some of the character columns contain very long strings
>> datMod[,range(nchar(char2))]
> [1] 1 1606
>> datMod[,range(nchar(char3))]
> [1] 0 2048
>>
>> ## Removing the long columns reduces the file size, but not back to the original
>> datMod$char2 = NULL
>> datMod$char3 = NULL
>>
>> save(datMod, file='test2.Rdata')
>> secondfilesize = file.info('test2.Rdata')$size
>> formatC(secondfilesize, big.mark=',', format='f', digits=0)
> [1] "121,237,355"
>>
>> ## Saving with saveRDS instead gives almost exactly the same size
>> saveRDS(datMod, file='test2.Rds')
>> secondfilesizeRDS = file.info('test2.Rds')$size
>> formatC(secondfilesizeRDS, big.mark=',', format='f', digits=0)
> [1] "121,237,288"
>
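The save / file.info / formatC steps repeated in the quoted session could be wrapped in a small helper, sketched below in base R. The names `save_and_size` and `fmt_bytes` and the toy data frame are hypothetical. The `compress` switch is an extra check that is not part of the thread: it only shows how much of a saved size is down to compression.

```r
# Hypothetical helper: save `obj` to `path` and return the file size in bytes.
# Holding the path in one variable means the name is spelled the same way
# for both save() and file.info().
save_and_size <- function(obj, path, compress = TRUE) {
  save(obj, file = path, compress = compress)
  file.info(path)$size
}

# Format a byte count with thousands separators, as in the quoted session.
fmt_bytes <- function(x) formatC(x, big.mark = ",", format = "f", digits = 0)

# Toy stand-in for datMod (an assumption, not the real data).
df <- data.frame(id  = 1:1000,
                 txt = rep(c("FRANCE", "CANADA"), 500),
                 stringsAsFactors = FALSE)

path <- file.path(tempdir(), "sketch.Rdata")
compressed   <- save_and_size(df, path)                    # save() compresses by default
uncompressed <- save_and_size(df, path, compress = FALSE)  # same object, no compression
print(fmt_bytes(c(compressed, uncompressed)))
unlink(path)
```

Running the same two calls on datMod before and after setkey would show whether the uncompressed size changes at all.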