[datatable-help] fread(character string) limited to strings less than 4096 long?
G See
gsee000 at gmail.com
Thu Mar 28 16:23:34 CET 2013
FWIW, on mac:
> for (n in c(1023:1025, 10000)) {
+ A <- fread(
+ paste(
+ rep('a\tb\n', n),
+ collapse=''
+ ),
+ sep='\t'
+ )
+ print(nrow(A))
+ }
[1] 255
[1] 255
[1] 255
[1] 255
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.9
####### and with verbose
> for (n in c(1023:1025, 10000)) {
+ input = paste( rep('a\tb\n', n), collapse='')
+ A = fread(input,verbose=TRUE)
+ cat(nchar(input), nrow(A), "\n")
+ }
Input contains a \n (or is ""), taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first
row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 255
Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
0.000s ( 14%) Memory map (rerun may be quicker)
0.000s ( 25%) sep and header detection
0.000s ( 8%) Count rows (wc -l)
0.000s ( 24%) Column type detection (first, middle and last 5 rows)
0.000s ( 6%) Allocation of 255x2 result (xMB) in RAM
0.000s ( 22%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time
if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 1%) Changing na.strings to NA
0.000s Total
4092 255
Input contains a \n (or is ""), taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first
row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 255
Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
0.000s ( 10%) Memory map (rerun may be quicker)
0.000s ( 21%) sep and header detection
0.000s ( 10%) Count rows (wc -l)
0.000s ( 28%) Column type detection (first, middle and last 5 rows)
0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM
0.000s ( 26%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time
if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 2%) Changing na.strings to NA
0.000s Total
4096 255
Input contains a \n (or is ""), taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first
row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 255
Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
0.000s ( 10%) Memory map (rerun may be quicker)
0.000s ( 21%) sep and header detection
0.000s ( 10%) Count rows (wc -l)
0.000s ( 27%) Column type detection (first, middle and last 5 rows)
0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM
0.000s ( 27%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time
if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 1%) Changing na.strings to NA
0.000s Total
4100 255
Input contains a \n (or is ""), taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first
row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 255
Subtracted 0 for last eol and any trailing empty lines, leaving 255 data rows
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
0.000s ( 10%) Memory map (rerun may be quicker)
0.000s ( 23%) sep and header detection
0.000s ( 10%) Count rows (wc -l)
0.000s ( 25%) Column type detection (first, middle and last 5 rows)
0.000s ( 3%) Allocation of 255x2 result (xMB) in RAM
0.000s ( 26%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time
if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 3%) Changing na.strings to NA
0.000s Total
40000 255
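
For comparison, here is a minimal base-R sketch of the row counts one would expect from the loop above. Each repeat of 'a\tb\n' is one line and the first line is taken as the column names, so n repeats should give n - 1 data rows. The helper names count_newlines and expected_rows are made up for illustration. The second part assumes data.table is installed and tries the write-to-file route reported to work further down the thread; its output is not shown here since it may differ between versions.

```r
# Count newline characters in a single string using base R only.
count_newlines <- function(x) {
  sum(charToRaw(x) == charToRaw("\n"))
}

# n repeats of 'a\tb\n' give n lines; the first is treated as the header,
# so the expected number of data rows is n - 1.
expected_rows <- function(n) {
  input <- paste(rep("a\tb\n", n), collapse = "")
  count_newlines(input) - 1L
}

ns <- c(1023:1025, 10000)
expected <- vapply(ns, expected_rows, integer(1))
print(expected)

# Hypothetical check of the file-based route (only runs if data.table is
# available): write the same text to a temporary file and fread the file name.
if (requireNamespace("data.table", quietly = TRUE)) {
  input <- paste(rep("a\tb\n", 10000), collapse = "")
  tmp <- tempfile(fileext = ".tsv")
  writeLines(input, tmp, sep = "")  # input already ends with a newline
  B <- data.table::fread(tmp, sep = "\t")
  cat(nchar(input), nrow(B), "\n")
  unlink(tmp)
}
```

The expected vector is 1022, 1023, 1024, 9999, which is what the Windows run quoted below reports, so the 255 and 1023 ceilings above are both short of the real line count.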
Best,
Garrett
On Thu, Mar 28, 2013 at 10:19 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>
> Hi,
>
> Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with v1.8.9
> should have the -nan% problem fixed.
>
> I'm a bit stumped for the moment. I've filed a bug report. Probably, if I
> still can't reproduce my end, I'll add some more detailed tracing to verbose
> output and ask you to try again next week if that's ok.
>
> Thanks for reporting!
>
> Matthew
>
>
>
> On 28.03.2013 14:58, Timothée Carayol wrote:
>
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data
> rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 4092 1022
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
> rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 4096 1023
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
> rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 4100 1023
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
> rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 40000 1023
>
>
> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>>
>>
>>
>> Hm this is odd.
>>
>> Could you run the following and paste back the (verbose) results please.
>>
>> for (n in c(1023:1025, 10000)) {
>>
>> input = paste( rep('a\tb\n', n), collapse='')
>> A = fread(input,verbose=TRUE)
>> cat(nchar(input), nrow(A), "\n")
>> }
>>
>>
>>
>>
>>
>> On 28.03.2013 14:38, Timothée Carayol wrote:
>>
>> Curiouser and curiouser..
>>
>> I can reproduce on two computers with different versions of R and of
>> data.table.
>>
>>
>>
>> Computer 1 (it says unknown-linux but is actually ubuntu):
>>
>> R version 2.15.3 (2013-03-01)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> LC_MONETARY=en_GB.UTF-8
>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C
>> LC_ADDRESS=C
>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0
>> Computer 2:
>> R version 2.15.2 (2012-10-26)
>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] data.table_1.8.8
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.2
>>
>>
>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com>
>> wrote:
>>>
>>>
>>>
>>> Interesting, what's your sessionInfo() please?
>>>
>>> For me it seems to work ok :
>>>
>>> [1] 1022
>>> [1] 1023
>>> [1] 1024
>>> [1] 9999
>>>
>>> > sessionInfo()
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>>
>>>
>>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>>
>>> Agree with Muhammad, longer character strings are definitely permitted in
>>> R.
>>> A minimal example that shows something strange happening with fread:
>>> for (n in c(1023:1025, 10000)) {
>>> A <- fread(
>>> paste(
>>> rep('a\tb\n', n),
>>> collapse=''
>>> ),
>>> sep='\t'
>>> )
>>> print(nrow(A))
>>> }
>>> On my computer, I obtain:
>>> [1] 1022
>>> [1] 1023
>>> [1] 1023
>>> [1] 1023
>>> Hope this helps
>>> Timothée
>>>
>>>
>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com>
>>> wrote:
>>>>
>>>> Hi,
>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that
>>>> the R limit for a character string length? What happens at 4097?
>>>> Matthew
>>>>
>>>> > Hi,
>>>> >
>>>> > I have an example of a string of 4097 characters which can't be parsed by
>>>> > fread; however, if I remove any character, it can be parsed just fine. Is
>>>> > that a known limitation?
>>>> >
>>>> > (If I write the string to a file and then fread the file name, it works
>>>> > too.)
>>>> >
>>>> > Let me know if you need the string and/or a bug report.
>>>> >
>>>> > Thanks
>>>> > Timothée
>>>> > _______________________________________________
>>>> > datatable-help mailing list
>>>> > datatable-help at lists.r-forge.r-project.org
>>>> >
>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
>