[datatable-help] fread(character string) limited to strings less than 4096 long?

Thu Mar 28 16:19:52 CET 2013

Hi, 

Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with
v1.8.9 should have the -nan% problem fixed.

I'm a bit stumped for the moment. I've filed a bug report. Probably, if
I still can't reproduce at my end, I'll add some more detailed tracing
to verbose output and ask you to try again next week if that's ok.
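
One observation from the output quoted below: the row counts are at least consistent with only the first 4096 characters of the string being parsed. That is a guess, not a diagnosis. Here is a rough base R sketch of the arithmetic; the 4096 cap and the variable names are assumptions for illustration, not anything taken from fread's source:

```r
# Each record 'a\tb\n' is 4 characters: 'a', tab, 'b', newline.
line <- "a\tb\n"
chars_per_line <- nchar(line)

# The four input sizes used in the test loop.
n <- c(1023, 1024, 1025, 10000)
input_chars <- n * chars_per_line

# The first line is taken as column names, so n lines should give n - 1 rows.
expected_rows <- n - 1

# ASSUMPTION (hypothetical model): only the first 4096 characters are seen.
cap <- 4096
lines_seen <- pmin(input_chars, cap) %/% chars_per_line
modelled_rows <- lines_seen - 1

print(data.frame(n, input_chars, expected_rows, modelled_rows))
```

Under that assumption the modelled row counts come out as 1022, 1023, 1023, 1023, which matches the four nrow() values reported, so a 4k limit somewhere on the character-input path looks like a plausible suspect.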

Thanks for reporting!

Matthew 

On 28.03.2013 14:58, Timothée Carayol wrote: 

> Input contains a \n (or is ""), taking this to be text input (not a filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 4092 1022
> Input contains a \n (or is ""), taking this to be text input (not a filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 4096 1023
> Input contains a \n (or is ""), taking this to be text input (not a filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 4100 1023
> Input contains a \n (or is ""), taking this to be text input (not a filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
> Found 2 columns
> First row with 2 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 1023
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> 0.000s (-nan%) Memory map (rerun may be quicker)
> 0.000s (-nan%) sep and header detection
> 0.000s (-nan%) Count rows (wc -l)
> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
> 0.000s (-nan%) Reading data
> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
> 0.000s (-nan%) Coercing data already read in type bumps (if any)
> 0.000s (-nan%) Changing na.strings to NA
> 0.000s Total
> 40000 1023
> 
> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com [5]> wrote:
> 
>> Hm this is odd.
>>
>> Could you run the following and paste back the (verbose) results please.
>>
>> for (n in c(1023:1025, 10000)) {
>>   input = paste( rep('a\tb\n', n), collapse='')
>>   A = fread(input,verbose=TRUE)
>>   cat(nchar(input), nrow(A), "\n")
>> }
>> 
>> On 28.03.2013 14:38, Timothée Carayol wrote:

>> 
>>> Curiouser and curiouser..
>>>
>>> I can reproduce on two computers with different versions of R and of data.table.
>>>
>>> Computer 1 (it says unknown-linux but is actually ubuntu):
>>>
>>> R version 2.15.3 (2013-03-01)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
>>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0
>>>
>>> Computer 2:
>>>
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] data.table_1.8.8
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.15.2
>>>
>>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>>> 
>>>> Interesting, what's your sessionInfo() please?
>>>>
>>>> For me it seems to work ok :
>>>>
>>>> [1] 1022
>>>> [1] 1023
>>>> [1] 1024
>>>> [1] 9999
>>>>
>>>> > sessionInfo()
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>> 
>>>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>>> 
>>>>> Agree with Muhammad, longer character strings are definitely permitted in R.
>>>>> A minimal example that shows something strange happening with fread:
>>>>>
>>>>> for (n in c(1023:1025, 10000)) {
>>>>>   A <- fread(
>>>>>     paste(
>>>>>       rep('a\tb\n', n),
>>>>>       collapse=''
>>>>>     ),
>>>>>     sep='\t'
>>>>>   )
>>>>>   print(nrow(A))
>>>>> }
>>>>>
>>>>> On my computer, I obtain:
>>>>>
>>>>> [1] 1022
>>>>> [1] 1023
>>>>> [1] 1023
>>>>> [1] 1023
>>>>>
>>>>> Hope this helps
>>>>> Timothée
>>>>> 
>>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that
>>>>>> the R limit for a character string length? What happens at 4097?
>>>>>> Matthew
>>>>>> 
>>>>>> > Hi,
>>>>>> >
>>>>>> > I have an example of a string of 4097 characters which can't be parsed by
>>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is
>>>>>> > that a known limitation?
>>>>>> >
>>>>>> > (If I write the string to a file and then fread the file name, it works
>>>>>> > too.)
>>>>>> >
>>>>>> > Let me know if you need the string and/or a bug report.
>>>>>> >
>>>>>> > Thanks
>>>>>> > Timothée
>>>>>> > _______________________________________________
>>>>>> > datatable-help mailing list
>>>>>> > datatable-help at lists.r-forge.r-project.org [1]
>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]

Links:
------
[1] mailto:datatable-help at lists.r-forge.r-project.org
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com
[5] mailto:mdowle at mdowle.plus.com