[datatable-help] fread(character string) limited to strings less than 4096 long?

Matthew Dowle mdowle at mdowle.plus.com
Sat May 11 23:56:45 CEST 2013


 

Hi, 

Have reproduced now, and fixed (commit 862). 

 * When input is the data as a character string, it is no longer truncated
   to your system's maximum path length, #2649. It was being passed through
   path.expand() even when it wasn't a filename. Many thanks to Timothee
   Carayol for the reproducible report. The limit should now be R's
   character string length limit (2^31-1 bytes = 2GB). Test added.
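
A rough sketch of the idea behind the fix, in base R only. The helper names looks_like_text() and prepare_input() are hypothetical and are not data.table's actual internals; the newline test simply mirrors the verbose message 'Input contains a \n (or is "")' quoted further down. It assumes the input is a single character string.

```r
# Hypothetical helpers for illustration; not data.table's actual source.

# TRUE when the input should be treated as data rather than a filename:
# it is the empty string or contains a newline.
looks_like_text <- function(input) {
  identical(input, "") || grepl("\n", input, fixed = TRUE)
}

# Expand "~" only for filenames; text input is passed through untouched,
# so it is never subject to the system's path length limit.
prepare_input <- function(input) {
  if (looks_like_text(input)) input else path.expand(input)
}

# 10000 lines of "a<TAB>b", 4 characters each including the newline.
input <- paste(rep("a\tb\n", 10000), collapse = "")
prepared <- prepare_input(input)

# Both lengths should match: nothing was truncated on the way through.
cat(nchar(input), nchar(prepared), "\n")
```

Keeping the filename check ahead of path.expand() means only genuine paths are ever expanded, which is the behaviour the item above describes.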

And the persisting nan% in verbose output is also fixed.

Many thanks!


Matthew 

On 25.04.2013 08:58, Timothée Carayol wrote: 

> Hi -
>
> I thought I'd follow up on this.
> Matthew, are you still unable to reproduce it? It is still happening to me
> after an upgrade to R 3.0.0. And Garrett's case above seems even more
> severe, with a truncation at 256 characters it seems, so it's not just me,
> and it does seem to depend on some sort of system configuration.
> 
> On Thu, Mar 28, 2013 at 3:26 PM, Timothée Carayol <timothee.carayol at gmail.com [7]> wrote:
>
>> Of course, I'll be happy to help!
>> By the way the verbose output was actually from computer 1 (with 1.8.9) so
>> it seems like the -nan% problem is maybe still there?
>> Cheers
>> Timothée
>> 
>> On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle <mdowle at mdowle.plus.com [6]> wrote:
>>
>>> Hi,
>>>
>>> Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with v1.8.9
>>> should have the -nan% problem fixed.
>>>
>>> I'm a bit stumped for the moment. I've filed a bug report. Probably, if I
>>> still can't reproduce my end, I'll add some more detailed tracing to
>>> verbose output and ask you to try again next week if that's ok.
>>>
>>> Thanks for reporting!
>>>
>>> Matthew
>>> 
>>> On 28.03.2013 14:58, Timothée Carayol wrote:
>>> 
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 4092 1022
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 4096 1023
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 4100 1023
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 40000 1023
>>>> 
>>>> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com [5]> wrote:
>>>>
>>>>> Hm this is odd.
>>>>>
>>>>> Could you run the following and paste back the (verbose) results please.
>>>>>
>>>>> for (n in c(1023:1025, 10000)) {
>>>>>   input = paste( rep('a\tb\n', n), collapse='')
>>>>>   A = fread(input,verbose=TRUE)
>>>>>   cat(nchar(input), nrow(A), "\n")
>>>>> }
>>>>> 
>>>>> On 28.03.2013 14:38, Timothée Carayol wrote:
>>>>>
>>>>>> Curiouser and curiouser..
>>>>>>
>>>>>> I can reproduce on two computers with different versions of R and of
>>>>>> data.table.
>>>>>>
>>>>>> Computer 1 (it says unknown-linux but is actually ubuntu):
>>>>>>
>>>>>> R version 2.15.3 (2013-03-01)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
>>>>>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>>>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0
>>>>>>
>>>>>> Computer 2:
>>>>>>
>>>>>> R version 2.15.2 (2012-10-26)
>>>>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>>>>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] data.table_1.8.8
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] tools_2.15.2
>>>>>> 
>>>>>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>>>>>>
>>>>>>> Interesting, what's your sessionInfo() please?
>>>>>>>
>>>>>>> For me it seems to work ok :
>>>>>>>
>>>>>>> [1] 1022
>>>>>>> [1] 1023
>>>>>>> [1] 1024
>>>>>>> [1] 9999
>>>>>>>
>>>>>>> > sessionInfo()
>>>>>>> R version 2.15.2 (2012-10-26)
>>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>> 
>>>>>>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>>>>>>
>>>>>>>> Agree with Muhammad, longer character strings are definitely permitted
>>>>>>>> in R.
>>>>>>>> A minimal example that shows something strange happening with fread:
>>>>>>>>
>>>>>>>> for (n in c(1023:1025, 10000)) {
>>>>>>>>   A <- fread(
>>>>>>>>     paste(
>>>>>>>>       rep('a\tb\n', n),
>>>>>>>>       collapse=''
>>>>>>>>     ),
>>>>>>>>     sep='\t'
>>>>>>>>   )
>>>>>>>>   print(nrow(A))
>>>>>>>> }
>>>>>>>>
>>>>>>>> On my computer, I obtain:
>>>>>>>>
>>>>>>>> [1] 1022
>>>>>>>> [1] 1023
>>>>>>>> [1] 1023
>>>>>>>> [1] 1023
>>>>>>>>
>>>>>>>> Hope this helps
>>>>>>>> Timothée
>>>>>>>> 
>>>>>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that
>>>>>>>>> the R limit for a character string length? What happens at 4097?
>>>>>>>>> Matthew
>>>>>>>>>
>>>>>>>>> > Hi,
>>>>>>>>> >
>>>>>>>>> > I have an example of a string of 4097 characters which can't be parsed by
>>>>>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is
>>>>>>>>> > that a known limitation?
>>>>>>>>> >
>>>>>>>>> > (If I write the string to a file and then fread the file name, it works
>>>>>>>>> > too.)
>>>>>>>>> >
>>>>>>>>> > Let me know if you need the string and/or a bug report.
>>>>>>>>> >
>>>>>>>>> > Thanks
>>>>>>>>> > Timothée
>>>>>>>>> >
>>>>>>>>> > _______________________________________________
>>>>>>>>> > datatable-help mailing list
>>>>>>>>> > datatable-help at lists.r-forge.r-project.org [1]
>>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]

 

Links:
------
[1] mailto:datatable-help at lists.r-forge.r-project.org
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com
[5] mailto:mdowle at mdowle.plus.com
[6] mailto:mdowle at mdowle.plus.com
[7] mailto:timothee.carayol at gmail.com