[datatable-help] fread(character string) limited to strings less than 4096 long?
Matthew Dowle
mdowle at mdowle.plus.com
Sat May 11 23:56:45 CEST 2013
Hi,
Have reproduced now, and fixed (commit 862).
* When input is the data as a character string, it is no longer
  truncated to your system's maximum path length, #2649. It was being
  passed through path.expand() even when it wasn't a filename. Many
  thanks to Timothee Carayol for the reproducible report. The limit
  should now be R's character string length limit (2^31-1 bytes = 2GB).
  Test added.
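
To illustrate the idea behind the fix, here is a minimal base R sketch. It is not data.table's actual code: the helper names is_text_input() and resolve_input() are hypothetical, and the rule "contains a newline or is the empty string" is taken from the verbose message quoted further down. The point is only that path.expand() is reserved for inputs that look like filenames, so text input keeps its full length.

```r
# Hypothetical helpers for illustration; not data.table internals.

# TRUE when `input` should be treated as literal data rather than a path:
# it contains a newline or is the empty string.
is_text_input <- function(input) {
  identical(input, "") || grepl("\n", input, fixed = TRUE)
}

# Expand "~" only for filenames; return text input untouched.
resolve_input <- function(input) {
  if (is_text_input(input)) input else path.expand(input)
}

# 10000 repetitions of a 4-character row, well past any path length limit.
long_text <- paste(rep("a\tb\n", 10000), collapse = "")
nchar(resolve_input(long_text)) == nchar(long_text)
```

With this ordering the length of a text input never depends on the system's path length limit, which would explain why the truncation point differed between machines.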
And the persisting nan% in verbose output is also fixed.
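
One plausible cause of the -nan% lines, offered as an assumption rather than a description of the actual fix, is that each stage time was divided by a total of 0.000s, and 0/0 is NaN. A small R sketch of guarding that division follows; stage_percent() is a hypothetical helper, not part of data.table.

```r
# Hypothetical helper for illustration; not part of data.table.
# Returns each stage's share of the total as a percentage, reporting 0
# for every stage when the total is zero instead of dividing 0 by 0.
stage_percent <- function(stage_secs, total_secs) {
  if (total_secs > 0) {
    100 * stage_secs / total_secs
  } else {
    rep(0, length(stage_secs))
  }
}

is.nan(0 / 0)               # the unguarded division yields NaN
stage_percent(c(0, 0), 0)   # guarded: zeros rather than NaN
stage_percent(c(1, 3), 4)   # normal case
```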
Many thanks!
Matthew
On 25.04.2013 08:58, Timothée Carayol wrote:
> Hi -
>
> I thought I'd follow up on this.
> Matthew, are you still unable to reproduce it? It is still happening
> to me after an upgrade to R 3.0.0. And Garrett's case above seems even
> more severe, with a truncation at 256 characters it seems, so it's not
> just me, and it does seem to depend on some sort of system
> configuration.
>
> On Thu, Mar 28, 2013 at 3:26 PM, Timothée Carayol <timothee.carayol at gmail.com [7]> wrote:
>
>> Of course, I'll be happy to help!
>> By the way the verbose output was actually from computer 1 (with
>> 1.8.9) so it seems like the -nan% problem is maybe still there?
>> Cheers
>> Timothée
>>
>> On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle <mdowle at mdowle.plus.com [6]> wrote:
>>
>>> Hi,
>>>
>>> Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with
>>> v1.8.9 should have the -nan% problem fixed.
>>>
>>> I'm a bit stumped for the moment. I've filed a bug report. Probably,
>>> if I still can't reproduce my end, I'll add some more detailed
>>> tracing to verbose output and ask you to try again next week if
>>> that's ok.
>>>
>>> Thanks for reporting!
>>>
>>> Matthew
>>>
>>> On 28.03.2013 14:58, Timothée Carayol wrote:
>>>
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 4092 1022
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 4096 1023
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 4100 1023
>>>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>>>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>>>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>>>> Found 2 columns
>>>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 1023
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>>>> Type codes: 33 (first 5 rows)
>>>> Type codes: 33 (+middle 5 rows)
>>>> Type codes: 33 (+last 5 rows)
>>>> 0.000s (-nan%) Memory map (rerun may be quicker)
>>>> 0.000s (-nan%) sep and header detection
>>>> 0.000s (-nan%) Count rows (wc -l)
>>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>>>> 0.000s (-nan%) Reading data
>>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>>>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>>>> 0.000s (-nan%) Changing na.strings to NA
>>>> 0.000s Total
>>>> 40000 1023
>>>>
>>>> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com [5]> wrote:
>>>>
>>>>> Hm this is odd.
>>>>>
>>>>> Could you run the following and paste back the (verbose) results please.
>>>>> for (n in c(1023:1025, 10000)) {
>>>>>   input = paste( rep('a\tb\n', n), collapse='')
>>>>>   A = fread(input,verbose=TRUE)
>>>>>   cat(nchar(input), nrow(A), "\n")
>>>>> }
>>>>>
>>>>> On 28.03.2013 14:38, Timothée Carayol wrote:
>>>>>
>>>>>> Curiouser and curiouser..
>>>>>>
>>>>>> I can reproduce on two computers with different versions of R and of data.table.
>>>>>>
>>>>>> Computer 1 (it says unknown-linux but is actually ubuntu):
>>>>>>
>>>>>> R version 2.15.3 (2013-03-01)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
>>>>>> [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>>>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0
>>>>>> Computer 2:
>>>>>>
>>>>>> R version 2.15.2 (2012-10-26)
>>>>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>>>>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] data.table_1.8.8
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] tools_2.15.2
>>>>>>
>>>>>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>>>>>>
>>>>>>> Interesting, what's your sessionInfo() please?
>>>>>>>
>>>>>>> For me it seems to work ok :
>>>>>>>
>>>>>>> [1] 1022
>>>>>>> [1] 1023
>>>>>>> [1] 1024
>>>>>>> [1] 9999
>>>>>>>
>>>>>>> > sessionInfo()
>>>>>>> R version 2.15.2 (2012-10-26)
>>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>>
>>>>>>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>>>>>>
>>>>>>>> Agree with Muhammad, longer character strings are definitely permitted in R.
>>>>>>>> A minimal example that shows something strange happening with fread:
>>>>>>>>
>>>>>>>> for (n in c(1023:1025, 10000)) {
>>>>>>>>   A <- fread(
>>>>>>>>     paste(
>>>>>>>>       rep('a\tb\n', n),
>>>>>>>>       collapse=''
>>>>>>>>     ),
>>>>>>>>     sep='\t'
>>>>>>>>   )
>>>>>>>>   print(nrow(A))
>>>>>>>> }
>>>>>>>> On my computer, I obtain:
>>>>>>>>
>>>>>>>> [1] 1022
>>>>>>>> [1] 1023
>>>>>>>> [1] 1023
>>>>>>>> [1] 1023
>>>>>>>>
>>>>>>>> Hope this helps
>>>>>>>> Timothée
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that
>>>>>>>>> the R limit for a character string length? What happens at 4097?
>>>>>>>>> Matthew
>>>>>>>>>
>>>>>>>>> > Hi,
>>>>>>>>> >
>>>>>>>>> > I have an example of a string of 4097 characters which can't be parsed by
>>>>>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is
>>>>>>>>> > that a known limitation?
>>>>>>>>> >
>>>>>>>>> > (If I write the string to a file and then fread the file name, it works
>>>>>>>>> > too.)
>>>>>>>>> >
>>>>>>>>> > Let me know if you need the string and/or a bug report.
>>>>>>>>> >
>>>>>>>>> > Thanks
>>>>>>>>> > Timothée
>>>>>>>>> >
>>>>>>>>> > _______________________________________________
>>>>>>>>> > datatable-help mailing list
>>>>>>>>> > datatable-help at lists.r-forge.r-project.org [1]
>>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
Links:
------
[1] mailto:datatable-help at lists.r-forge.r-project.org
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com
[5] mailto:mdowle at mdowle.plus.com
[6] mailto:mdowle at mdowle.plus.com
[7] mailto:timothee.carayol at gmail.com