[datatable-help] fread(character string) limited to strings less than 4096 long?
Timothée Carayol
timothee.carayol at gmail.com
Thu Apr 25 09:58:32 CEST 2013
Hi –
I thought I'd follow up on this.
Matthew, are you still unable to reproduce it? It still happens for me
after an upgrade to R 3.0.0. Garrett's case above seems even more severe,
with truncation at 256 characters, so it's not just me, and it does seem
to depend on some sort of system configuration.
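For anyone who wants to check where the truncation kicks in on their own
system, here is a minimal sketch along the lines of the tests quoted further
down. It assumes data.table is installed; check_fread_string is a made-up
helper name (not part of data.table), and header = FALSE is set only so that
every line counts as a data row. Each 'a\tb\n' row is 4 characters, so 64 rows
make 256 characters and 1024 rows make 4096, the two limits reported so far.

```r
library(data.table)

# Hypothetical helper (not part of data.table): build n copies of the
# 4-character row "a\tb\n", parse the string with fread(), and report
# how many characters went in and how many rows came out.
check_fread_string <- function(n) {
  input <- paste(rep("a\tb\n", n), collapse = "")
  # header = FALSE is an assumption made here so that rows should equal n
  A <- fread(input, sep = "\t", header = FALSE)
  data.frame(n = n, nchar = nchar(input), rows = nrow(A))
}

# Sizes chosen to straddle 256 characters (64 rows) and 4096 (1024 rows)
sizes <- c(63:65, 1023:1025, 10000)
res <- do.call(rbind, lapply(sizes, check_fread_string))

# Flag any size where fewer rows came back than were put in
res$truncated <- res$rows < res$n
print(res)
```

If the truncated column is TRUE from some size onwards, that size times 4 gives
the character limit on that machine.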
On Thu, Mar 28, 2013 at 3:26 PM, Timothée Carayol <
timothee.carayol at gmail.com> wrote:
> Of course, I'll be happy to help!
>
> By the way the verbose output was actually from computer 1 (with 1.8.9) so
> it seems like the -nan% problem is maybe still there?
>
> Cheers
> Timothée
>
>
> On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>>
>> Hi,
>>
>> Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with
>> v1.8.9 should have the -nan% problem fixed.
>>
>> I'm a bit stumped for the moment. I've filed a bug report. Probably, if
>> I still can't reproduce my end, I'll add some more detailed tracing to
>> verbose output and ask you to try again next week if that's ok.
>>
>> Thanks for reporting!
>>
>> Matthew
>>
>>
>>
>> On 28.03.2013 14:58, Timothée Carayol wrote:
>>
>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>> Found 2 columns
>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 1023
>> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data rows
>> Type codes: 33 (first 5 rows)
>> Type codes: 33 (+middle 5 rows)
>> Type codes: 33 (+last 5 rows)
>> 0.000s (-nan%) Memory map (rerun may be quicker)
>> 0.000s (-nan%) sep and header detection
>> 0.000s (-nan%) Count rows (wc -l)
>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM
>> 0.000s (-nan%) Reading data
>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>> 0.000s (-nan%) Changing na.strings to NA
>> 0.000s Total
>> 4092 1022
>>
>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>> Found 2 columns
>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 1023
>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>> Type codes: 33 (first 5 rows)
>> Type codes: 33 (+middle 5 rows)
>> Type codes: 33 (+last 5 rows)
>> 0.000s (-nan%) Memory map (rerun may be quicker)
>> 0.000s (-nan%) sep and header detection
>> 0.000s (-nan%) Count rows (wc -l)
>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>> 0.000s (-nan%) Reading data
>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>> 0.000s (-nan%) Changing na.strings to NA
>> 0.000s Total
>> 4096 1023
>>
>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>> Found 2 columns
>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 1023
>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>> Type codes: 33 (first 5 rows)
>> Type codes: 33 (+middle 5 rows)
>> Type codes: 33 (+last 5 rows)
>> 0.000s (-nan%) Memory map (rerun may be quicker)
>> 0.000s (-nan%) sep and header detection
>> 0.000s (-nan%) Count rows (wc -l)
>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>> 0.000s (-nan%) Reading data
>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>> 0.000s (-nan%) Changing na.strings to NA
>> 0.000s Total
>> 4100 1023
>>
>> Input contains a \n (or is ""), taking this to be text input (not a filename)
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Using line 30 to detect sep (the last non blank line in the first 30) ... '\t'
>> Found 2 columns
>> First row with 2 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 1023
>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows
>> Type codes: 33 (first 5 rows)
>> Type codes: 33 (+middle 5 rows)
>> Type codes: 33 (+last 5 rows)
>> 0.000s (-nan%) Memory map (rerun may be quicker)
>> 0.000s (-nan%) sep and header detection
>> 0.000s (-nan%) Count rows (wc -l)
>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>> 0.000s (-nan%) Reading data
>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered
>> 0.000s (-nan%) Coercing data already read in type bumps (if any)
>> 0.000s (-nan%) Changing na.strings to NA
>> 0.000s Total
>> 40000 1023
>>
>>
>>
>> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>>
>>>
>>> Hm this is odd.
>>>
>>> Could you run the following and paste back the (verbose) results please.
>>> for (n in c(1023:1025, 10000)) {
>>>   input = paste( rep('a\tb\n', n), collapse='')
>>>   A = fread(input,verbose=TRUE)
>>>   cat(nchar(input), nrow(A), "\n")
>>> }
>>>
>>>
>>>
>>>
>>>
>>> On 28.03.2013 14:38, Timothée Carayol wrote:
>>>
>>> Curiouser and curiouser..
>>>
>>> I can reproduce on two computers with different versions of R and of
>>> data.table.
>>>
>>>
>>>
>>> Computer 1 (it says unknown-linux but is actually ubuntu):
>>> R version 2.15.3 (2013-03-01)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
>>> LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>> LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0
>>>
>>> Computer 2:
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] data.table_1.8.8
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.15.2
>>>
>>>
>>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>
>>>>
>>>>
>>>> Interesting, what's your sessionInfo() please?
>>>>
>>>> For me it seems to work ok :
>>>>
>>>> [1] 1022
>>>> [1] 1023
>>>> [1] 1024
>>>> [1] 9999
>>>>
>>>> > sessionInfo()
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>
>>>>
>>>>
>>>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>>>
>>>> Agree with Muhammad, longer character strings are definitely
>>>> permitted in R.
>>>> A minimal example that shows something strange happening with fread:
>>>> library(data.table)
>>>> for (n in c(1023:1025, 10000)) {
>>>>   A <- fread(
>>>>     paste(
>>>>       rep('a\tb\n', n),
>>>>       collapse=''
>>>>     ),
>>>>     sep='\t'
>>>>   )
>>>>   print(nrow(A))
>>>> }
>>>> On my computer, I obtain:
>>>> [1] 1022
>>>> [1] 1023
>>>> [1] 1023
>>>> [1] 1023
>>>> Hope this helps
>>>> Timothée
>>>>
>>>>
>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>>
>>>>> Hi,
>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that
>>>>> the R limit for a character string length? What happens at 4097?
>>>>> Matthew
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have an example of a string of 4097 characters which can't be parsed by
>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is
>>>>> > that a known limitation?
>>>>> >
>>>>> > (If I write the string to a file and then fread the file name, it works
>>>>> > too.)
>>>>> >
>>>>> > Let me know if you need the string and/or a bug report.
>>>>> >
>>>>> > Thanks
>>>>> > Timothée
>>>>> > _______________________________________________
>>>>> > datatable-help mailing list
>>>>> > datatable-help at lists.r-forge.r-project.org
>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
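The workaround mentioned in the very first message of the thread (write the
string to a file, then fread the file name) could look roughly like this. It is
only a sketch, assuming data.table is installed; the temporary file and the
1025-row input are illustrative choices, not the original 4097-character
string, and header = FALSE is set so that every line counts as a data row.

```r
library(data.table)

# Illustrative input: 1025 rows of "a\tb\n", i.e. 4100 characters,
# just past the 4096-character point discussed in the thread.
input <- paste(rep("a\tb\n", 1025), collapse = "")

# Write the string to a temporary file and pass fread() the file name
# instead of the string itself.
tmp <- tempfile(fileext = ".tsv")
cat(input, file = tmp)
A <- fread(tmp, sep = "\t", header = FALSE)
unlink(tmp)

# Characters written and rows read back
cat(nchar(input), nrow(A), "\n")
```

Going through a file sidesteps whatever limit applies to the string input, at
the cost of a round trip to disk.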