[datatable-help] fread(character string) limited to strings less than 4096 long?

Timothée Carayol timothee.carayol at gmail.com
Thu Mar 28 16:26:38 CET 2013


Of course, I'll be happy to help!

By the way, the verbose output was actually from computer 1 (with v1.8.9), so
it seems the -nan% problem may still be there.
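
As a cross-check on the numbers in the verbose output quoted below, here is a
small base-R sketch of the sizes the test loop should produce. Each repetition
of 'a\tb\n' is 4 characters, so n repetitions give 4*n characters and, with the
first line taken as column names, n-1 data rows. The helper name
expected_sizes is hypothetical and only for illustration.

```r
# Expected input length and row count for the test strings used in this thread.
# Base R only; fread itself is not called here.
expected_sizes <- function(n) {
  input <- paste(rep("a\tb\n", n), collapse = "")
  data.frame(
    n             = n,
    chars         = nchar(input),  # 4 characters per repetition
    expected_rows = n - 1          # first line is treated as column names
  )
}

sizes <- do.call(rbind, lapply(c(1023:1025, 10000), expected_sizes))
print(sizes)
```

n = 1024 is the case that gives exactly 4096 characters. The reported row count
stays at 1023 for the two longer inputs, which would be consistent with only
the first 4096 characters being read, though that is an inference from the
output rather than a confirmed diagnosis.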

Cheers
Timothée


On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> Hi,
>
> Thanks.  That was from v1.8.8 on computer 2 I hope.  Computer 1 with
> v1.8.9 should have the -nan% problem fixed.
>
> I'm a bit stumped for the moment.  I've filed a bug report.  Probably, if
> I still can't reproduce on my end, I'll add some more detailed tracing to
> verbose output and ask you to try again next week if that's ok.
>
> Thanks for reporting!
>
> Matthew
>
>
>
> On 28.03.2013 14:58, Timothée Carayol wrote:
>
>   Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
>
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 1023
>
> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data
> rows
> Type codes: 33 (first 5 rows)
>
> Type codes: 33 (+middle 5 rows)
>
> Type codes: 33 (+last 5 rows)
>
>    0.000s (-nan%) Memory map (rerun may be quicker)
>
>    0.000s (-nan%) sep and header detection
>
>    0.000s (-nan%) Count rows (wc -l)
>
>    0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>
>    0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM
>
>    0.000s (-nan%) Reading data
>
>    0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
>    0.000s (-nan%) Coercing data already read in type bumps (if any)
>
>    0.000s (-nan%) Changing na.strings to NA
>
>    0.000s        Total
>
> 4092 1022
>
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
>  Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
>
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 1023
>
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
> rows
> Type codes: 33 (first 5 rows)
>
> Type codes: 33 (+middle 5 rows)
>
> Type codes: 33 (+last 5 rows)
>
>    0.000s (-nan%) Memory map (rerun may be quicker)
>
>    0.000s (-nan%) sep and header detection
>
>    0.000s (-nan%) Count rows (wc -l)
>
>    0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>
>    0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>
>    0.000s (-nan%) Reading data
>
>    0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
>    0.000s (-nan%) Coercing data already read in type bumps (if any)
>
>    0.000s (-nan%) Changing na.strings to NA
>
>    0.000s        Total
>
> 4096 1023
>
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
>
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 1023
>
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
> rows
> Type codes: 33 (first 5 rows)
>
> Type codes: 33 (+middle 5 rows)
>
> Type codes: 33 (+last 5 rows)
>
>    0.000s (-nan%) Memory map (rerun may be quicker)
>
>    0.000s (-nan%) sep and header detection
>
>    0.000s (-nan%) Count rows (wc -l)
>
>    0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>
>    0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>
>    0.000s (-nan%) Reading data
>
>    0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
>    0.000s (-nan%) Coercing data already read in type bumps (if any)
>
>    0.000s (-nan%) Changing na.strings to NA
>
>    0.000s        Total
>
> 4100 1023
>
> Input contains a \n (or is ""), taking this to be text input (not a
> filename)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>
> Using line 30 to detect sep (the last non blank line in the first 30) ...
> '\t'
> Found 2 columns
>
> First row with 2 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 1023
>
> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
> rows
> Type codes: 33 (first 5 rows)
>
> Type codes: 33 (+middle 5 rows)
>
> Type codes: 33 (+last 5 rows)
>
>    0.000s (-nan%) Memory map (rerun may be quicker)
>
>    0.000s (-nan%) sep and header detection
>
>    0.000s (-nan%) Count rows (wc -l)
>
>    0.000s (-nan%) Column type detection (first, middle and last 5 rows)
>
>    0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM
>
>    0.000s (-nan%) Reading data
>
>    0.000s (-nan%) Allocation for type bumps (if any), including gc time if
> triggered
>    0.000s (-nan%) Coercing data already read in type bumps (if any)
>
>    0.000s (-nan%) Changing na.strings to NA
>
>    0.000s        Total
>
> 40000 1023
>
>
>
> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>> Hm this is odd.
>>
>> Could you run the following and paste back the (verbose) results please.
>> for (n in c(1023:1025, 10000)) {
>>
>>  input = paste( rep('a\tb\n', n), collapse='')
>>  A = fread(input,verbose=TRUE)
>>  cat(nchar(input), nrow(A), "\n")
>> }
>>
>>
>>
>>
>>
>> On 28.03.2013 14:38, Timothée Carayol wrote:
>>
>>  Curiouser and curiouser..
>>
>> I can reproduce on two computers with different versions of R and of
>> data.table.
>>
>>
>>
>> Computer 1 (it says unknown-linux but is actually Ubuntu):
>>
>> R version 2.15.3 (2013-03-01)
>>
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>
>>
>> locale:
>>
>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>>
>>
>> attached base packages:
>>
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>>
>>
>> other attached packages:
>>
>> [1] bit64_0.9-2      bit_1.1-10       data.table_1.8.9 colorout_1.0-0
>>
>> Computer 2:
>>  R version 2.15.2 (2012-10-26)
>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] data.table_1.8.8
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.2
>>
>>
>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>>
>>>
>>> Interesting, what's your sessionInfo() please?
>>>
>>> For me it seems to work ok :
>>>
>>> [1] 1022
>>> [1] 1023
>>> [1] 1024
>>> [1] 9999
>>>
>>> > sessionInfo()
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>>
>>>
>>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>>
>>>  Agree with Muhammad, longer character strings are definitely permitted
>>> in R.
>>> A minimal example that shows something strange happening with fread:
>>> library(data.table)
>>> for (n in c(1023:1025, 10000)) {
>>>   A <- fread(
>>>     paste(rep('a\tb\n', n), collapse=''),
>>>     sep='\t'
>>>   )
>>>   print(nrow(A))
>>> }
>>> On my computer, I obtain:
>>>  [1] 1022
>>> [1] 1023
>>> [1] 1023
>>> [1] 1023
>>>  Hope this helps
>>> Timothée
>>>
>>>
>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>
>>>> Hi,
>>>> Nice to hear from you. Nope, not known to me. Obviously 4096 is 4k; is
>>>> that the R limit for a character string length? What happens at 4097?
>>>> Matthew
>>>>
>>>> > Hi,
>>>> >
>>>> > I have an example of a string of 4097 characters which can't be parsed
>>>> > by fread; however, if I remove any character, it can be parsed just
>>>> > fine. Is that a known limitation?
>>>> >
>>>> > (If I write the string to a file and then fread the file name, it works
>>>> > too.)
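
A minimal sketch of the file-based workaround described in the quoted
paragraph above. It assumes a data.table version in which fread accepts a file
path as its first argument. The synthetic input, the temporary file and the
explicit header = TRUE are assumptions for illustration; this is not the
original 4097-character string.

```r
library(data.table)

# Synthetic tab-separated text longer than 4096 characters (1025 * 4 = 4100).
input <- paste(rep("a\tb\n", 1025), collapse = "")

# Write the text to a temporary file and read it back by file name
# instead of passing the string itself to fread.
tmp <- tempfile(fileext = ".tsv")
cat(input, file = tmp)
A <- fread(tmp, sep = "\t", header = TRUE)  # first line used as column names
unlink(tmp)

print(nrow(A))
```

Comparing nrow(A) here with the result of fread(input) on the same string is
one way to tell whether the text-input path or the file path is affected.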
>>>> >
>>>> > Let me know if you need the string and/or a bug report.
>>>> >
>>>> > Thanks
>>>> > Timothée
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>