[Seqinr-commits] r1900 - www/src/mainmatter

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Thu Jun 2 16:02:00 CEST 2016


Author: jeanlobry
Date: 2016-06-02 16:01:59 +0200 (Thu, 02 Jun 2016)
New Revision: 1900

Modified:
   www/src/mainmatter/getseqflat.rnw
   www/src/mainmatter/getseqflat.tex
Log:
use of gzcon for read.fasta
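
The behaviour behind this log entry can be reproduced offline. The following is a minimal sketch in base R, not the code of this commit: the temporary file and the variable names (gz_path, fasta_lines, magic, decoded) are made up for illustration, and a binary file connection stands in for the URL connection used in the vignette. It shows that a binary connection exposes the raw gzip bytes, and that wrapping the same kind of connection in gzcon() yields the decompressed text.

```r
# Illustrative names only: write a tiny gzipped FASTA file to a temp path.
fasta_lines <- c(">smallAA    A very small AA file in FASTA format",
                 "SEQINRSEQINRSEQINRSEQINR*")
gz_path <- tempfile(fileext = ".fasta.gz")
gz_out <- gzfile(gz_path, open = "wt")
writeLines(fasta_lines, gz_out)
close(gz_out)

# Read in binary mode: no decompression happens, so the first two
# bytes are the gzip magic number (1f 8b).
magic <- readBin(gz_path, what = "raw", n = 2)

# Wrap a binary connection in gzcon() to decompress on the fly,
# as the vignette does with url(myurl).
gz_in <- gzcon(file(gz_path, open = "rb"))
decoded <- readLines(gz_in)
close(gz_in)

print(magic)
print(identical(decoded, fasta_lines))
```

The same wrapping should apply to any binary connection that delivers gzip data, which is why the vignette passes gzcon(url(myurl)) to read.fasta().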

Modified: www/src/mainmatter/getseqflat.rnw
===================================================================
--- www/src/mainmatter/getseqflat.rnw	2016-06-02 12:58:01 UTC (rev 1899)
+++ www/src/mainmatter/getseqflat.rnw	2016-06-02 14:01:59 UTC (rev 1900)
@@ -97,6 +97,49 @@
 read.fasta(aafile, seqtype = "AA", as.string = TRUE, set.attributes = FALSE)
 @
 
+\subsubsection{Compressed file example}
+
+The original file before compression looks like:
+
+<<examplegzip1,eval=T>>=
+uncompressed <- system.file("sequences/smallAA.fasta", package = "seqinr")
+cat(readLines(uncompressed), sep = "\n")
+@
+
+The compressed file looks garbled when read as raw bytes because it is
+binary, but \texttt{readLines()} detects the gzip compression and still
+reads it correctly:
+
+<<examplegzip2,eval=T>>=
+compressed <- system.file("sequences/smallAA.fasta.gz", package = "seqinr")
+readChar(compressed, nchar = 1000, useBytes = TRUE)
+cat(readLines(compressed), sep = "\n")
+@
+
+We can therefore import the sequences directly from a gzipped file:
+
+<<examplegzip3,eval=T>>=
+res1 <- read.fasta(uncompressed)
+res2 <- read.fasta(compressed)
+identical(res1, res2)
+@
+
+This automatic decompression works well for local files but is no longer
+active when you read the data from a URL, for instance:
+
+<<ftpgz1,eval=T>>=
+myurl <- "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/plasmid.1.rna.fna.gz"
+try.res <- try(read.fasta(myurl))
+try.res
+@
+
+A simple workaround is to wrap the URL connection in \texttt{gzcon()}:
+
+<<ftpgz2,eval=T>>=
+myseq <- read.fasta(gzcon(url(myurl)))
+getName(myseq)
+@
+
 \subsection{The function \texttt{write.fasta()}}
 
 This function writes sequences to a file in FASTA format.

Modified: www/src/mainmatter/getseqflat.tex
===================================================================
--- www/src/mainmatter/getseqflat.tex	2016-06-02 12:58:01 UTC (rev 1899)
+++ www/src/mainmatter/getseqflat.tex	2016-06-02 14:01:59 UTC (rev 1900)
@@ -330,6 +330,89 @@
 \end{Soutput}
 \end{Schunk}
 
+\subsubsection{Compressed file example}
+
+The original file before compression looks like:
+
+\begin{Schunk}
+\begin{Sinput}
+ uncompressed <- system.file("sequences/smallAA.fasta", package = "seqinr")
+ cat(readLines(uncompressed), sep = "\n")
+\end{Sinput}
+\begin{Soutput}
+>smallAA    A very small AA file in FASTA format
+SEQINRSEQINRSEQINRSEQINR*
+\end{Soutput}
+\end{Schunk}
+
+The compressed file looks garbled when read as raw bytes because it is
+binary, but \texttt{readLines()} detects the gzip compression and still
+reads it correctly:
+
+\begin{Schunk}
+\begin{Sinput}
+ compressed <- system.file("sequences/smallAA.fasta.gz", package = "seqinr")
+ readChar(compressed, nchar = 1000, useBytes = TRUE)
+\end{Sinput}
+\begin{Soutput}
+[1] "\037\x8b\b\b\xd4\024PW"
+\end{Soutput}
+\begin{Sinput}
+ cat(readLines(compressed), sep = "\n")
+\end{Sinput}
+\begin{Soutput}
+>smallAA    A very small AA file in FASTA format
+SEQINRSEQINRSEQINRSEQINR*
+\end{Soutput}
+\end{Schunk}
+
+We can therefore import the sequences directly from a gzipped file:
+
+\begin{Schunk}
+\begin{Sinput}
+ res1 <- read.fasta(uncompressed)
+ res2 <- read.fasta(compressed)
+ identical(res1, res2)
+\end{Sinput}
+\begin{Soutput}
+[1] TRUE
+\end{Soutput}
+\end{Schunk}
+
+This automatic decompression works well for local files but is no longer
+active when you read the data from a URL, for instance:
+
+\begin{Schunk}
+\begin{Sinput}
+ myurl <- "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/plasmid.1.rna.fna.gz"
+ try.res <- try(read.fasta(myurl))
+ try.res
+\end{Sinput}
+\begin{Soutput}
+[1] "Error in read.fasta(myurl) : no line starting with a > character found\n"
+attr(,"class")
+[1] "try-error"
+attr(,"condition")
+<simpleError in read.fasta(myurl): no line starting with a > character found>
+\end{Soutput}
+\end{Schunk}
+
+A simple workaround is to wrap the URL connection in \texttt{gzcon()}:
+
+\begin{Schunk}
+\begin{Sinput}
+ myseq <- read.fasta(gzcon(url(myurl)))
+ getName(myseq)
+\end{Sinput}
+\begin{Soutput}
+[1] "gi|470467018|ref|NR_074151.1|" "gi|444303868|ref|NR_074290.1|"
+[3] "gi|452192228|ref|NR_075742.1|" "gi|451991842|ref|NR_075394.1|"
+[5] "gi|451991838|ref|NR_075390.1|" "gi|444303919|ref|NR_074342.1|"
+[7] "gi|470486111|ref|NR_076736.1|" "gi|470480648|ref|NR_076426.1|"
+[9] "gi|470478007|ref|NR_076423.1|"
+\end{Soutput}
+\end{Schunk}
+
 \subsection{The function \texttt{write.fasta()}}
 
 This function writes sequences to a file in FASTA format.
@@ -507,7 +590,7 @@
 \end{Sinput}
 \begin{Soutput}
    user  system elapsed 
-  3.908   0.035   3.948 
+  3.827   0.036   3.863 
 \end{Soutput}
 \end{Schunk}
 
@@ -528,7 +611,7 @@
 \end{Sinput}
 \begin{Soutput}
    user  system elapsed 
-  0.164   0.002   0.167 
+  0.161   0.002   0.162 
 \end{Soutput}
 \end{Schunk}
 
@@ -1566,7 +1649,7 @@
 There were two compilation steps:
 
 \begin{itemize}
-  \item \Rlogo{} compilation time was: Thu Jun  2 14:42:23 2016
+  \item \Rlogo{} compilation time was: Thu Jun  2 15:58:57 2016
   \item \LaTeX{} compilation time was: \today
 \end{itemize}
 


