
Problem:

I am getting the following error while trying to read a CSV file with R:
in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : eof within quoted string

2 Answers


Solution:

Reading a file with free-text content through read.csv() is fragile. Setting quote="" to disable quoting is only a temporary workaround: it helps with stray (unpaired) quotation marks, but other things, such as certain special characters, can trigger the same warning.

The permanent solution with read.csv() is to find out which special characters are responsible and remove them with a regular expression.
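
A minimal sketch of that cleanup, assuming the troublesome characters are stray double quotes. The sample lines and the variable names (lines, bad, cleaned) are made up for illustration; for a real file, the lines would come from readLines() instead.

```r
# Hypothetical sample: the third line contains an unpaired double quote
lines <- c('id,title',
           '1,"Balanced" title',
           '2,Unbalanced " title',
           '3,plain')

# Count the double quotes on each line; an odd count marks a suspect line
n_quotes <- lengths(regmatches(lines, gregexpr('"', lines, fixed = TRUE)))
bad <- which(n_quotes %% 2 == 1)

# Remove every double quote, then parse the cleaned text
cleaned <- gsub('"', '', lines, fixed = TRUE)
tab <- read.csv(text = cleaned, stringsAsFactors = FALSE)
```

Inspecting lines[bad] before cleaning shows which rows were affected, which helps decide whether dropping the character is acceptable for the data at hand.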

Alternatively, install the {data.table} package and read the file with fread(). It is much faster and usually does not raise this EOF warning. Note that fread() returns a data.table rather than a plain data.frame. The data.table class has many good features (and it inherits from data.frame), but you can convert the result with as.data.frame() if needed.
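
A short sketch of the fread() route, assuming the data.table package is installed. The temporary file and its contents are invented for illustration, and whether fread() copes with a particular malformed file depends on the file itself.

```r
# Assumes data.table is installed; the sample file is made up for illustration
if (requireNamespace("data.table", quietly = TRUE)) {
  path <- tempfile(fileext = ".csv")
  writeLines(c("id,title", "1,plain", "2,stray \" quote", "3,other"), path)

  dt <- data.table::fread(path)                      # returns a data.table
  df <- data.table::fread(path, data.table = FALSE)  # returns a plain data.frame
  plain <- as.data.frame(dt)                         # convert after the fact

  unlink(path)
}
```

Passing data.table = FALSE is an alternative to calling as.data.frame() afterwards when only a plain data.frame is wanted.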


Solution:

STEP 1: download and unzip the file

# download the file
site <- "http://www.informatics.jax.org/downloads/mgigff"
file <- "MGI.20170803.gff3.gz"
url <- paste0(site, "/", file)
if(!file.exists(file)) download.file(url, file)

# unzip to a temporary file
file <- sub("\\.gz$", "", file)  # escape the dot so only a literal ".gz" suffix is stripped
tmpfile <- tempfile()
remove_tmpfile <- FALSE
if(!file.exists(file)) { # need to unzip
    system(paste0("gunzip -c ", file, ".gz > ", tmpfile))
    remove_tmpfile <- TRUE
    file <- tmpfile
}
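
As an aside, base R can read gzip-compressed text through a gzfile() connection, so the external gunzip call is not strictly required. The sketch below is self-contained and uses a made-up two-line sample rather than the MGI file; gz_path and tab_gz are illustrative names.

```r
# Hypothetical two-line tab-separated sample, written to a temporary .gz file
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, "w")
writeLines(c("chr1\tgene\t100", "chr2\tgene\t200"), con)
close(con)

# gzfile() decompresses on the fly while read.table() reads from the connection
tab_gz <- read.table(gzfile(gz_path), sep = "\t", header = FALSE,
                     quote = "", stringsAsFactors = FALSE)

unlink(gz_path)  # clean up the temporary file
```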

 

STEP 2: read it into R with read.table()

tab <- read.table(file, sep="\t", header=FALSE, comment.char="#",
                  na.strings=".", stringsAsFactors=FALSE)

This gives a warning message:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

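read.delim() vs read.table()

read.delim() is a thin wrapper around read.table() with different defaults:

> read.delim
function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)

Its quote default is only the double quote, whereas read.table() also treats the apostrophe as a quote character (quote = "\"'"). Disabling quoting altogether in the read.table() call, with fill=FALSE so that ragged rows are not silently padded, avoids the warning:

tab <- read.table(file, sep="\t", header=FALSE, comment.char="#",
                  na.strings=".", stringsAsFactors=FALSE,
                  quote="", fill=FALSE)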
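
To illustrate why disabling quoting helps, here is a small sketch with an invented tab-separated sample in which one field contains an apostrophe. The text and the names txt and tab_noquote are assumptions for the example, not part of the MGI data.

```r
# Hypothetical tab-separated sample; the apostrophe in "it's" is the troublemaker
txt <- "name\tvalue\nit's\t1\nplain\t2"

# With quote = "" the apostrophe is treated as an ordinary character,
# so each line is split on tabs only
tab_noquote <- read.table(text = txt, sep = "\t", header = TRUE,
                          quote = "", fill = FALSE, stringsAsFactors = FALSE)
```

With read.table()'s default quote setting, the same apostrophe would open a quoted string that never closes, which is the kind of input that can lead to the EOF warning.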

With read.csv() the fix is the same: you need to disable quoting.

cit <- read.csv("citations.CSV", quote = "", 
                 row.names = NULL, 
                 stringsAsFactors = FALSE)

str(cit)
## 'data.frame':    112543 obs. of  13 variables:
##  $ row.names    : chr  "10.2307/675394" "10.2307/30007362" "10.2307/4254931" "10.2307/20537934" ...
##  $ id           : chr  "10.2307/675394\t" "10.2307/30007362\t" "10.2307/4254931\t" "10.2307/20537934\t" ...
##  $ doi          : chr  "Archaeological Inference and Inductive Confirmation\t" "Sound and Sense in Cath Almaine\t" "Oak Galls Preserved by the Eruption of Mount Vesuvius in A.D. 79_ and Their Probable Use\t" "The Arts Four Thousand Years Ago\t" ...
##  $ title        : chr  "Bruce D. Smith\t" "Tomás Ó Cathasaigh\t" "Hiram G. Larew\t" "\t" ...
##  $ author       : chr  "American Anthropologist\t" "Ériu\t" "Economic Botany\t" "The Illustrated Magazine of Art\t" ...
##  $ journaltitle : chr  "79\t" "54\t" "41\t" "1\t" ...
##  $ volume       : chr  "3\t" "\t" "1\t" "3\t" ...
##  $ issue        : chr  "1977-09-01T00:00:00Z\t" "2004-01-01T00:00:00Z\t" "1987-01-01T00:00:00Z\t" "1853-01-01T00:00:00Z\t" ...
##  $ pubdate      : chr  "pp. 598-617\t" "pp. 41-47\t" "pp. 33-40\t" "pp. 171-172\t" ...
##  $ pagerange    : chr  "American Anthropological Association\tWiley\t" "Royal Irish Academy\t" "New York Botanical Garden Press\tSpringer\t" "\t" ...
##  $ publisher    : chr  "fla\t" "fla\t" "fla\t" "fla\t" ...
##  $ type         : logi  NA NA NA NA NA NA ...
##  $ reviewed.work: logi  NA NA NA NA NA NA ...

I think the problem is caused by lines like this one (note the quoted "Thorn" and "Minus"):

readLines("citations.CSV")[82]
[1] "10.2307/3642839,10.2307/3642839\t,\"Thorn\" and \"Minus\" in Hieroglyphic Luvian Orthography\t,H. Craig Melchert\t,Anatolian Studies\t,38\t,\t,1988-01-01T00:00:00Z\t,pp. 29-42\t,British Institute at Ankara\t,fla\t,\t,"

As pointed out above and in the R help, the fix is simply to disable quoting altogether by adding:

    quote = "" 

I also ran into this problem, and was able to work around a similar EOF error using:

read.table("....csv", sep=",", ...)

The readr package can also read such a file without this issue:

install.packages('readr')
library(readr)
readr::read_csv('yourfile.csv')

 
