Performance improvements #9

hadley · 2017-12-29T21:41:04Z

Currently read_csvy reads the complete file using readLines() - this means it will be slow for large files. I'd recommend (and can possibly help with) writing a C/C++ read_yaml_header() function that would parse from the first --- to the next ---. This metadata could then be used to generate the column specification that's passed to read.csv(), read_csv(), and fread(). (Will probably still need some additional cleanup afterwards).
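To make the column-specification idea concrete, here is a minimal sketch in base R. It assumes the parsed header is a list with a `fields` entry whose items carry `name` and `type`; that layout, the `type_map` lookup, and the sample data are all illustrative assumptions, not something csvy is confirmed to produce in this form.

```r
# Hypothetical parsed header: one entry per column (assumed structure).
metadata <- list(fields = list(
  list(name = "id", type = "integer"),
  list(name = "score", type = "number"),
  list(name = "label", type = "string")
))

# Assumed mapping from schema type names to R classes; extend as needed.
type_map <- c(integer = "integer", number = "numeric",
              string = "character", boolean = "logical")

field_names <- vapply(metadata$fields, function(f) f$name, character(1))
field_types <- vapply(metadata$fields, function(f) f$type, character(1))

# Named colClasses are matched to columns by name; a type missing from
# type_map becomes NA, which lets read.csv() guess that column.
col_classes <- setNames(unname(type_map[field_types]), field_names)

# Made-up CSV body standing in for the data part of a CSVY file.
csv_text <- "id,score,label\n1,2.5,a\n2,3.5,b"
dat <- read.csv(text = csv_text, colClasses = col_classes)
```

The same named vector could in principle be translated into the equivalent arguments for read_csv() or fread(), though each reader has its own way of spelling column types.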


leeper · 2017-12-31T14:17:53Z

That would be awesome.

ashiklom · 2018-03-14T14:20:06Z

Not in C, but a first pass at this might look something like this. It uses the fact that if con <- file("/path/to/file", "r") then readLines(con, n = 1) reads a file one line at a time, automatically advancing to the next line.

# Read only the YAML header: the lines strictly between the opening and
# closing "---" delimiters (optionally prefixed with "#").
get_yaml_header <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  first_line <- readLines(con, n = 1)
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) {
    warning("No YAML header found.")
    return(NULL)
  }
  tag_vec <- character()
  repeat {
    curr_line <- readLines(con, n = 1)
    if (length(curr_line) == 0) {
      # End of file reached before a closing delimiter was seen
      warning("YAML header is not terminated.")
      return(NULL)
    }
    if (grepl(yaml_rxp, curr_line)) break
    tag_vec <- c(tag_vec, curr_line)
  }
  tag_vec
}

parse_yaml_header <- function(yaml_header) {
  # If every header line is commented out with a leading "#", strip it first
  if (all(grepl("^#", yaml_header))) {
    yaml_header <- sub("^#", "", yaml_header)
  }
  yaml::yaml.load(paste(yaml_header, collapse = "\n"))
}

raw_header <- get_yaml_header("iris.csvy")
metadata <- parse_yaml_header(raw_header)

You should then be able to do something like csv_file <- fread(filename, skip = length(raw_header) + 2, ...), where raw_header is the value returned by get_yaml_header().
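As a rough end-to-end illustration of the skip idea, the sketch below uses only base R, with read.csv() standing in for fread(). The file contents and the names csvy_path, delim, and n_skip are made up for the example, and the header is collected without its two delimiter lines so that the + 2 accounts for them.

```r
# Build a tiny CSVY file in a temp location (contents are illustrative only).
csvy_path <- tempfile(fileext = ".csvy")
writeLines(c(
  "---",
  "name: example",
  "fields:",
  "  - name: x",
  "  - name: y",
  "---",
  "x,y",
  "1,a",
  "2,b"
), csvy_path)

# Read one line at a time until the closing delimiter, keeping only the
# lines strictly between the two "---" markers.
delim <- "^#?---[[:space:]]*$"
con <- file(csvy_path, "r")
opening <- readLines(con, n = 1)
stopifnot(grepl(delim, opening))
raw_header <- character()
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0 || grepl(delim, line)) break
  raw_header <- c(raw_header, line)
}
close(con)

# Skip the header body plus the opening and closing delimiter lines.
n_skip <- length(raw_header) + 2
dat <- read.csv(csvy_path, skip = n_skip, stringsAsFactors = FALSE)
```

Only the header lines are read through the connection here; the data reader then opens the file itself and starts at the CSV column names.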

If this looks OK, I can try to put together a more complete pull request later this week.

leeper · 2018-03-14T15:00:00Z

That would be awesome!


leeper · 2018-06-11T10:14:00Z

Merging of #15 is done. We could do further C-level fixes, but this seems good for the time being.

leeper added the enhancement label Dec 31, 2017

hadley mentioned this issue Dec 31, 2017

Better csvy support tidyverse/readr#770

Closed

ashiklom added a commit to ashiklom/csvy that referenced this issue Mar 15, 2018

Read metadata first, then read file

1a127ad

See issue leeper#9.

ashiklom mentioned this issue Mar 15, 2018

Read metadata first, then read file #15

Merged

6 tasks
