Performance improvements #9

Open
hadley opened this issue Dec 29, 2017 · 4 comments

Comments


hadley commented Dec 29, 2017

Currently read_csvy reads the complete file using readLines() - this means it will be slow for large files. I'd recommend (and can possibly help with) writing a C/C++ read_yaml_header() function that would parse from the first --- to the next ---. This metadata could then be used to generate the column specification that's passed to read.csv(), read_csv(), and fread(). (Will probably still need some additional cleanup afterwards).
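
As a rough sketch of that last step, the parsed header could be turned into a colClasses vector for read.csv(). The metadata layout (a fields list with name and type entries), the type_map table, and the helper name metadata_to_colclasses() are all assumptions made for illustration, not the package's actual API.

```r
# Hypothetical metadata, shaped the way a parsed YAML header might be (assumed layout).
metadata <- list(fields = list(
  list(name = "Sepal.Length", type = "number"),
  list(name = "Species", type = "string")
))

# Assumed mapping from header types to R classes. Unknown types become NA,
# which tells read.csv() to guess the class for that column.
type_map <- c(number = "numeric", integer = "integer",
              string = "character", boolean = "logical")

metadata_to_colclasses <- function(metadata) {
  types <- vapply(metadata$fields, function(f) f$type, character(1))
  classes <- unname(type_map[types])
  names(classes) <- vapply(metadata$fields, function(f) f$name, character(1))
  classes
}

col_classes <- metadata_to_colclasses(metadata)

# Inline CSV text stands in for the data part of a CSVY file.
csv_text <- "Sepal.Length,Species\n5.1,setosa\n4.9,setosa"
dat <- read.csv(text = csv_text, colClasses = col_classes)
str(dat)
```

Because the vector is named, read.csv() matches classes to columns by name, so the order of fields in the header would not need to match the column order in the data.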

Owner

leeper commented Dec 31, 2017

That would be awesome.

Contributor

ashiklom commented Mar 14, 2018

Not in C, but a first pass at this might look something like the following. It relies on the fact that, given an open connection con <- file("/path/to/file", "r"), each call to readLines(con, n = 1) returns one line and advances the connection to the next, so the file is read one line at a time.

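get_yaml_header <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  # Read only the first line; the length check guards against an empty file
  first_line <- readLines(con, n = 1)
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) {
    warning("No YAML header found.")
    return(NULL)
  }
  tag_vec <- character()
  repeat {
    curr_line <- readLines(con, n = 1)
    if (length(curr_line) == 0) {
      # Reached the end of the file without a closing delimiter
      warning("YAML header is not closed.")
      return(NULL)
    }
    # Stop at the closing delimiter without storing it
    if (grepl(yaml_rxp, curr_line)) break
    tag_vec <- c(tag_vec, curr_line)
  }
  # Header lines only; the opening and closing delimiters are excluded
  tag_vec
}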

parse_yaml_header <- function(yaml_header) {
  # If every line is commented out, strip the single leading "#" before parsing
  if (all(grepl("^#", yaml_header))) {
    yaml_header <- sub("^#", "", yaml_header)
  }
  yaml::yaml.load(paste(yaml_header, collapse = "\n"))
}

raw_header <- get_yaml_header("iris.csvy")
metadata <- parse_yaml_header(raw_header)
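
To make the connection behaviour concrete, here is a minimal base R sketch. The temporary file and its contents are made up for illustration; the point is only that successive readLines() calls on an open connection continue from where the previous call stopped.

```r
# Write a tiny three-line file to a temporary path (illustrative only).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first", "second", "third"), tmp)

con <- file(tmp, "r")
line1 <- readLines(con, n = 1)  # first line of the file
line2 <- readLines(con, n = 1)  # continues from where the last call stopped
rest  <- readLines(con)         # everything that is left
eof   <- readLines(con, n = 1)  # zero-length result once the file is exhausted
close(con)
unlink(tmp)
```

Passing the file name instead of an open connection would reopen the file on every call, so readLines(tmp, n = 1) would return the first line each time. The zero-length result at the end of the file is what a loop over the header would need to check for to avoid running past an unclosed header.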

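You should then be able to do something like csv_file <- fread(filename, skip = length(raw_header) + 2, ...), assuming raw_header holds only the lines between the two --- delimiters (the + 2 accounts for the opening and closing --- lines).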
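
A small sketch of that skip arithmetic, using base read.csv() instead of fread() so it runs without data.table (fread() also takes a skip argument). The file contents and the hard-coded raw_header are made up for illustration; raw_header stands in for the lines between the two delimiters.

```r
# Build a small CSVY-style file in a temporary location (illustrative contents).
tmp <- tempfile(fileext = ".csvy")
writeLines(c(
  "---",
  "name: example",
  "fields:",
  "  - name: x",
  "---",
  "x,y",
  "1,a",
  "2,b"
), tmp)

# Assumed: the header lines between the delimiters, delimiters not included.
raw_header <- c("name: example", "fields:", "  - name: x")

# Header lines plus the opening and closing "---".
n_skip <- length(raw_header) + 2
body <- read.csv(tmp, skip = n_skip)
unlink(tmp)
```

With skip set this way, the first line the CSV reader sees is the column header row, so only the data part of the file is parsed as CSV.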

If this looks OK, I can try to put together a more complete pull request later this week.

Owner

leeper commented Mar 14, 2018

That would be awesome!

ashiklom added a commit to ashiklom/csvy that referenced this issue Mar 15, 2018
Owner

leeper commented Jun 11, 2018

Merging of #15 is done. We could do further C-level fixes, but this seems good for the time being.
