- About
- FAQ
- Features
- Minimum Supported Compilers
- Acknowledgments
- Example
- API
- Problems
- Benchmarks
- Build All
CSV_co is a C++20 coroutine-driven, callback-providing and sane CSV data reader (parser). It aims to comply with the RFC 4180 standard, since it was conceived to handle field selection carefully. The following requirements are satisfied (a small example follows the list):
- Windows and Unix style line endings.
- Optional header line.
- Each row (record) must contain the same number of fields.
- A field can be enclosed in double quotes.
- If a field contains commas, line breaks, double quotes, then this field must be enclosed in double quotes.
- The double-quotes character in the field must be doubled.
- Extension: partly double-quoted fields (non-quoted string fields with arbitrarily double-quoted parts)
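For illustration (a made-up snippet, not taken from the repository), here is input that exercises these rules:

name,comment
"Smith, John","He said ""hi"" and left"
Jane,said ""hi"" and left

The first record's first field is enclosed in double quotes because it contains a comma; its second field shows doubled double quotes inside a quoted field; the second record's second field is only partly double-quoted, which the extension accepts.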
Why another parser?
Because I am not confident in how some other authors' parsers handle complicated CSV fields.
Why are you unquoting quoted fields?
Because quotes are for preprocessing, not for end users.
How fast is it?
Well, it seems far from being "the fastest". But look at Benchmarks.
- Memory-mapping CSV files.
- Several CSV data sources.
- Two modes of iteration.
- Callbacks for each field/cell (header's or value).
- Callbacks for new rows.
- String data type only, apply type casts on your own (see the conversion sketch after this list).
- Strongly typed (concept-based) reader template parameters.
- Tested.
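Since the reader delivers fields as strings only, numeric conversion is up to you. A minimal conversion sketch of one possible approach (not library API) using std::from_chars inside a value callback:

#include <charconv>

reader<trim_policy::alltrim> r ("price\n1299\n1499\n");
long long total {0};
r.run([](auto) { /* ignore the header field */ }
     ,[&](auto s) {
         long long v {};
         // every field arrives as a string, so convert it manually
         if (std::from_chars(s.data(), s.data() + s.size(), v).ec == std::errc{})
             total += v;   // total becomes 2798
     });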
- Linux
- GNU GCC 10.2 C++ compiler
- LLVM Clang 12.0 C++ compiler
- Windows
- Cygwin with GCC 10.2 C++ compiler
- Microsoft Visual Studio 2019 Update 9 (16.9.4) +
- MinGW with GCC 10.2 C++ compiler
To Andreas Fertig, for his coroutine tutorials and code, from which this project borrows heavily.
General scheme:
#include <csv_co/reader.hpp>
using namespace csv_co;
using reader_type = reader< trimming_policy >;
reader_type r( CSV_source );
r.run([](auto s) {
    // do something with field
});
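Here CSV_source stands for any of the supported CSV data sources (see the constructors in the API section below): a file path, in which case the file is memory-mapped, a std::string, or a plain C string:

reader_type from_file (std::filesystem::path("smallpop.csv")); // memory-mapped file
reader_type from_string (std::string{"a,b,c\n1,2,3\n"});       // in-memory string
reader_type from_literal ("a,b,c\n1,2,3\n");                   // C string literal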
Ready value iteration mode: save all fields to a container and view the data:
try {
    reader_type r(std::filesystem::path("smallpop.csv"));
    std::vector<cell_string> ram;
    ram.reserve(r.cols() * r.rows());
    r.valid().run( // check validity and run
        [](auto) {
            // ignore header fields
        }
        ,[&ram](auto s) {
            // save value fields
            ram.emplace_back(s);
        });
    // population of Southborough,MA:
    std::cout << ram[0] << ',' << ram[1] << ':' << ram[3] << '\n';
} catch (reader_type::exception const & e) {
    std::cout << e.what() << '\n';
}
Ready value iteration mode: use the new-row callback to facilitate filling a matrix:
try {
    reader<trim_policy::alltrim> r (std::filesystem::path("smallpop.csv"));
    some_matrix_class matrix (shape);
    auto c_row {-1}; // will be incremented automatically
    auto c_col {0u};
    // ignore header fields, obtain value fields, and trace rows:
    r.run([](auto) {}
        ,[&](auto s){ matrix[c_row][c_col++] = s; }
        ,[&]{ c_row++; c_col = 0; });
    // population of Southborough,MA
    std::cout << matrix[0][0] << ',' << matrix[0][1] << ':' << matrix[0][3] << '\n';
} catch (reader<trim_policy::alltrim>::exception const & e) {/* handler */}
String CSV source:
reader (R"("William said: ""I am going home, but someone always bothers me""","Movie ""Falling down""")")
    .run([](auto s) {
        assert(
            s == R"(William said: "I am going home, but someone always bothers me")"
            || s == R"(Movie "Falling down")"
        );
    });
In the examples above the parser works hard to select, prepare and provide every field. This is somewhat more time-consuming, especially if you are only interested in specific fields and would generally prefer to move forward faster. For that there is a lazier field-iteration option: the parser keeps the memory span corresponding to the current field and gives you the opportunity to read the value of that field into a container you provide. So the preferable way of doing things is right underneath.
Span iteration mode, get necessary fields:
// ignore header fields, obtain value fields, trace rows:
reader<...> r (...);
std::size_t row {0}, col {0};
r.valid().run_span(
    [](auto) {}
    ,[&](auto & s) { if (/* the column/row you need */) { cell_string value; s.read_value(value); } ++col; }
    ,[&]{ ++row; col = 0; });
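As a more concrete sketch (reusing smallpop.csv from the earlier examples and assuming, as there, that the population is the fourth field), collecting just that column could look like this:

try {
    reader<trim_policy::alltrim> r (std::filesystem::path("smallpop.csv"));
    std::vector<cell_string> populations;
    auto col {0u};
    r.valid().run_span(
        [](auto) {}                      // ignore header fields
        ,[&](auto & s) {                 // value fields arrive as cell_span
            if (col++ == 3) {            // materialize only the population column
                cell_string value;
                s.read_value(value);
                populations.emplace_back(std::move(value));
            }
        }
        ,[&]{ col = 0; });               // new row: reset the column counter
} catch (std::exception const & e) { /* handler */ }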
Public API available:
using cell_string = std::basic_string<char, std::char_traits<char>, allocator<char>>;

template <TrimPolicyConcept TrimPolicy = trim_policy::no_trimming
    , QuoteConcept Quote = double_quotes
    , DelimiterConcept Delimiter = comma_delimiter>
class reader {
public:
    // Constructors
    explicit reader(std::filesystem::path const & csv_src);

    template <template<class> class Alloc=std::allocator>
    explicit reader(std::basic_string<char,std::char_traits<char>,Alloc<char>> const & csv_src);

    explicit reader(const char * csv_src);

    // csv_co::reader is a movable type
    reader (reader && other) noexcept = default;
    reader & operator=(reader && other) noexcept = default;

    // Shape
    [[nodiscard]] std::size_t cols() const noexcept;
    [[nodiscard]] std::size_t rows() const noexcept;

    // Validation
    [[nodiscard]] reader& valid();

    // Parsing
    void run(value_field_cb_t, new_row_cb_t nrc=[]{}) const;
    void run(header_field_cb_t, value_field_cb_t, new_row_cb_t nrc=[]{}) const;
    void run_span(value_field_span_cb_t, new_row_cb_t nrc=[]{}) const;
    void run_span(header_field_span_cb_t, value_field_span_cb_t, new_row_cb_t nrc=[]{}) const;

    // Reading field values within run_span() callbacks
    class cell_span {
    public:
        void read_value(auto & any_container_supporting_assignment_from_substring) const;
    };

    // Callback types
    using header_field_cb_t = std::function <void (std::string_view)>;
    using value_field_cb_t = std::function <void (std::string_view)>;
    using header_field_span_cb_t = std::function <void (cell_span const &)>;
    using value_field_span_cb_t = std::function <void (cell_span const &)>;
    using new_row_cb_t = std::function <void ()>;

    // Exception type
    struct exception : public std::runtime_error {
        // Constructor
        template <typename ... Types>
        explicit constexpr exception(Types ... args);
    };
};
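Putting the template parameters together: the defaults give an RFC 4180 comma/double-quote reader with no trimming, and the policies can be spelled out explicitly. A short sketch using only the names that appear above (custom policies just have to satisfy the corresponding concepts):

// Equivalent to reader<>: no trimming, double quotes, comma delimiter
using default_reader = reader<trim_policy::no_trimming, double_quotes, comma_delimiter>;

// Trim whitespace around every field, keep the remaining defaults
using trimming_reader = reader<trim_policy::alltrim>;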
- Frequent coroutine switching, caused by the current byte-by-byte parsing protocol, leads to time-consuming overhead. Another approach would bring parsing clarity at the expense of speed; it will probably be implemented sometime. Pull requests are welcome.
- The run() and run_span() overloads that deal with header fields add a 3-4% performance cost. Patches are welcome.
Benchmarking source code is in the benchmark folder. It measures, in span iteration mode, the average execution time (after a warm-up run) for CSV_co to memory-map the input CSV file and iterate over every field in it.
cd benchmark
clang++ -I../include -O3 -march=native -std=c++20 -stdlib=libc++ ./spanbench.cpp
cd ..
Type | Value |
---|---|
Processor | Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz |
Installed RAM | 16 GB |
HDD/SSD | USB hard drive |
OS | Linux slax 5.10.92 (Debian Linux 11 (bullseye)) |
C++ Compiler | Clang 15.0.6 |
Dataset | File Size | Rows | Cols | Cells | Time |
---|---|---|---|---|---|
Benchmark folder's game.csv | 2.6M | 100'000 | 6 | 600'000 | 0.012s |
Denver Crime Data | 102M | 399'573 | 20 | 7'991'460 | 0.524s |
2015 Flight Delays and Cancellations | 565M | 5'819'080 | 31 | 180'391'480 | 3.55s |
Conventional:
mkdir build && cd build
cmake ..
make -j 4
Best (if you have clang, libc++-dev, libc++abi-dev packages or their analogs installed):
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=clang++ -D_STDLIB_LIBCPP=ON ..
make -j 4
Check for memory safety (if you have clang sanitizers):
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=clang++ -D_SANITY_CHECK=ON -DCMAKE_BUILD_TYPE=Debug ..
make -j 4
MSVC (in x64 Native Tools Command Prompt):
mkdir build && cd build
cmake -G "Visual Studio 17 2022" -DCMAKE_BUILD_TYPE=Release ..
msbuild /property:Configuration=Release csv_co.sln