Home
stream-csv-as-json is a minimal micro-library, which provides a set of lightweight stream components to process huge CSV files with a minimal memory footprint. It is a companion library for stream-json, fully compatible with it, and can use advanced features provided by that library.
It can:
- Parse CSV files compliant with RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.
  - Properly handles values with newlines in them.
  - Supports a relaxed definition of "newline".
  - Supports a customizable separator (see the sketch after this list).
- Parse CSV files far exceeding available memory.
  - Even individual primitive data items (strings) can be streamed piece-wise.
  - Processing humongous files can take minutes and even hours. Shaving even a microsecond from each operation can save a lot of time waiting for results. That's why all stream-csv-as-json and stream-json components were meticulously optimized.
  - See Performance for hints on speeding up pipelines.
- Stream using a SAX-inspired event-based API.
- Provide utilities to handle huge database dumps.
- Follow the conventions of the no-dependency micro-library stream-chain.
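For example, a file that uses semicolons instead of commas could be handled by passing an option to the parser factory. This is only a sketch: the option name separator is an assumption here, so check the Parser documentation for the exact options supported by your version.

```js
const fs = require('fs');
const {parser} = require('stream-csv-as-json');

// A sketch of parsing a semicolon-separated file.
// NOTE: the option name `separator` is an assumption; consult the
// Parser documentation for the options supported by your version.
const pipeline = fs
  .createReadStream('data.csv')
  .pipe(parser({separator: ';'}));
```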
It was meant to be a set of building blocks for data processing pipelines organized around CSV, JSON and JavaScript objects. Users can easily create their own "blocks" using provided facilities.
This is an overview, which can be used as a cheat sheet. Click on individual components to see detailed API documentation with examples.
The main module returns a factory function, which produces instances of Parser decorated with emit().
The heart of the package is Parser — a streaming CSV parser, which consumes text and produces a stream of tokens.
```js
const fs = require('fs');
const {parser} = require('stream-csv-as-json');

const pipeline = fs.createReadStream('data.csv').pipe(parser());
```
Each row is logically represented by JSON tokens as an array of string values.
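For instance, assuming default parser options (which pack each field into a value token), a row like 1,2,3 arrives as a startArray token, string tokens for each field, and an endArray token. Because the factory from the main module decorates Parser with emit(), these tokens can also be consumed as events; a minimal sketch:

```js
const fs = require('fs');
const {parser} = require('stream-csv-as-json');

// A sketch, assuming default packing options: emit() makes the parser
// emit an event per token, so rows can be reassembled SAX-style.
const pipeline = fs.createReadStream('data.csv').pipe(parser());

let row = [];
pipeline.on('startArray', () => (row = []));          // a new row begins
pipeline.on('stringValue', value => row.push(value)); // a packed field value
pipeline.on('endArray', () => console.log(row));      // the row is complete
```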
A stream produced by the CSV parser is compliant with the JSON token stream. All data processing facilities of stream-json can be used on it: filters, streamers, and so on.
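Since every row comes out of the CSV parser as a top-level array of strings, stream-json's StreamValues streamer, for example, can assemble rows back into regular JavaScript arrays. A minimal sketch (the file name is a placeholder):

```js
const fs = require('fs');
const {chain} = require('stream-chain');
const {parser} = require('stream-csv-as-json');
const {streamValues} = require('stream-json/streamers/StreamValues');

// A sketch: every CSV row arrives as a top-level array of strings,
// so StreamValues can assemble rows back into JavaScript arrays.
const pipeline = chain([
  fs.createReadStream('data.csv'),
  parser(),
  streamValues(),
  data => data.value // e.g. ['1', '2', '3'] for the row "1,2,3"
]);

pipeline.on('data', row => console.log(row));
```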
Classes and functions to make streaming data processing enjoyable:
- Stringer is a Transform stream. It receives a token stream and converts it to the CSV format. It is very useful when you want to edit a stream with filters and custom code, and save it back to a file.
  ```js
  const {chain} = require('stream-chain');
  const {parser} = require('stream-csv-as-json');
  const {stringer} = require('stream-csv-as-json/Stringer');
  const {pick} = require('stream-json/filters/Pick');
  const fs = require('fs');
  const zlib = require('zlib');

  chain([
    fs.createReadStream('data.csv.gz'),
    zlib.createGunzip(),
    parser(),
    pick({filter: 'data'}),
    stringer(),
    zlib.createGzip(),
    fs.createWriteStream('edited.csv.gz')
  ]);
  ```
- AsObjects is a Transform stream. It consumes a stream produced by Parser (a row as an array of string values), uses the first row as a header, and reformats each subsequent row from an array into an object, using the header values as keys for the corresponding fields.

  ```js
  const {chain} = require('stream-chain');
  const {parser} = require('stream-csv-as-json');
  const {asObjects} = require('stream-csv-as-json/AsObjects');
  const fs = require('fs');
  const zlib = require('zlib');

  chain([
    fs.createReadStream('data.csv.gz'),
    zlib.createGunzip(),
    // data:
    // a,b,c
    // 1,2,3
    parser(),
    // ['a', 'b', 'c']
    // ['1', '2', '3']
    asObjects()
    // {a: '1', b: '2', c: '3'}
  ]);
  ```
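Combined with stream-json's StreamValues, this gives a compact way to turn a CSV file into a stream of plain JavaScript objects. A minimal sketch, assuming an uncompressed data.csv with a header row:

```js
const fs = require('fs');
const {chain} = require('stream-chain');
const {parser} = require('stream-csv-as-json');
const {asObjects} = require('stream-csv-as-json/AsObjects');
const {streamValues} = require('stream-json/streamers/StreamValues');

// A sketch of a complete pipeline: parse the CSV, use the header row
// for keys, and assemble each remaining row into a plain object.
const pipeline = chain([
  fs.createReadStream('data.csv'),
  parser(),
  asObjects(),
  streamValues(),
  data => data.value // e.g. {a: '1', b: '2', c: '3'}
]);

pipeline.on('data', row => console.log(row));
pipeline.on('end', () => console.log('done'));
```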
Performance considerations are discussed in a separate document dedicated to Performance.
The test file tests/sample.csv.gz is Master.csv from Lahman’s Baseball Database 2012. The file is copyrighted by Sean Lahman. It is used here under a Creative Commons Attribution-ShareAlike 3.0 Unported License. In order to test all features of the CSV parser, the file was minimally modified: row #1000 has a CRLF inserted in a value, row #1001 has a double quote inserted in a value, and then the file was compressed with gzip.