@@ -61,7 +61,7 @@ few MBs. 💾
pragmatic approach towards this especially with regards to quoting and line
ends. See section [RFC-4180](#rfc-4180).

- [Example](#Example) | [Naming and Terminology](#naming-and-terminology) | [API](#application-programming-interface-api) | [Limitations and Constraints](#limitations-and-constraints) | [Comparison Benchmarks](#comparison-benchmarks) | [Example Catalogue](#example-catalogue) | [RFC-4180](#rfc-4180) | [FAQ](#frequently-asked-questions-faq) | [Public API Reference](#public-api-reference)
+ [Example](#example) | [Naming and Terminology](#naming-and-terminology) | [API](#application-programming-interface-api) | [Limitations and Constraints](#limitations-and-constraints) | [Comparison Benchmarks](#comparison-benchmarks) | [Example Catalogue](#example-catalogue) | [RFC-4180](#rfc-4180) | [FAQ](#frequently-asked-questions-faq) | [Public API Reference](#public-api-reference)

## Example
```csharp
@@ -253,7 +253,7 @@ That is, to use `SepReader` follow the points below:
var colNames = header.NamesStarting("GT_");
var colIndices = header.IndicesOf(colNames);
```
- 1. Enumerate rows. One row at a time.
+ 1. Enumerate rows. One row at a time.
1. Access a column by name or index. Or access multiple columns with names and
indices. `Sep` internally handles pooled allocation and reuse of arrays for
multiple columns.
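
As a rough illustration of the points above, the sketch below enumerates rows and parses all `GT_` columns of each row in one call. It assumes the Sep NuGet package is referenced and that `Sep.Reader().FromText(...)`, `Header.NamesStarting(...)` and the multi-column `Parse<double>()` are available as in the rest of this README; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using nietras.SeparatedValues; // Assumption: the Sep NuGet package is referenced.

public static class ReadColumnsSketch
{
    // Hypothetical helper: sums all columns whose name starts with "GT_",
    // producing one sum per row.
    public static double[] SumGroundTruthPerRow(string text)
    {
        // The separator is inferred from the header row.
        using var reader = Sep.Reader().FromText(text);
        var colNames = reader.Header.NamesStarting("GT_");

        var sums = new List<double>();
        foreach (var row in reader) // One row at a time.
        {
            // Multiple columns parsed in one call. Sep pools the backing
            // array, so consume the span before moving to the next row.
            var values = row[colNames].Parse<double>();
            var sum = 0.0;
            foreach (var v in values) { sum += v; }
            sums.Add(sum);
        }
        return sums.ToArray();
    }
}
```

Only the per-row sum is kept here; the parsed span itself is never stored, since it may be reused for the next row.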
@@ -398,7 +398,7 @@ If you are hovering over `row` then this will show something like:
```
2:[5..9] = "B;\"Apple\r\nBanana\r\nOrange\r\nPear\""
```
- This has the format shown below.
+ This has the format shown below.
```
<ROWINDEX>:[<LINENUMBERRANGE>] = "<ROW>"
```
@@ -553,7 +553,7 @@ CollectionAssert.AreEqual(expected, actual);
This means you are still parsing the double (which is orders of magnitude slower
than getting just the key) for all rows. Imagine if this was an array of floating
points or similar. Not only would you then be parsing a lot of values, you would
- also be allocating 99x arrays that aren't used after filtering with `Where`.
+ also be allocating 99x arrays that aren't used after filtering with `Where`.

Instead, you should focus on how to express the enumeration in a way that is
both efficient and easy to read. For example, the above could be rewritten as:
@@ -709,7 +709,7 @@ That is, to use `SepWriter` follow the points below:
1. Use `Set` to set the column value either as a `ReadOnlySpan<char>`, `string`
or via an interpolated string. Or use `Format<T>` where `T : IFormattable`
to format `T` to the column value.
- 1. Row is written when `Dispose` is called on the row.
+ 1. Row is written when `Dispose` is called on the row.
> Note this is to allow a row to be defined flexibly with both column
> removal, moves and renames in the future. This is not yet supported.
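
The points above can be sketched as follows. This is a minimal example under stated assumptions: the Sep NuGet package is referenced, `Sep.Writer().ToText()` creates an in-memory writer, and `NewRow`, `Set` and `Format` behave as described in this section. The class, method and column names are made up for illustration.

```csharp
using System;
using nietras.SeparatedValues; // Assumption: the Sep NuGet package is referenced.

public static class WriteRowsSketch
{
    // Hypothetical helper: writes a header and two rows to in-memory text.
    public static string WriteTwoRows()
    {
        using var writer = Sep.Writer().ToText();

        using (var row = writer.NewRow())
        {
            row["Name"].Set("Apple");     // Set from a string.
            row["Count"].Format(3);       // Format a value to the column.
        } // The row is written here, when Dispose is called.

        using (var row = writer.NewRow())
        {
            row["Name"].Set($"Pear-{2}"); // Set via an interpolated string.
            row["Count"].Format(5);
        }

        return writer.ToString();
    }
}
```

Scoping each row in its own `using` block makes the point at which the row is written explicit.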
@@ -738,10 +738,10 @@ public bool WriteHeader { get; init; } = true;
Sep is designed to be minimal and fast. As such, it has some limitations and
constraints, since these are not needed for the initial intended usage:

- * Automatic escaping and unescaping quotes is not supported. Use
+ * Automatic escaping and unescaping quotes is not supported. Use
[`Trim`](https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.trim)
extension method to remove surrounding quotes, for example.
- * Comments `#` are not directly supported. You can skip a row by:
+ * Comments `#` are not directly supported. You can skip a row by:
```csharp
foreach (var row in reader)
{
@@ -753,28 +753,28 @@ constraints, since these are not needed for the initial intended usage:
}
```
This does not allow skipping a header row starting with `#` though.
- * `SepWriter` is not yet fully featured and one cannot skip writing a header
+ * `SepWriter` is not yet fully featured and one cannot skip writing a header
currently.
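
The two workarounds in the list above (trimming surrounding quotes and skipping comment rows) can be illustrated without Sep at all. The sketch below uses only the .NET base class library on plain strings and spans. It is not Sep's API, and the helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class QuotesAndCommentsSketch
{
    // Removes surrounding double quotes from a column value using
    // MemoryExtensions.Trim. Note this trims every leading and trailing
    // quote and does not unescape quotes inside the value.
    public static string TrimQuotes(ReadOnlySpan<char> column) =>
        column.Trim('"').ToString();

    // Returns the lines that do not start with the comment marker '#'.
    public static List<string> SkipComments(IEnumerable<string> lines)
    {
        var kept = new List<string>();
        foreach (var line in lines)
        {
            if (line.StartsWith('#')) { continue; }
            kept.Add(line);
        }
        return kept;
    }
}
```

With a reader, the same `Trim('"')` call would be applied to the span of a column, and the same starts-with check to the span of a row.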

## Comparison Benchmarks
To investigate the performance of Sep it is compared to:

- * [CsvHelper](https://github.com/JoshClose/csvhelper) - *the* most commonly
+ * [CsvHelper](https://github.com/JoshClose/csvhelper) - *the* most commonly
used CSV library with a staggering
![downloads](https://img.shields.io/nuget/dt/csvhelper) downloads on NuGet. Fully
featured and battle tested.
- * [Sylvan](https://github.com/MarkPflug/Sylvan) - is well-known and has
+ * [Sylvan](https://github.com/MarkPflug/Sylvan) - is well-known and has
previously been shown to be one of [the fastest CSV libraries for
parsing](https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers)
(Sep changes that 😉).
- * `ReadLine`/`WriteLine` - basic naive implementations that read line by line
+ * `ReadLine`/`WriteLine` - basic naive implementations that read line by line
and split on separator, while writing columns, separators and line endings
directly. Does not handle quotes or similar correctly.

All benchmarks are run from/to memory either with:

- * `StringReader` or `StreamReader + MemoryStream`
- * `StringWriter` or `StreamWriter + MemoryStream`
+ * `StringReader` or `StreamReader + MemoryStream`
+ * `StringWriter` or `StreamWriter + MemoryStream`

This is to avoid confounding factors from reading from or writing to disk.
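
To make the in-memory setup and the naive `ReadLine` baseline concrete, here is a small sketch using only the .NET base class library. It is not the actual benchmark code; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class InMemoryBaselineSketch
{
    // Creates a reader over text held in memory, either directly over the
    // string or over its UTF-8 bytes in a MemoryStream. No disk is involved.
    public static TextReader CreateReader(string text, bool useStream) =>
        useStream
            ? new StreamReader(new MemoryStream(Encoding.UTF8.GetBytes(text)))
            : new StringReader(text);

    // Naive baseline: read line by line and split on the separator.
    // Quotes are not handled, so a separator inside quotes splits the column.
    public static List<string[]> ReadAllNaive(TextReader reader, char separator)
    {
        var rows = new List<string[]>();
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            rows.Add(line.Split(separator));
        }
        return rows;
    }
}
```

Each line and each split column allocates a new string, which is the allocation cost such a baseline pays compared to span-based parsing.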
@@ -807,6 +807,7 @@ than that. Or how many *times* more bytes are allocated in `Alloc Ratio`.

### Runtime and Platforms
The following runtime is used for benchmarking:
+
* `NET 8.0.X`

The following platforms are used for benchmarking:
@@ -830,25 +831,25 @@ The following platforms are used for benchmarking:
### Reader Comparison Benchmarks
The following reader scenarios are benchmarked:

- * [NCsvPerf](https://github.com/joelverhagen/NCsvPerf) from [The fastest CSV
+ * [NCsvPerf](https://github.com/joelverhagen/NCsvPerf) from [The fastest CSV
parser in
.NET](https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers)
- * [**Floats**](#floats-reader-comparison-benchmarks) as for example in machine learning.
+ * [**Floats**](#floats-reader-comparison-benchmarks) as for example in machine learning.

Details for each can be found in the following. However, for each of these, 3
different scopes are benchmarked to better ascertain the low-level performance
of each library and approach and what parts of the parsing consume the most
time:

- * **Row** - for this scope only the row is enumerated. That is, for Sep all
+ * **Row** - for this scope only the row is enumerated. That is, for Sep all
that is done is:
```csharp
foreach (var row in reader) { }
```
this should capture parsing both row and columns but without accessing these.
Note that some libraries (like Sylvan) will defer work for columns to when
these are accessed.
- * **Cols** - for this scope all rows and all columns are enumerated. If
+ * **Cols** - for this scope all rows and all columns are enumerated. If
possible columns are accessed as spans, if not as strings, which then might
mean a string has to be allocated. That is, for Sep this is:
```csharp
@@ -859,8 +860,8 @@ time:
        var span = row[i].Span;
    }
}
- ```
- * **XYZ** - finally the full scope is performed which is specific to each of
+ ```
+ * **XYZ** - finally the full scope is performed which is specific to each of
the scenarios.

Additionally, as Sep supports multi-threaded parsing via `ParallelEnumerate`
@@ -1094,7 +1095,7 @@ With `ParallelEnumerate` and server GC Sep is **>4x faster than Sylvan and up to
`NCsvPerf` does not examine performance in the face of quotes in the csv. This
is relevant since some libraries like Sylvan will revert to a slower (not SIMD
vectorized) parsing code path if they encounter quotes. Sep was designed to
- always use SIMD vectorization no matter what.
+ always use SIMD vectorization no matter what.

Since there are two extra `char`s to handle per column, it does have a
significant impact on performance, no matter what though. This is expected when
@@ -1312,7 +1313,7 @@ efficient `ParallelEnumerate` is, but bear in mind that this is for the case of
repeated micro-benchmark runs.

It is a testament to how good .NET and the .NET GC are that `ReadLine` is
- pretty good compared to CsvHelper regardless of allocating a lot of strings.
+ pretty good compared to CsvHelper regardless of allocating a lot of strings.

##### AMD.Ryzen.9.5950X - FloatsReader Benchmark Results (Sep 0.4.6.0, Sylvan 1.3.7.0, CsvHelper 31.0.2.15)
@@ -1531,7 +1532,8 @@ Ask questions on GitHub and this section will be expanded. :)
### SepWriter FAQ

## Links
- * [Publishing a NuGet package using GitHub and GitHub Actions](https://www.meziantou.net/publishing-a-nuget-package-following-best-practices-using-github.htm)
+
+ * [Publishing a NuGet package using GitHub and GitHub Actions](https://www.meziantou.net/publishing-a-nuget-package-following-best-practices-using-github.htm)

## Public API Reference
```csharp