Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fails to detect delimiter #45

Open
ghost opened this issue Jun 2, 2020 · 4 comments
Open

Fails to detect delimiter #45

ghost opened this issue Jun 2, 2020 · 4 comments

Comments

@ghost
Copy link

ghost commented Jun 2, 2020

I'm trying to use the detect function to determine the delimiter used for data which can be either tab or comma delimited, however it always returns the delimiter as , even if there are no commas in the data.

@touv
Copy link
Collaborator

touv commented Jun 2, 2020

As you can see, the "detect" function is pretty basic, https://github.com/Inist-CNRS/node-csv-string/blob/master/src/CSV.ts#L57-L65

comma is the default value !

@ghost
Copy link
Author

ghost commented Jun 2, 2020

comma is the default value !

I understand that, however shouldn't it change if it detects a different delimiter (tab in my case)?

@pbrunisholz
Copy link

pbrunisholz commented Oct 6, 2020

Hello, I noticed that problem too.

I tried to detect the best separator in that test string using the detect function :
a,b|c;d
In a first time I wasn't expecting something, I was only waiting to see the result in order to understand the logic behind the selection of the "best" separator. The result was a comma , so I guessed it was selected as default because there was too many different separators.

Then I made another test :
"a,b";c;d
This time I expected a semi-colon ; as result since I escaped the content containing the comma with double-quotes but I was disapointed to get a comma , as result once more. So I guessed it was using the first character matching with the array of declared delimiters.

So I made a last test :
"a;b"|c|d
Since I removed the comma and put the semi-colon between double-quotes I was expecting the pipe | as result but I got the semi-colon ; which confirms my second hypothesis now.

Since the issue has been notified, do you plan to fix that or do you consider it as a correct behaviour ?
If you consider it as correct, can you please explain why ?

Thank you

EDIT :
I checked the detect function and indeed it was like I expected as second hypothesis.

The function defines the "best" delimiter as the first character found in the input which is also included in the array of accepted separators without considering the escaped content.

So for now I created my own detect function which use the same array of allowed delimiter except that I do not consider matching separators within a string escaped either by double or single quotes and it's not the first encounter but the most recurring one that is returned.
Since it can quickly become a time consuming process on large files, I only read the first row (generally the headers one) to detect, not the best, but the most probable delimiter used in the file.

Let me know if you're interested in my function, I'd be glad to share it with you. Have a good day !

@hossam-magdy
Copy link
Contributor

@pbrunisholz Thank you for spending time in this.

Your suggestion of the detect function seems interesting to me!

As you already implemented it, why don't you create a pull request with the changes? if you like! ... perhaps saving some time. Also it might be easier to discuss in a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants