Corpus form: inconsistency between CSV sniffer and CSV reader #2003

@lukavdplas

When you upload a CSV file in the corpus form, Textcavator extracts column info from the file. The code for this uses the pandas library to parse the file, while the CSVReader is based on the csv module from the Python standard library. The CSVReader is responsible for extracting the content, so a file can appear to be parsed as intended in step 2 of the form (data upload) and still be parsed incorrectly in step 4 (indexing).

This also relates to #1998: if you show a data preview based on the pandas output, it may not match the output in step 4.

Suggested solution: rewrite backend/addcorpus/json_corpora/csv_field_info.py to use the csv library instead of pandas.
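
To illustrate the suggested direction, here is a minimal sketch of column info extraction built only on the csv module. It is not the current contents of csv_field_info.py: the function names `column_info` and `infer_type`, the type labels, and the sample data are all hypothetical, and the type guessing is deliberately coarse.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Guess a coarse column type from raw strings (hypothetical helper)."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "text"
    # Try the strictest type first; fall through on the first failed cast.
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "text"


def column_info(text: str, delimiter: str = ",") -> Dict[str, str]:
    """Map each column name to a guessed type, parsing with csv.DictReader."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = list(reader)
    # Short rows yield None for missing fields; treat those as empty strings.
    return {
        name: infer_type([row[name] or "" for row in rows])
        for name in (reader.fieldnames or [])
    }


# Made-up sample data, for illustration only.
sample = "title,year,rating\nHamlet,1603,4.5\nEmma,1815,\n"
info = column_info(sample)
```

Because the rows come from the same standard library parser that the CSVReader is based on, quoting and delimiter handling in this step would follow the same rules as in indexing, provided both are given the same delimiter and dialect settings.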

Alternative solutions:

  • Try to find a configuration for pandas and/or csv so the output in these steps is always consistent. This is fragile, since it is hard to guarantee that the two parsers agree on every file.
  • Rewrite the CSVReader in ianalyzer_readers so it's based on pandas instead of csv. Not preferred, because that class is already mission-critical and using pandas would not improve it.
