cindex is silently ignoring some text files and there's no way to tell why #80

victor-sudakov · 2022-01-04T02:19:58Z

I have a couple of text files (UTF-8, with mostly ASCII and Cyrillic characters) which cindex/csearch ignore.

The worst problem is that I cannot tell why cindex ignores them: there is no "verbose" option to cindex. Maybe there is a character somewhere in the file that cindex does not like, but how do I tell?

iconv -f utf-8 -t utf-16 < text/book1.txt > /dev/null never complains so I presume the book1.txt file is valid UTF-8. But cindex excludes it from search.

codesearch version:
codesearch/oldstable,now 0.0~hg20120502-3+b11 amd64 on Debian 10.

The problem may be related to #26


dgryski · 2022-01-04T02:52:03Z

I believe there is also a line length limit that causes files to not be indexed.

You might have better luck switching to zoekt if possible.

victor-sudakov · 2022-01-04T07:08:09Z

> I believe there is also a line length limit that causes files to not be indexed.

I've just tried glimpse on it. glimpseindex skips this file too, but it can be forced to index it with glimpseindex -E.
There are long lines somewhere in the file indeed.

$ file text/book1.txt 
text/book1.txt: UTF-8 Unicode text, with very long lines

I should probably grep the text for long lines and see what comes out.
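One way to do that check, sketched in Python rather than with grep: the script below lists every line whose byte length exceeds a limit. The function name `long_lines` and the 2000-byte default are assumptions chosen for illustration; this thread does not confirm what limit, if any, cindex applies.

```python
import sys


def long_lines(path, limit=2000):
    """Return (line_number, byte_length) for every line longer than `limit` bytes.

    `limit` is an assumed threshold, not a value confirmed for cindex.
    """
    hits = []
    # Read as bytes so multi-byte UTF-8 characters (e.g. Cyrillic) count per byte.
    with open(path, "rb") as f:
        for number, line in enumerate(f, start=1):
            length = len(line.rstrip(b"\r\n"))
            if length > limit:
                hits.append((number, length))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, length in long_lines(sys.argv[1]):
        print(f"line {number}: {length} bytes")
```

Saved under a name of your choice and run with text/book1.txt as its argument, it prints one line per offender; printing nothing means no line exceeds the assumed limit.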

> You might have better luck switching to zoekt if possible.

Not in the Debian repo unfortunately.

victor-sudakov · 2022-01-04T09:21:30Z

I've found the offending line. It is not even long, but removing it allows indexing again. The whole line is below (yes it's the whole line by itself)

(HTTPConnectionPool(host='172.31.38.116', port=8008): Max retries exceeded with

I'm really surprised. There should be a switch to cindex either to disable the file-content heuristics or to report verbosely which one rejected a file.
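To illustrate what "exposing the heuristics verbosely" could look like, here is a minimal sketch that returns human-readable reasons a text-versus-binary check might reject a file. The function name `skip_reasons`, the particular checks (NUL bytes, invalid UTF-8, line length, number of distinct trigrams) and both thresholds are assumptions for illustration, not a description of cindex's actual code.

```python
def skip_reasons(data, max_line_len=2000, max_trigrams=20000):
    """Return reasons a content heuristic might reject `data` (bytes).

    All checks and thresholds here are assumed, not taken from cindex.
    """
    reasons = []
    if b"\x00" in data:
        reasons.append("contains NUL byte")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        reasons.append(f"invalid UTF-8 at byte offset {err.start}")
    # Report only the first over-long line to keep the output short.
    for number, line in enumerate(data.split(b"\n"), start=1):
        if len(line) > max_line_len:
            reasons.append(f"line {number} is {len(line)} bytes long")
            break
    # Distinct 3-byte sequences; a large count suggests non-text content.
    trigrams = {data[i:i + 3] for i in range(len(data) - 2)}
    if len(trigrams) > max_trigrams:
        reasons.append(f"{len(trigrams)} distinct trigrams")
    return reasons
```

If a cap on distinct trigrams were in play, a short line could be enough to push a file that is already near the cap over it, so removing that line would make the file pass. Whether that is what happened with book1.txt is not established in this thread.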

hillu mentioned this issue Aug 18, 2024

cindex: Log skipped files #93

Open
