cindex is silently ignoring some text files and there's no way to tell why #80

Open
victor-sudakov opened this issue Jan 4, 2022 · 3 comments

Comments

@victor-sudakov

I have a couple of text files (UTF-8, with mostly ASCII and Cyrillic characters) which cindex/csearch ignore.

The worst problem is that I cannot tell why cindex ignores them: there is no "verbose" option to cindex. Maybe there is a character somewhere in the file that cindex does not like, but how do I tell?

iconv -f utf-8 -t utf-16 < text/book1.txt > /dev/null never complains so I presume the book1.txt file is valid UTF-8. But cindex excludes it from search.
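
The iconv check only answers yes or no. A minimal sketch, assuming Python 3 is available, that reports the byte offset of the first invalid UTF-8 sequence, if there is one. The name `first_invalid_utf8` is illustrative and not part of codesearch, and Python's strict decoder may not apply exactly the same rules as cindex:

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as err:
        return err.start  # offset of the first offending byte
    return None
```

It could be used on the file from this report with `first_invalid_utf8(open("text/book1.txt", "rb").read())`.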

codesearch version:
codesearch/oldstable,now 0.0~hg20120502-3+b11 amd64 on Debian 10.

The problem may be related to #26

@dgryski
Contributor

dgryski commented Jan 4, 2022

I believe there is also a line length limit that causes files to not be indexed.

You might have better luck switching to zoekt if possible.

@victor-sudakov
Author

> I believe there is also a line length limit that causes files to not be indexed.

I've just tried glimpse on it. glimpseindex skips this file too, but it can be forced to index it with glimpseindex -E.
There are long lines somewhere in the file indeed.

$ file text/book1.txt 
text/book1.txt: UTF-8 Unicode text, with very long lines

I should probably grep the text for long lines and see what comes out.
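
A minimal sketch of that search, assuming Python 3. It lists every line whose length in bytes exceeds a limit. The `limit` parameter is a placeholder to experiment with, since the thread does not state what limit cindex applies, and `long_lines` is an illustrative name:

```python
from typing import List, Tuple


def long_lines(data: bytes, limit: int) -> List[Tuple[int, int]]:
    """Return (line_number, length_in_bytes) for each line longer than limit bytes."""
    hits = []
    # Split on newlines only; line numbers are 1-based, as in grep -n.
    for number, line in enumerate(data.split(b"\n"), start=1):
        if len(line) > limit:
            hits.append((number, len(line)))
    return hits
```

Lengths are measured in bytes rather than characters, which matters for Cyrillic text where most characters take two bytes in UTF-8.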

> You might have better luck switching to zoekt if possible.

Not in the Debian repo unfortunately.

@victor-sudakov
Author

I've found the offending line. It is not even long, but removing it allows indexing again. The whole line is below (yes, it's the whole line by itself):

(HTTPConnectionPool(host='172.31.38.116', port=8008): Max retries exceeded with

I'm really surprised. There should be a switch to cindex to either disable the file content heuristics or to report verbosely which one rejected a file.
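
The thread never establishes which heuristic rejected the file. One property worth measuring, purely as an assumption here, is the number of distinct byte trigrams: a trigram-based indexer could plausibly cap it, and a short line can still add enough new trigrams to tip a file over such a cap. A minimal sketch in Python 3 that counts them and shows how many a given line contributes. The function names are illustrative and no actual cindex threshold is implied:

```python
def distinct_trigrams(data: bytes) -> set:
    """Return the set of distinct 3-byte sequences that occur in data."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


def trigrams_added_by(line: bytes, rest: bytes) -> int:
    """Count trigrams that occur in line but nowhere in rest.

    Trigrams that span the boundary between line and its neighbours
    are deliberately ignored to keep the sketch simple.
    """
    return len(distinct_trigrams(line) - distinct_trigrams(rest))
```

Comparing `len(distinct_trigrams(data))` for the file with and without the suspect line would show whether that line changes the count at all, which helps confirm or rule out this guess.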
