Vision Explorer

Capture any area of your screen and get instant text extraction and visual descriptions using local, privacy-focused multimodal language models with Ollama.

Features

Screen Area Capture: A simple, intuitive tool to select any portion of your screen.
Local AI Processing: Leverages Ollama to run powerful multimodal models (like LLaVA or Qwen-VL) locally on your machine. Your data never leaves your computer.
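Two-Pass Analysis: Each capture is analyzed twice, once to extract any text it contains and once to generate a visual description.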
Responsive UI: Built with Kivy, the user interface remains responsive while the AI processes the image in the background.
Side-by-Side View: Immediately compare the original captured image with the extracted text and the AI-generated visual description.
Configurable: Easily change the Ollama endpoint, model, and other settings via a simple config.json file.
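As an illustration of the configuration feature, here is a minimal sketch of loading settings with fallback defaults. The key names (`ollama_endpoint`, `model`, `timeout_seconds`) and the `load_settings` helper are assumptions made for this example and may not match what the application actually reads.

```python
import json
from pathlib import Path

# Hypothetical defaults; the real key names used by Vision Explorer may differ.
DEFAULT_SETTINGS = {
    "ollama_endpoint": "http://localhost:11434",  # Ollama's default local address
    "model": "qwen2.5vl:3b",
    "timeout_seconds": 120,
}


def load_settings(path="config.json"):
    """Return the defaults overlaid with any values found in the JSON file."""
    settings = dict(DEFAULT_SETTINGS)
    config_file = Path(path)
    if config_file.exists():
        with config_file.open(encoding="utf-8") as handle:
            settings.update(json.load(handle))
    return settings
```

With this approach a settings file only needs to list the values that differ from the defaults, and a missing file is not an error.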

Requirements

Python 3.8+
An active Ollama instance running a multimodal model.
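To confirm that Ollama is running and has the model available, you can query its `/api/tags` endpoint, which lists the models that have been pulled. The sketch below assumes the default address `http://localhost:11434`. The helper names `fetch_tags` and `model_is_listed` are hypothetical and not part of this project.

```python
import json
import urllib.request


def model_is_listed(tags_response, model_name):
    """Return True if model_name appears in a parsed /api/tags response."""
    names = [entry.get("name", "") for entry in tags_response.get("models", [])]
    return model_name in names


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Ask a running Ollama server which models it has pulled.

    Raises OSError (for example URLError) if the server cannot be reached.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)
```

Calling `model_is_listed(fetch_tags(), "qwen2.5vl:3b")` should return `True` once the model has been pulled. An `OSError` from `fetch_tags` usually means Ollama is not running.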

Recommended Models:

qwen2.5vl:3b (used in development)

You can pull a model by running:

```
ollama pull qwen2.5vl:3b
```

Installation

Clone the repository:

```
git clone <your-repo-url>
cd VisionExplorer
```

Install the required Python packages. Using a virtual environment is recommended:
```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```
Install dependencies from requirements.txt:
```
pip install -r requirements.txt
```
(If a requirements.txt is not available, you can install the packages manually: pip install kivy Pillow requests ollama)

Usage

Ensure your Ollama application is running in the background.
Run the Vision Explorer application:
```
python main.py
```
The application will launch with a screen capture overlay.
Click and drag to select the desired area of your screen.
Release the mouse to confirm the selection. The main window will appear.
The application will show a "Processing..." status. Once the Ollama model responds, the "Extracted Text" and "Visual Description" columns will be populated.
Click the "Capture Screen Area" button to start a new capture.
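The click-and-drag selection in the steps above reduces to turning two points into a crop rectangle. The sketch below shows one way to do that. The names `selection_box` and `is_usable` and the 5-pixel minimum are assumptions for illustration, not the project's actual code.

```python
def selection_box(press, release):
    """Convert the press and release points into (left, upper, right, lower).

    The drag may go in any direction, so the corners are sorted per axis.
    """
    (x0, y0), (x1, y1) = press, release
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def is_usable(box, min_size=5):
    """Reject accidental clicks that produce a tiny or empty selection."""
    left, upper, right, lower = box
    return (right - left) >= min_size and (lower - upper) >= min_size
```

The resulting tuple uses the same ordering that Pillow's `Image.crop` expects. Note that Kivy places the origin at the bottom-left of the window while Pillow uses the top-left, so the y values may need flipping before cropping.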

To cancel a capture, press the ESC key.
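The processing step described above (two model passes while the window stays responsive) could be sketched as follows. The prompts, the `analyze_in_background` helper, and the injected `send` callable are assumptions for illustration. The payload fields follow Ollama's `POST /api/generate` endpoint, which accepts base64-encoded images.

```python
import base64
import threading

# Hypothetical prompts; the project's actual wording is not shown in this README.
TEXT_PROMPT = "Transcribe all text visible in this image."
DESCRIBE_PROMPT = "Describe the visual content of this image."


def build_payload(model, prompt, image_bytes):
    """Build a JSON body for Ollama's POST /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON reply instead of a stream
    }


def analyze_in_background(image_bytes, send, on_done, model="qwen2.5vl:3b"):
    """Run both passes on a worker thread, then hand the results to on_done.

    `send` is a hypothetical callable that posts a payload to Ollama and
    returns the model's reply as a string.
    """

    def worker():
        text = send(build_payload(model, TEXT_PROMPT, image_bytes))
        description = send(build_payload(model, DESCRIBE_PROMPT, image_bytes))
        on_done(text, description)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

In a real application `send` might wrap something like `requests.post(endpoint + "/api/generate", json=payload).json()["response"]`. In Kivy, `on_done` would typically schedule the widget update on the main thread (for example with `Clock.schedule_once`) rather than touching the UI from the worker.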

License

This project is open-source and available under the MIT License.

kursad-k/visionexplorer
