Author David C Cavalcante
LinkedIn https://linkedin.com/in/hellodav
The CyberTech VLM Detector is a computer vision system designed to run entirely on edge devices, without requiring cloud access. The system uses vision-language models (VLMs) to detect and locate objects in images based on natural language commands. For more on my research and development, including the creation of HIM™ (Hybrid Intelligence Massive) and MAIC™ (Massive Artificial Intelligence Consciousness), see PhilPeople: https://philpapers.org/rec/CRTBCI "Beyond Consciousness in LLMs: Investigating the "Soul" in Self-Aware AI". HIM™ is a hybrid intelligent entity model that enables embodied and collaborative interaction between humans and multi-agents, integrating personality and machine learning. MAIC™ explores the frontier of persistent and self-reflective artificial consciousness, focusing on the emergence of self-awareness and adaptive learning in large-scale AI systems; both are published on GitHub and Hugging Face.
- Works completely on-device, under limited memory and computing conditions
- Accepts natural language commands (e.g., "Take the scissors")
- Detects and locates objects not seen during training
- Returns bounding boxes over the correct objects
- Visual interface with green cybernetic style overlay
The system uses an innovative VLM-based approach that doesn't rely on anchor-based detectors such as YOLO or Faster R-CNN. The detection strategy includes the following steps (a minimal sketch follows the list):
- Object Proposal Generation: The system divides the image into a grid of candidate regions of different sizes.
- CLIP Embeddings: Uses the CLIP model to generate embeddings for both the prompt text and candidate image regions.
- Semantic Matching: Calculates cosine similarity between text and image embeddings to identify regions that best match the prompt.
- Confidence Filtering: Applies an adaptive confidence threshold through the HIM™ (Hybrid Intelligence Massive) system to filter low-quality detections.
- Augmented Visualization: Adds a visual overlay with detection information, system statistics, and confidence feedback.
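The sketch below illustrates the grid-proposal and CLIP-matching steps, assuming the openai/CLIP package, PyTorch, and OpenCV. The model variant, grid sizes, and the 0.25 threshold are illustrative assumptions, not the exact values used by CyberTechVLMDetector.py.

```python
import torch
import clip
import cv2
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed model variant

def grid_proposals(w, h, steps=4):
    """Yield (x1, y1, x2, y2) candidate boxes on a coarse grid of two sizes."""
    for frac in (0.25, 0.5):
        bw, bh = int(w * frac), int(h * frac)
        sx, sy = max(1, (w - bw) // steps), max(1, (h - bh) // steps)
        for x in range(0, w - bw + 1, sx):
            for y in range(0, h - bh + 1, sy):
                yield x, y, x + bw, y + bh

def detect(image_path, prompt, threshold=0.25):
    bgr = cv2.imread(image_path)
    h, w = bgr.shape[:2]
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

    # Encode every candidate region and the text prompt with CLIP
    boxes = list(grid_proposals(w, h))
    crops = torch.stack([preprocess(Image.fromarray(rgb[y1:y2, x1:x2]))
                         for x1, y1, x2, y2 in boxes]).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(crops)
        txt_emb = model.encode_text(text)

    # Cosine similarity between the prompt and each region, then threshold
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(1)
    keep = torch.where(sims > threshold)[0].tolist()
    return [(boxes[i], sims[i].item()) for i in keep]
```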
The system is built on the following components:
- CLIP (Contrastive Language-Image Pre-training): Main model for understanding both the prompt text and the visual content of the image.
- MAIC™ (Massive Artificial Intelligence Consciousness): Artificial consciousness system that performs self-reflection on detections and maintains an experience history.
- HIM™ (Hybrid Intelligence Massive): Adaptive system that adjusts confidence thresholds based on detection history.
- OpenCV: Used for image processing and visualization.
- PyTorch: Machine learning framework used to run the CLIP model.
The system can detect objects not seen during training through:
- Language-Vision Embeddings: CLIP was trained on a large set of image-text pairs from the internet, allowing it to understand a wide variety of visual concepts.
- Zero-Shot Matching: The system doesn't rely on predefined classes, but rather on semantic similarity between the text prompt and image regions.
- MAIC™ Memory: The system maintains a history of previous detections, allowing it to improve over time through accumulated experience.
- HIM™ Adaptive Learning: Automatically adjusts confidence thresholds based on the detection history for a given object type (see the sketch after this list).
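As a rough illustration of the adaptive-threshold idea, the sketch below keeps a per-label confidence history and derives a threshold from it once enough samples exist. The class name, constants, and file layout are assumptions for illustration, not the actual HIM™ implementation.

```python
import json
from collections import defaultdict

class AdaptiveThreshold:
    """Hypothetical HIM™-style helper: per-object history drives the threshold."""

    def __init__(self, base=0.25, min_history=5, memory_path="maic_memory.json"):
        self.base = base                  # threshold used before enough history exists
        self.min_history = min_history    # detections needed before adapting
        self.memory_path = memory_path
        self.history = defaultdict(list)  # object label -> list of confidence scores

    def record(self, label, confidence):
        self.history[label].append(confidence)

    def threshold_for(self, label):
        scores = self.history[label]
        if len(scores) < self.min_history:
            return self.base
        mean = sum(scores) / len(scores)
        # Nudge the threshold toward a fraction of the historical mean confidence,
        # clamped so it never becomes trivially easy or impossibly strict.
        return max(0.10, min(0.50, 0.8 * mean))

    def save(self):
        with open(self.memory_path, "w") as f:
            json.dump(self.history, f, indent=2)
```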
The system was designed to be efficient on edge devices through:
- Model Optimization: Uses optimized versions of CLIP for CPU when GPUs are not available.
- Batch Processing: Processes multiple candidate regions in batches to maximize computational efficiency.
- Efficient Proposal Generation: Uses an adaptive grid approach that balances coverage and efficiency.
- Intelligent Fallback: Automatically detects hardware capabilities and adjusts the processing pipeline accordingly (see the sketch after this list).
- Local Memory Storage: Maintains a detection history in a local JSON file for persistence without cloud services.
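A minimal sketch of the hardware-fallback and batched-inference ideas, assuming the openai/CLIP package; the batch size and model variant are illustrative assumptions.

```python
import torch
import clip

def load_model():
    """Pick the best available device and load CLIP accordingly."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    return model, preprocess, device

def encode_regions(model, crops, device, batch_size=16):
    """Encode preprocessed region tensors in small batches to bound memory use."""
    embeddings = []
    with torch.no_grad():
        for i in range(0, len(crops), batch_size):
            batch = torch.stack(crops[i:i + batch_size]).to(device)
            emb = model.encode_image(batch)
            embeddings.append(emb / emb.norm(dim=-1, keepdim=True))
    return torch.cat(embeddings)
```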
- Run the Python script:
  python CyberTechVLMDetector.py
- Select a menu option or enter a custom prompt.
- The system will process the image and display results with green bounding boxes around detected objects.
- Results are saved in the output/ folder with an informative overlay.
The main detector class implements VLM-based detection using CLIP. It is responsible for:
- Loading and preprocessing images
- Generating object proposals
- Calculating text-image similarities
- Returning bounding boxes and confidence scores
The MAIC™ component implements the artificial consciousness system that:
- Maintains an internal state of attention and confidence
- Performs self-reflection on detections
- Generates insights based on accumulated experience
- Adjusts internal parameters based on results
The HIM™ component implements the hybrid intelligence module that:
- Maintains a detection history by object type
- Adjusts confidence thresholds based on historical performance
- Activates adaptive learning after sufficient detections
The memory component, backed by maic_memory.json, manages persistent storage of the following (an assumed layout is sketched after this list):
- Detection history
- Consciousness states
- Generated insights
- Adaptive configurations
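The layout below is an assumed example of what maic_memory.json might contain, mirroring the record types listed above; the actual keys and values used by the project may differ.

```python
import json

# Illustrative memory structure; field names are assumptions, not the real schema.
memory = {
    "detections": [
        {"prompt": "Take the scissors", "box": [120, 80, 260, 210], "confidence": 0.41}
    ],
    "consciousness_states": [
        {"attention_focus": "scissors", "confidence_level": 0.41,
         "uncertainty": 0.20, "reflection_depth": 1}
    ],
    "insights": [],
    "adaptive_config": {"scissors": {"threshold": 0.22, "samples": 7}},
}

with open("maic_memory.json", "w") as f:
    json.dump(memory, f, indent=2)
```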
The system provides detailed feedback during detection:
- Detection Results: Shows prompt, object count, coordinates, and confidence
- MAIC Consciousness State: Displays attention focus, confidence level, uncertainty, and reflection depth
- HIM Adaptive Learning Status: Shows current confidence threshold and detection history
- Low Confidence Messages: Alerts when average confidence is low or no objects are detected
- Performance depends on image quality and prompt clarity
- Very small or partially visible objects may be difficult to detect
- The system works best with specific and descriptive prompts
- Initial CLIP model loading may take some time on resource-limited devices
pip install torch torchvision
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
pip install opencv-python matplotlib numpy
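After installing the dependencies, a quick import check confirms that PyTorch and the CLIP package are available (both calls below are part of those packages):

```python
import torch
import clip

print("CUDA available:", torch.cuda.is_available())
print("CLIP checkpoints:", clip.available_models())
```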
Project structure:
.
├── CyberTechVLMDetector.py # Main script of the system
├── input/ # Folder containing input images
│ └── VLM_Scenario-image.jpeg
├── output/ # Folder where results are saved
├── maic_memory.json # Persistent memory file
└── README.md # This file
The CyberTech VLM Detector demonstrates how vision-language models can be used to build flexible and generalizable object detection systems that run entirely on edge devices, without relying on traditional anchor-based detectors or cloud services.
For questions or issues, contact me on LinkedIn: https://linkedin.com/in/hellodav. Please refer to the code documentation and comments within the implementation files for more details.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Copyright David C Cavalcante