Commit b900ece

add enrichment via Ollama multimodal models (e.g. LLaVA)

1 parent bbe3fc4 · 10 files changed · +238 -2 lines changed

Dockerfile-amd64-cpu

Lines changed: 1 addition & 0 deletions

@@ -12,6 +12,7 @@ RUN apt-get -y update \
 RUN mkdir /app
 COPY ./*.py ./requirements.txt /app/
 RUN pip install --no-cache-dir -r /app/requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
+COPY ./enrichment-prompts /

 WORKDIR /app
 RUN curl -f -L -O https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8n.pt

Dockerfile-amd64-cuda

Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@ RUN mkdir /app
 COPY ./*.py ./requirements.txt /app/
 RUN pip install --no-cache-dir nvidia-tensorrt
 RUN pip install --no-cache-dir -r /app/requirements.txt
+COPY ./enrichment-prompts /

 WORKDIR /app
 RUN curl -f -L -O https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8n.pt

Dockerfile-arm64

Lines changed: 1 addition & 0 deletions

@@ -12,6 +12,7 @@ RUN apt-get -y update \
 RUN mkdir /app
 COPY ./*.py ./requirements.txt /app/
 RUN pip install --no-cache-dir -r /app/requirements.txt
+COPY ./enrichment-prompts /

 WORKDIR /app
 RUN curl -f -L -O https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8n.pt

README.md

Lines changed: 25 additions & 1 deletion

@@ -4,7 +4,9 @@

 `driveway-monitor` accepts an RTSP video stream (or, for testing purposes, a video file) and uses the [YOLOv8 model](https://docs.ultralytics.com/models/yolov8/) to track objects in the video. When an object meets your notification criteria (highly customizable; see "Configuration" below), `driveway-monitor` will notify you via [Ntfy](https://ntfy.sh). The notification includes a snapshot of the object that triggered the notification and provides options to mute notifications for a period of time.

-The model can run on your CPU or on NVIDIA or Apple Silicon GPUs. It would be possible to use a customized model, and in fact I originally planned to refine my own model based on YOLOv8, but it turned out that the pretrained YOLOv8 model seems to work fine.
+The YOLO computer vision model can run on your CPU or on NVIDIA or Apple Silicon GPUs. It would be possible to use a customized model, and in fact I originally planned to refine my own model based on YOLOv8, but it turned out that the pretrained YOLOv8 model works fine.
+
+Optionally, `driveway-monitor` can also use an instance of [Ollama](https://ollama.com) to provide a detailed description of the object that triggered the notification.

 [This short video](doc/ntfy-mute-ui.mov) gives an overview of the end result. A notification is received; clicking the "Mute" button results in another notification with options to extend the mute time period or unmute the system. Tapping on the notification would open an image of me in my driveway; this isn't shown in the video for privacy reasons.

@@ -52,6 +54,7 @@ services:
     image: cdzombak/driveway-monitor:1-amd64-cuda
     volumes:
       - ./config.json:/config.json:ro
+      - ./enrichment-prompts:/enrichment-prompts:ro
     command:
       [
         "--debug",

@@ -143,6 +146,20 @@ The prediction process consumes a video stream frame-by-frame and feeds each fra

 The tracker process aggregates the model's predictions over time, building tracks that represent the movement of individual objects in the video stream. Every time a track is updated with a prediction from a new frame, the tracker evaluates the track against the notification criteria. If the track meets the criteria, a notification is triggered.

+### Enrichment
+
+Enrichment is an optional feature that uses an [Ollama](https://ollama.com) model to generate a more detailed description of the object that triggered a notification. If the Ollama model succeeds, the resulting description is included in the notification's message.
+
+To use enrichment, you'll need a working Ollama setup with a multimodal model installed. `driveway-monitor` does not provide this, since it's not necessary for the core feature set, and honestly it provides little additional value.
+
+The best results I've gotten (which still are not stellar) come from [the LLaVA 13b model](https://ollama.com/library/llava). It usually returns a result in under 3 seconds (when running on a 2080 Ti). On a CPU or a less powerful GPU, consider `llava:7b`, [`llava-llama3`](https://ollama.com/library/llava-llama3), or just skip enrichment altogether.
+
+You can adjust how long enrichment waits for Ollama to generate a response by setting `enrichment.timeout_s` in your config. If you want to use enrichment, I highly recommend setting an aggressive timeout to keep `driveway-monitor` responsive.
+
+Using enrichment requires providing a _prompt file_ for each YOLO object classification (e.g. `car`, `truck`, `person`) you want to enrich. This allows giving different instructions to your Ollama model for people vs. cars, for example. The `enrichment_prompts` directory provides a useful set of prompt files to get you started.
+
+When running `driveway-monitor` in Docker, keep in mind that your enrichment prompt files must be mounted in the container, and the paths in your config file must reflect the paths inside the container.
+
 ### Notifier

 (Configuration key: `notifier`.)

@@ -174,6 +191,13 @@ The file is a single JSON object containing the following keys, or a subset ther
 - `tracker`: Configures the system that builds tracks from the model's detections over time.
   - `inactive_track_prune_s`: Specifies the number of seconds after which an inactive track is pruned. This prevents incorrectly adding a new prediction to an old track.
   - `track_connect_min_overlap`: Minimum overlap percentage of a prediction box with the average of the last 2 boxes in an existing track for the prediction to be added to that track.
+- `enrichment`: Configures the subsystem that enriches notifications via the Ollama API.
+  - `enable`: Whether to enable enrichment via Ollama. Defaults to `false`.
+  - `endpoint`: Complete URL to the Ollama `/generate` endpoint, e.g. `http://localhost:11434/api/generate`.
+  - `keep_alive`: Ask Ollama to keep the model in memory for this long after the request. String, formatted like `60m`. ([See the Ollama API docs.](https://github.com/ollama/ollama/blob/main/docs/api.md#parameters))
+  - `model`: The name of the Ollama model to use, e.g. `llava` or `llava:13b`.
+  - `prompt_files`: Map of `YOLO classification name` → `path`. Each path is a file containing the prompt to give Ollama along with an image of that YOLO classification.
+  - `timeout_s`: Timeout for the Ollama request, in seconds. This includes connection/network time _and_ the time Ollama takes to generate a response.
 - `notifier`: Configures how notifications are sent.
   - `debounce_threshold_s`: Specifies the number of seconds to wait after a notification before sending another one for the same type of object.
   - `default_priority`: Default priority for notifications. ([See Ntfy docs on Message Priority](https://docs.ntfy.sh/publish/#message-priority).)
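To illustrate the Docker path note in the Enrichment section above: with the compose volume mount shown earlier (`./enrichment-prompts` mounted at `/enrichment-prompts`), the `prompt_files` entries in the mounted `config.json` should use the in-container paths. The host name and file names below are illustrative placeholders; use your own Ollama endpoint and whichever prompt files you actually mount.

```json
"enrichment": {
  "enable": true,
  "endpoint": "http://your-ollama-host:11434/api/generate",
  "model": "llava:13b",
  "timeout_s": 5,
  "prompt_files": {
    "car": "/enrichment-prompts/llava_prompt_car.txt",
    "person": "/enrichment-prompts/llava_prompt_person.txt"
  }
}
```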

config.example.json

Lines changed: 12 additions & 0 deletions

@@ -35,6 +35,18 @@
     },
     "image_method": "attach"
   },
+  "enrichment": {
+    "enable": true,
+    "endpoint": "https://mygpuserver.tailnet-example.ts.net:11434/api/generate",
+    "model": "llava",
+    "keep_alive": "60m",
+    "timeout_s": 5,
+    "prompt_files": {
+      "car": "enrichment_prompts/llava_prompt_car.txt",
+      "truck": "enrichment_prompts/llava_prompt_truck.txt",
+      "person": "enrichment_prompts/llava_prompt_person.txt"
+    }
+  },
   "web": {
     "port": 5550,
     "external_base_url": "https://mymachine.tailnet-example.ts.net:5559"

config.py

Lines changed: 56 additions & 0 deletions

@@ -258,5 +258,61 @@ def config_from_file(
     # health:
     cfg.health_pinger.req_timeout_s = int(cfg.model.liveness_tick_s - 1.0)

+    # enrichment:
+    enrichment_dict = cfg_dict.get("enrichment", {})
+    cfg.notifier.enrichment.enable = enrichment_dict.get(
+        "enable", cfg.notifier.enrichment.enable
+    )
+    if not isinstance(cfg.notifier.enrichment.enable, bool):
+        raise ConfigValidationError("enrichment.enable must be a bool")
+    if cfg.notifier.enrichment.enable:
+        cfg.notifier.enrichment.prompt_files = enrichment_dict.get(
+            "prompt_files", cfg.notifier.enrichment.prompt_files
+        )
+        if not isinstance(cfg.notifier.enrichment.prompt_files, dict):
+            raise ConfigValidationError("enrichment.prompt_files must be a dict")
+        for k, v in cfg.notifier.enrichment.prompt_files.items():
+            if not isinstance(k, str) or not isinstance(v, str):
+                raise ConfigValidationError(
+                    "enrichment.prompt_files must be a dict of str -> str"
+                )
+            try:
+                with open(v) as f:
+                    f.read()
+            except Exception as e:
+                raise ConfigValidationError(
+                    f"enrichment.prompt_files: error reading file '{v}': {e}"
+                )
+        cfg.notifier.enrichment.endpoint = enrichment_dict.get(
+            "endpoint", cfg.notifier.enrichment.endpoint
+        )
+        if not cfg.notifier.enrichment.endpoint or not isinstance(
+            cfg.notifier.enrichment.endpoint, str
+        ):
+            raise ConfigValidationError("enrichment.endpoint must be a string")
+        if not (
+            cfg.notifier.enrichment.endpoint.casefold().startswith("http://")
+            or cfg.notifier.enrichment.endpoint.casefold().startswith("https://")
+        ):
+            # noinspection HttpUrlsUsage
+            raise ConfigValidationError(
+                "enrichment.endpoint must start with http:// or https://"
+            )
+        cfg.notifier.enrichment.model = enrichment_dict.get(
+            "model", cfg.notifier.enrichment.model
+        )
+        if not isinstance(cfg.notifier.enrichment.model, str):
+            raise ConfigValidationError("enrichment.model must be a string")
+        cfg.notifier.enrichment.timeout_s = enrichment_dict.get(
+            "timeout_s", cfg.notifier.enrichment.timeout_s
+        )
+        if not isinstance(cfg.notifier.enrichment.timeout_s, (int, float)):
+            raise ConfigValidationError("enrichment.timeout_s must be a number")
+        cfg.notifier.enrichment.keep_alive = enrichment_dict.get(
+            "keep_alive", cfg.notifier.enrichment.keep_alive
+        )
+        if not isinstance(cfg.notifier.enrichment.keep_alive, str):
+            raise ConfigValidationError("enrichment.keep_alive must be a str")
+
     logger.info("config loaded & validated")
     return cfg
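As a reading aid for the validation above: when `enrichment.enable` is true, the only key that must be supplied explicitly is a well-formed `endpoint`; `model`, `keep_alive`, and `timeout_s` fall back to the `EnrichmentConfig` defaults (see `ntfy.py` below), and `prompt_files` defaults to an empty map, which passes validation but means no classification will actually be enriched. A minimal sketch of a config fragment this code accepts:

```json
{
  "enrichment": {
    "enable": true,
    "endpoint": "http://localhost:11434/api/generate"
  }
}
```
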
llava_prompt_car.txt

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
+This is an image of a vehicle, taken from a security camera. Identify the vehicle's most likely type, according to the following rules:
+
+- If it looks like an Amazon delivery vehicle, its type is "Amazon delivery". Notes: Any vehicle with the word "prime" on it is an Amazon delivery vehicle. Any vehicle with Amazon's logo on it is an Amazon delivery vehicle. A vehicle that looks like a passenger car is NOT an Amazon delivery vehicle.
+- If it looks like a UPS delivery vehicle, its type is "UPS delivery". Notes: UPS delivery vehicles are painted dark brown. Any light-colored vehicle is NOT a UPS delivery vehicle.
+- If it looks like a FedEx delivery vehicle, its type is "FedEx delivery". Note: Any dark-colored vehicle is NOT a FedEx delivery vehicle.
+- If it looks like a USPS delivery vehicle, its type is "USPS delivery". Note: Any dark-colored vehicle is NOT a USPS delivery vehicle.
+- If it looks like a yellow DHL delivery van, its type is "DHL delivery".
+- If it looks like a pizza delivery vehicle, its type is "pizza delivery".
+- If it looks like a contractor's truck, plumber's truck, electrician's truck, or a construction vehicle, its type is "contractor".
+- If it looks like a pickup truck, its type is "pickup truck".
+- If it looks like a sedan, coupe, hatchback, or passenger car, its type is "passenger car".
+- If it does not look like any of those, you should describe its type in 3 words or less. Do not include any punctuation or any non-alphanumeric characters.
+
+Your response MUST be a valid JSON object with exactly two keys, "desc" and "error":
+
+- "desc" will contain the vehicle type you identified/described. If you could not identify or describe the vehicle, "desc" is "unknown". If there was no vehicle in the image, "desc" is an empty string ("").
+- IF AND ONLY IF you could not identify the vehicle, "error" will describe what went wrong. If you identified the vehicle's type, do not provide any error message.
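For reference, a model response that satisfies this prompt would look something like the following (values are illustrative). Note that `_enrich` in `ntfy.py` below treats a `desc` of `unknown` or an empty string as a failed enrichment and falls back to the plain notification text.

```json
{"desc": "UPS delivery", "error": ""}
```

```json
{"desc": "unknown", "error": "no vehicle is clearly visible in the image"}
```
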
llava_prompt_person.txt

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+This is an image from a security camera. The image contains at least one person.
+
+Identify the person's most likely job, according to these rules:
+
+- If the person is wearing a brown uniform, their job is "UPS delivery".
+- If the person is wearing a purple uniform, their job is "FedEx delivery".
+- If the person is wearing a blue uniform or a blue vest, their job is "Amazon delivery".
+- If the person appears to be wearing some other uniform, you should describe a job their uniform is commonly associated with, in 3 words or less. Do not include any punctuation or any non-alphanumeric characters.
+- If the person isn't wearing a uniform commonly associated with a specific job, or you cannot guess their job for any other reason, their job is "unknown".
+
+Your response MUST be a valid JSON object with exactly two keys: "desc" and "error":
+
+- "desc" will contain the job you identified. If you could not identify the person's job, "desc" is "unknown". If there was no person in the image, "desc" is an empty string ("").
+- IF AND ONLY IF you could not plausibly guess the person's job, "error" will describe what went wrong. If you made a guess at the person's job, do not provide any error message.
llava_prompt_truck.txt

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
+This is an image of a vehicle, taken from a security camera. Identify the vehicle's most likely type, according to the following rules:
+
+- If it looks like an Amazon delivery vehicle, its type is "Amazon delivery". Notes: Any vehicle with the word "prime" on it is an Amazon delivery vehicle. Any vehicle with Amazon's logo on it is an Amazon delivery vehicle. A vehicle that looks like a passenger car is NOT an Amazon delivery vehicle.
+- If it looks like a UPS delivery vehicle, its type is "UPS delivery". Notes: UPS delivery vehicles are painted dark brown. Any light-colored vehicle is NOT a UPS delivery vehicle.
+- If it looks like a FedEx delivery vehicle, its type is "FedEx delivery". Note: Any dark-colored vehicle is NOT a FedEx delivery vehicle.
+- If it looks like a USPS delivery vehicle, its type is "USPS delivery". Note: Any dark-colored vehicle is NOT a USPS delivery vehicle.
+- If it looks like a yellow DHL delivery van, its type is "DHL delivery".
+- If it looks like a pizza delivery vehicle, its type is "pizza delivery".
+- If it looks like a contractor's truck, plumber's truck, electrician's truck, or a construction vehicle, its type is "contractor".
+- If it looks like a pickup truck, its type is "pickup truck".
+- If it looks like a sedan, coupe, hatchback, or passenger car, its type is "passenger car".
+- If it does not look like any of those, you should describe its type in 3 words or less. Do not include any punctuation or any non-alphanumeric characters.
+
+Your response MUST be a valid JSON object with exactly two keys, "desc" and "error":
+
+- "desc" will contain the vehicle type you identified/described. If you could not identify or describe the vehicle, "desc" is "unknown". If there was no vehicle in the image, "desc" is an empty string ("").
+- IF AND ONLY IF you could not identify the vehicle, "error" will describe what went wrong. If you identified the vehicle's type, do not provide any error message.

ntfy.py

Lines changed: 94 additions & 1 deletion

@@ -1,5 +1,7 @@
+import base64
 import dataclasses
 import datetime
+import json
 import logging
 import multiprocessing
 import os.path

@@ -51,8 +53,19 @@ class NtfyRecord:
     jpeg_image: Optional[bytes]


+@dataclasses.dataclass
+class EnrichmentConfig:
+    enable: bool = False
+    endpoint: str = ""
+    keep_alive: str = "240m"
+    model: str = "llava"
+    prompt_files: Dict[str, str] = dataclasses.field(default_factory=lambda: {})
+    timeout_s: float = 5.0
+
+
 @dataclasses.dataclass
 class NtfyConfig:
+    enrichment: EnrichmentConfig = dataclasses.field(default_factory=EnrichmentConfig)
     external_base_url: str = "http://localhost:5550"
     log_level: Optional[int] = logging.INFO
     topic: str = "driveway-monitor"

@@ -84,9 +97,12 @@ class ObjectNotification(Notification):
     event: str
     id: str
     jpeg_image: Optional[bytes]
+    enriched_class: Optional[str] = None

     def message(self):
-        return f"{self.classification} {self.event}.".capitalize()
+        if self.enriched_class:
+            return f"Likely: {self.enriched_class}.".capitalize()
+        return self.title()

     def title(self):
         return f"{self.classification} {self.event}".capitalize()

@@ -235,6 +251,82 @@ def _suppress(self, logger, n: ObjectNotification) -> bool:
         self._last_notification[n.classification] = n.t
         return False

+    def _enrich(self, logger, n: ObjectNotification) -> ObjectNotification:
+        if not self._config.enrichment.enable:
+            return n
+        if not n.jpeg_image:
+            return n
+
+        prompt_file = self._config.enrichment.prompt_files.get(n.classification)
+        if not prompt_file:
+            return n
+        try:
+            with open(prompt_file, "r") as f:
+                enrichment_prompt = f.read()
+        except Exception as e:
+            logger.error(f"error reading enrichment prompt file '{prompt_file}': {e}")
+            return n
+        if not enrichment_prompt:
+            return n
+
+        try:
+            resp = requests.post(
+                self._config.enrichment.endpoint,
+                json={
+                    "model": self._config.enrichment.model,
+                    "stream": False,
+                    "images": [
+                        base64.b64encode(n.jpeg_image).decode("ascii"),
+                    ],
+                    "keep_alive": self._config.enrichment.keep_alive,
+                    "format": "json",
+                    "prompt": enrichment_prompt,
+                },
+                timeout=self._config.enrichment.timeout_s,
+            )
+            parsed = resp.json()
+        except requests.Timeout:
+            logger.error("enrichment request timed out")
+            return n
+        except requests.RequestException as e:
+            logger.error(f"enrichment failed: {e}")
+            return n
+
+        model_resp_str = parsed.get("response")
+        if not model_resp_str:
+            logger.error("enrichment response is missing")
+            return n
+
+        try:
+            model_resp_parsed = json.loads(model_resp_str)
+        except json.JSONDecodeError as e:
+            logger.info(f"enrichment model did not produce valid JSON: {e}")
+            logger.info(f"response: {model_resp_str}")
+            return n
+
+        if "type" not in model_resp_parsed and "error" not in model_resp_parsed:
+            logger.info("enrichment model did not produce expected JSON keys")
+            return n
+
+        model_desc = model_resp_parsed.get("desc", "unknown")
+        if model_desc == "unknown" or model_desc == "":
+            model_err = model_resp_parsed.get("error")
+            if not model_err:
+                model_err = "(no error returned)"
+            logger.info(
+                f"enrichment model could not produce a useful description: {model_err}"
+            )
+            return n
+
+        return ObjectNotification(
+            t=n.t,
+            classification=n.classification,
+            event=n.event,
+            id=n.id,
+            jpeg_image=n.jpeg_image,
+            enriched_class=model_desc,
+        )
+
     def _run(self):
         logger = logging.getLogger(__name__)
         logging.basicConfig(level=self._config.log_level, format=LOG_DEFAULT_FMT)

@@ -260,6 +352,7 @@ def _run(self):
                 jpeg_image=n.jpeg_image,
                 expires_at=n.t + datetime.timedelta(days=1),
             )
+            n = self._enrich(logger, n)

             try:
                 headers = self._prep_ntfy_headers(n)
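For anyone who wants to sanity-check a prompt file and an Ollama endpoint outside of `driveway-monitor`, here is a minimal standalone sketch that mirrors the request `_enrich` makes above. The endpoint, model name, and file paths are assumptions; substitute your own, and note this is an illustration rather than part of the project.

```python
# Minimal sketch for exercising an enrichment prompt against Ollama directly.
# Assumptions: Ollama is reachable at http://localhost:11434 with a multimodal
# model pulled (e.g. `ollama pull llava`), and you supply a prompt file plus a
# JPEG snapshot of your own.
import base64
import json

import requests

OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"  # assumed local Ollama
PROMPT_FILE = "enrichment_prompts/llava_prompt_car.txt"  # any prompt file you use
IMAGE_FILE = "snapshot.jpg"                              # hypothetical test image

with open(PROMPT_FILE) as f:
    prompt = f.read()
with open(IMAGE_FILE, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    OLLAMA_ENDPOINT,
    json={
        "model": "llava",   # or "llava:13b", etc.
        "stream": False,
        "format": "json",   # ask Ollama to constrain output to JSON
        "keep_alive": "60m",
        "prompt": prompt,
        "images": [image_b64],
    },
    timeout=30,
)
resp.raise_for_status()

# Ollama wraps the model's text in the "response" field; the prompt asks the
# model to make that text itself a JSON object with "desc" and "error" keys.
model_output = json.loads(resp.json()["response"])
print("desc:", model_output.get("desc"), "| error:", model_output.get("error"))
```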
