License Classificator — Top-Tier Take-Home Assignment Submission

Objective: Build a production-ready system that classifies software licenses into business typologies using LLM/NLP, with strong engineering rigor, scalability considerations, and reproducibility.

Executive Summary

This project delivers a production-grade LLM-driven classification service that ingests software licenses from Excel, assigns business typologies, generates short natural-language rationales, persists results, and exposes a clean FastAPI interface for automation and review.

Key strengths:

Clean layered architecture (API, domain, persistence, inference)
LLM abstraction supporting provider portability
Human-in-the-loop overrides (enterprise-ready)
Deterministic output constraints for reliability
Test coverage + Dockerized deployment
Scalability and MLOps strategy clearly defined

This solution reflects real-world ML engineering standards, not just a proof-of-concept.

Problem Statement

Given a list of software licenses, classify each into one of six typologies:

Productivity
Design
Communication
Development
Finance
Marketing

Each classification must include a ≤150-character explanation and support manual correction without LLM override.

Technical Approach

Pipeline Overview

Excel Input → Validation → SQLite Persistence → LLM Classification → Audit Layer → Excel Export → API Access

Processing Steps

Ingest structured license data from licenses.xlsx
Normalize & persist records in SQLite (SQLModel)
Classify licenses using a local LLM via Ollama
Validate outputs (label constraints + explanation length cap)
Preserve manual overrides (human-in-the-loop)
Export results to output.xlsx
Serve results via FastAPI endpoints

Architecture & Design Decisions

1. Layered & Maintainable Architecture

Layer	Responsibility
API	HTTP contract & request validation
Services	Business logic & workflows
LLM Client	Model provider abstraction
Persistence	SQLModel + SQLite
I/O	Excel ingestion/export

Why: Enables testability, modularity, and clean separation of concerns.

2. LLM-First Semantic Classification

LLMs outperform rule-based NLP when:

License names are ambiguous (e.g., Dynamics 365 Sales)
Vendors bundle multiple functions
Product naming is inconsistent

Benefits:

Generalization to unseen licenses
Natural-language explanations
Lower maintenance burden than heuristics

3. Strict Output Governance (Reliability Layer)

To ensure deterministic and enterprise-safe output:

Typology restricted to a fixed allowed label set
Explanations capped at 150 characters
Fallback classification logic if LLM deviates

Prevents uncontrolled or hallucinated responses.

4. Human-in-the-Loop Overrides

Manual edits always override model output.

Why this matters:

Aligns with enterprise review workflows
Prevents overwriting curated business decisions
Enables active learning & feedback loops

5. Local LLM (Ollama) Instead of Cloud API

Reason	Benefit
No API keys	Easy reproducibility
Offline inference	Privacy-safe
Cost-free	Ideal for demos/tests
Provider abstraction	Easy cloud migration

6. Persistence via SQLite + SQLModel

Lightweight relational audit trail
Tracks typology, explanation, and decision source
Supports reproducibility and debugging

API Contract

Method	Endpoint	Purpose
POST	`/classify`	Run ingestion + LLM classification
GET	`/licenses`	Retrieve stored results
PATCH	`/licenses/{id}`	Manual override classification

FastAPI was selected for performance, type safety, OpenAPI auto-docs, and async LLM support.

Testing Strategy

Implemented

API availability tests
Endpoint response validation
Error-handling coverage

Planned Extensions

Mocked LLM inference tests
Golden-set accuracy benchmarks
Regression tests for classification drift

Ensures robustness as models evolve.

Evaluation Criteria Alignment

Technical Explanation — High

Clear modular design, documented inference logic, and explicit trade-off reasoning.

Code Quality — High

Clean architecture
Separation of concerns
Reusable components

Classification Result — Strong

LLM semantic reasoning
Deterministic label enforcement
Manual correction support

Test Implementation — Expandable

Unit test baseline provided with roadmap for expansion.

Architecture Diagram — System & Data Flow

High-Level System Architecture

┌──────────────────┐
│   licenses.xlsx  │
└─────────┬────────┘
          │  (Ingest)
          ▼
┌──────────────────────────┐
│  Excel I/O Service       │
│  (Pandas + Validation)   │
└─────────┬────────────────┘
          │  (Persist)
          ▼
┌──────────────────────────┐
│ SQLite (SQLModel ORM)    │
│ - Raw licenses           │
│ - Typology               │
│ - Explanation            │
│ - Decision source        │
└─────────┬────────────────┘
          │  (Query)
          ▼
┌──────────────────────────┐
│ Classification Service   │
│ - Business rules         │
│ - Manual override guard  │
└─────────┬────────────────┘
          │  (Inference)
          ▼
┌──────────────────────────┐
│ Ollama LLM Client        │
│ - Prompt templating      │
│ - Output validation      │
└─────────┬────────────────┘
          │
          ▼
┌──────────────────────────┐
│ FastAPI Layer            │
│ - /classify              │
│ - /licenses              │
│ - /licenses/{id} PATCH   │
└─────────┬────────────────┘
          │
          ▼
┌──────────────────────────┐
│ output/output.xlsx       │
│ + API JSON Responses     │
└──────────────────────────┘

End-to-End Data Flow (Request Lifecycle)

Client
│
│ POST /classify
▼
FastAPI Router
│
▼
Read Excel → Normalize → Upsert DB
│
▼
For each License:
├─ If manual → skip LLM
└─ Else → Send to Ollama
            │
            ▼
LLM returns JSON
            │
            ▼
Validate + Persist Result
            │
            ▼
Export output.xlsx
│
▼
Return API Response

Mermaid Diagram (System + Flow) (GitHub-Renderable)

flowchart LR

A[licenses.xlsx] --> B[Excel Loader]
B --> C[(SQLite DB)]
C --> D[Classification Service]
D --> E[Ollama LLM]
E --> D
D --> C
C --> F[FastAPI Endpoints]
F --> G[output.xlsx]
F --> H[JSON API Clients]

Scaling Strategy (Strategic Bonus)

How Would This Scale to 10,000 Licenses / Day?

Compute & Throughput

Batch inference instead of single calls
Async worker pools (Celery / RQ)
Horizontal FastAPI autoscaling

System Design

Queue ingestion (Kafka / RabbitMQ)
Result caching for duplicate licenses
Dedicated LLM inference workers

Goal: Sustain high throughput with controlled cost and latency.

Using Embeddings Instead of Direct Prompting

Upgraded Pipeline

Generate embeddings for each license name
Store vectors in Qdrant / FAISS / Pinecone
Classify via nearest-neighbor semantic similarity
Use LLM only for low-confidence cases

Benefits

Faster inference
Lower cost
Deterministic outputs
Easier quantitative evaluation

Service & Model Versioning Strategy

API Versioning

/v1/classify
/v2/classify

Model & Prompt Versioning

Track prompt templates in Git
Log LLM model hashes
Store inference metadata

Deployment Versioning

Docker image tags
GitHub Actions CI/CD
Rollback-safe releases

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
app		app
output		output
scripts		scripts
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
20260125 license-classificator exercise stem.docx		20260125 license-classificator exercise stem.docx
Dockerfile		Dockerfile
License Classificator - System and Data Flow.png		License Classificator - System and Data Flow.png
README.md		README.md
licenses.xlsx		licenses.xlsx
pytest.ini		pytest.ini
requirements.txt		requirements.txt
submission_notes.md		submission_notes.md

Folders and files

Latest commit

History

Repository files navigation

License Classificator — Top-Tier Take-Home Assignment Submission

Executive Summary

Problem Statement

Technical Approach

Pipeline Overview

Processing Steps

Architecture & Design Decisions

1. Layered & Maintainable Architecture

2. LLM-First Semantic Classification

3. Strict Output Governance (Reliability Layer)

4. Human-in-the-Loop Overrides

5. Local LLM (Ollama) Instead of Cloud API

6. Persistence via SQLite + SQLModel

API Contract

Testing Strategy

Implemented

Planned Extensions

Evaluation Criteria Alignment

Technical Explanation — High

Code Quality — High

Classification Result — Strong

Test Implementation — Expandable

Architecture Diagram — System & Data Flow

High-Level System Architecture

End-to-End Data Flow (Request Lifecycle)

Mermaid Diagram (System + Flow) (GitHub-Renderable)

Scaling Strategy (Strategic Bonus)

How Would This Scale to 10,000 Licenses / Day?

Compute & Throughput

System Design

Using Embeddings Instead of Direct Prompting

Upgraded Pipeline

Benefits

Service & Model Versioning Strategy

API Versioning

Model & Prompt Versioning

Deployment Versioning

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages