Skip to content

ptobarra/license-classificator

Repository files navigation

License Classificator — Top-Tier Take-Home Assignment Submission

Objective: Build a production-ready system that classifies software licenses into business typologies using LLM/NLP, with strong engineering rigor, scalability considerations, and reproducibility.


Executive Summary

This project delivers a production-grade LLM-driven classification service that ingests software licenses from Excel, assigns business typologies, generates short natural-language rationales, persists results, and exposes a clean FastAPI interface for automation and review.

Key strengths:

  • Clean layered architecture (API, domain, persistence, inference)
  • LLM abstraction supporting provider portability
  • Human-in-the-loop overrides (enterprise-ready)
  • Deterministic output constraints for reliability
  • Test coverage + Dockerized deployment
  • Scalability and MLOps strategy clearly defined

This solution reflects real-world ML engineering standards, not just a proof-of-concept.


Problem Statement

Given a list of software licenses, classify each into one of six typologies:

  • Productivity
  • Design
  • Communication
  • Development
  • Finance
  • Marketing

Each classification must include a ≤150-character explanation and support manual correction without LLM override.


Technical Approach

Pipeline Overview

Excel Input → Validation → SQLite Persistence → LLM Classification → Audit Layer → Excel Export → API Access

License Classificator - System and Data Flow

Processing Steps

  1. Ingest structured license data from licenses.xlsx
  2. Normalize & persist records in SQLite (SQLModel)
  3. Classify licenses using a local LLM via Ollama
  4. Validate outputs (label constraints + explanation length cap)
  5. Preserve manual overrides (human-in-the-loop)
  6. Export results to output.xlsx
  7. Serve results via FastAPI endpoints

Architecture & Design Decisions

1. Layered & Maintainable Architecture

Layer Responsibility
API HTTP contract & request validation
Services Business logic & workflows
LLM Client Model provider abstraction
Persistence SQLModel + SQLite
I/O Excel ingestion/export

Why: Enables testability, modularity, and clean separation of concerns.


2. LLM-First Semantic Classification

LLMs outperform rule-based NLP when:

  • License names are ambiguous (e.g., Dynamics 365 Sales)
  • Vendors bundle multiple functions
  • Product naming is inconsistent

Benefits:

  • Generalization to unseen licenses
  • Natural-language explanations
  • Lower maintenance burden than heuristics

3. Strict Output Governance (Reliability Layer)

To ensure deterministic and enterprise-safe output:

  • Typology restricted to a fixed allowed label set
  • Explanations capped at 150 characters
  • Fallback classification logic if LLM deviates

Prevents uncontrolled or hallucinated responses.


4. Human-in-the-Loop Overrides

Manual edits always override model output.

Why this matters:

  • Aligns with enterprise review workflows
  • Prevents overwriting curated business decisions
  • Enables active learning & feedback loops

5. Local LLM (Ollama) Instead of Cloud API

Reason Benefit
No API keys Easy reproducibility
Offline inference Privacy-safe
Cost-free Ideal for demos/tests
Provider abstraction Easy cloud migration

6. Persistence via SQLite + SQLModel

  • Lightweight relational audit trail
  • Tracks typology, explanation, and decision source
  • Supports reproducibility and debugging

API Contract

Method Endpoint Purpose
POST /classify Run ingestion + LLM classification
GET /licenses Retrieve stored results
PATCH /licenses/{id} Manual override classification

FastAPI was selected for performance, type safety, OpenAPI auto-docs, and async LLM support.


Testing Strategy

Implemented

  • API availability tests
  • Endpoint response validation
  • Error-handling coverage

Planned Extensions

  • Mocked LLM inference tests
  • Golden-set accuracy benchmarks
  • Regression tests for classification drift

Ensures robustness as models evolve.


Evaluation Criteria Alignment

Technical Explanation — High

Clear modular design, documented inference logic, and explicit trade-off reasoning.

Code Quality — High

  • Clean architecture
  • Separation of concerns
  • Reusable components

Classification Result — Strong

  • LLM semantic reasoning
  • Deterministic label enforcement
  • Manual correction support

Test Implementation — Expandable

Unit test baseline provided with roadmap for expansion.


Architecture Diagram — System & Data Flow

High-Level System Architecture

┌──────────────────┐
│   licenses.xlsx  │
└─────────┬────────┘
          │  (Ingest)
          ▼
┌──────────────────────────┐
│  Excel I/O Service       │
│  (Pandas + Validation)   │
└─────────┬────────────────┘
          │  (Persist)
          ▼
┌──────────────────────────┐
│ SQLite (SQLModel ORM)    │
│ - Raw licenses           │
│ - Typology               │
│ - Explanation            │
│ - Decision source        │
└─────────┬────────────────┘
          │  (Query)
          ▼
┌──────────────────────────┐
│ Classification Service   │
│ - Business rules         │
│ - Manual override guard  │
└─────────┬────────────────┘
          │  (Inference)
          ▼
┌──────────────────────────┐
│ Ollama LLM Client        │
│ - Prompt templating      │
│ - Output validation      │
└─────────┬────────────────┘
          │
          ▼
┌──────────────────────────┐
│ FastAPI Layer            │
│ - /classify              │
│ - /licenses              │
│ - /licenses/{id} PATCH   │
└─────────┬────────────────┘
          │
          ▼
┌──────────────────────────┐
│ output/output.xlsx       │
│ + API JSON Responses     │
└──────────────────────────┘

End-to-End Data Flow (Request Lifecycle)

Client
│
│ POST /classify
▼
FastAPI Router
│
▼
Read Excel → Normalize → Upsert DB
│
▼
For each License:
├─ If manual → skip LLM
└─ Else → Send to Ollama
            │
            ▼
LLM returns JSON
            │
            ▼
Validate + Persist Result
            │
            ▼
Export output.xlsx
│
▼
Return API Response

Mermaid Diagram (System + Flow) (GitHub-Renderable)

flowchart LR

A[licenses.xlsx] --> B[Excel Loader]
B --> C[(SQLite DB)]
C --> D[Classification Service]
D --> E[Ollama LLM]
E --> D
D --> C
C --> F[FastAPI Endpoints]
F --> G[output.xlsx]
F --> H[JSON API Clients]
Loading

Scaling Strategy (Strategic Bonus)

How Would This Scale to 10,000 Licenses / Day?

Compute & Throughput

  • Batch inference instead of single calls
  • Async worker pools (Celery / RQ)
  • Horizontal FastAPI autoscaling

System Design

  • Queue ingestion (Kafka / RabbitMQ)
  • Result caching for duplicate licenses
  • Dedicated LLM inference workers

Goal: Sustain high throughput with controlled cost and latency.


Using Embeddings Instead of Direct Prompting

Upgraded Pipeline

  • Generate embeddings for each license name
  • Store vectors in Qdrant / FAISS / Pinecone
  • Classify via nearest-neighbor semantic similarity
  • Use LLM only for low-confidence cases

Benefits

  • Faster inference
  • Lower cost
  • Deterministic outputs
  • Easier quantitative evaluation

Service & Model Versioning Strategy

API Versioning

  • /v1/classify
  • /v2/classify

Model & Prompt Versioning

  • Track prompt templates in Git
  • Log LLM model hashes
  • Store inference metadata

Deployment Versioning

  • Docker image tags
  • GitHub Actions CI/CD
  • Rollback-safe releases

About

LLM-powered software license classification API

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors