yanghp123/llm-device-fingerprints

An LLM-based Framework for Fingerprinting Internet-connected Devices

This repository includes code used for training and evaluating transformer-based language models on banners obtained from global Internet scans. The resulting models can be used to generate device embeddings (for downstream learning tasks), as well as to analyze clustered embeddings and generate text-based (regex) fingerprints for detecting software/hardware products.
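To illustrate what a text-based (regex) fingerprint derived from a cluster might look like, here is a minimal sketch. It is not the method implemented in this repository: it assumes a cluster is simply a list of banner strings and uses the longest shared prefix as the fingerprint. The function name `common_prefix_fingerprint` and the example banners are hypothetical.

```python
import os
import re


def common_prefix_fingerprint(banners):
    """Build a regex from the longest prefix shared by all banners in a cluster.

    Illustrative heuristic only; the repository's scripts may derive
    fingerprints in a different way.
    """
    # os.path.commonprefix compares strings character by character,
    # so it works on arbitrary text, not just file paths.
    prefix = os.path.commonprefix(banners)
    if not prefix:
        return None
    # Escape the prefix so that characters such as "." match literally.
    return re.compile(re.escape(prefix))


# Hypothetical cluster of HTTP banners for the same product.
cluster = [
    "Server: ExampleHTTPd/1.2 (Unix)",
    "Server: ExampleHTTPd/1.4 (Unix)",
    "Server: ExampleHTTPd/1.3 (Linux)",
]
fingerprint = common_prefix_fingerprint(cluster)
```

Calling `fingerprint.match(banner)` then tests whether an unseen banner starts with the shared prefix. A prefix is the simplest possible choice; a practical fingerprint would likely also need to tolerate variable fields in the middle of a banner.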

Installation

From the repository root, run python setup.py install to install the package and its dependencies.

Scripts

The scripts directory contains the scripts used for preparing datasets, training the language models, and using clustered embeddings to generate regex fingerprints. Preparing datasets requires banners collected through Internet-wide scans and exported as JSON files. The models in the paper were trained on snapshots from the Censys Universal Internet BigQuery Dataset.
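As a rough sketch of the input side, the snippet below reads banners from a JSON export. It assumes newline-delimited JSON (one record per line) and a text field named `banner`; both the layout and the field name are assumptions made for illustration, so check the scripts for the schema they actually expect.

```python
import io
import json


def iter_banners(lines, text_field="banner"):
    """Yield banner strings from an iterable of JSON lines.

    `text_field` is a hypothetical field name, not taken from the repository.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get(text_field)
        if text:  # skip records with a missing or empty banner
            yield text


# In-memory stand-in for an exported file; real exports would be opened with open().
sample = io.StringIO(
    '{"ip": "192.0.2.1", "banner": "SSH-2.0-ExampleSSH_1.0"}\n'
    "\n"
    '{"ip": "192.0.2.2", "banner": ""}\n'
    '{"ip": "192.0.2.3", "banner": "220 example FTP ready"}\n'
)
banners = list(iter_banners(sample))
```

Writing the reader as a generator keeps memory use flat, which matters when a snapshot contains many records.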

Using Models

Trained models are available on the HuggingFace Hub and can be further fine-tuned for downstream applications. Currently, the following models are available:

  • roberta-base-banner: A RoBERTa masked language model trained on banners from all protocols available in the Censys database.
  • roberta-embedding-http: A model fine-tuned on HTTP banners (headers) using a contrastive loss function to generate temporally stable embeddings. See scripts/compute_embeddings.py for how to aggregate token embeddings from the last layer of the model to compute banner embeddings.
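One common way to aggregate last-layer token embeddings into a single banner embedding is masked mean pooling. The sketch below assumes that strategy; scripts/compute_embeddings.py remains the authoritative reference and may aggregate differently. It uses plain Python lists so that it runs without any model download, whereas in practice the token embeddings and attention mask would be tensors produced by the model and its tokenizer. The function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens.

    token_embeddings: list of per-token vectors from the last layer.
    attention_mask:   list of 0/1 flags, where 1 marks a real token
                      and 0 marks padding.
    Assumes mean pooling; the repository's script may differ.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [component / count for component in total]


# Toy example: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
banner_embedding = mean_pool(tokens, mask)
```

Masking matters because padded positions would otherwise pull the average toward whatever values the model emits for padding tokens.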

Reference
