Connoiter/king_tutte_datamap_scrolls

King Tutte Scrolls

This repo is a how-to manual for building "King Tutte pipelines," which generate datamaps of collections of embedding vectors. These pipelines are assembled from permissively licensed open source components developed primarily by the Tutte Institute for Mathematics and Computing (TIMC) out of Canada, hence the name "King Tutte."

Funerary mask of Bill Tutte

Introduction

Large collections of neural embedding vectors are proliferating rapidly. For example, at its core every vector database and RAG knowledge base consists of such a collection. Datamaps are visualizations of embedding vector collections implemented via modern web technologies such as HTML, WebGL/WebGPU, JavaScript, CSS, etc. They provide a graphical overview of a collection and enable interactive exploratory data analysis.

These "scrolls" (read: code, documentation, recorded tech talks, and associated commentary) are an agglomeration of permissively licensed repos and web content related to building datamaps, the bulk of which originates out of the Tutte Institute of Canada -- hence the nod to them in this repo's name.

The goal of this King Tutte Scrolls repo is to create a one-stop shop for datamapping, enabling any developer (human and/or agentic) to produce datamaps of any "AI-ready data," that is, data with embedding vectors. If necessary, the production process can include creating the embedding vectors themselves from input data not originally vectorized.

Note that this repo can be built into an Agent Skill (as defined by Anthropic) for use by software development tools and agents. See connoiter.com for pre-built Agent Skill distros.

King Tutte pipelines

Multiple Tutte Institute tools can be chained together into pipelines that produce datamaps, which herein are called King Tutte pipelines. These Tutte Institute tools include:

  • UMAP
  • HDBSCAN
  • DataMapPlot
  • Toponymy
  • EVōC
  • TNT
  • Vectorizers
  • Glasbey

Some of the tools are simply combinations of other tools in the list, configured for a specific use case. For example, EVōC is essentially just UMAP followed by HDBSCAN (two separate Tutte Institute innovations), with both components tuned for the purpose of working with embedding vectors.

For pedagogical purposes, ignore such valuable optimizations for now. The core architecture of King Tutte pipelines has a regular structure.
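As a rough illustration of that regular structure, here is a minimal sketch that chains a dimension-reduction stage into a clustering stage. The stage order is an assumption based on the EVōC description above (UMAP followed by HDBSCAN). The function names `run_pipeline` and `default_stages`, and the parameter values, are hypothetical and chosen only for this example. The default stages assume the `umap-learn` and `hdbscan` packages are installed.

```python
def run_pipeline(vectors, reducer, clusterer):
    """Reduce embedding vectors to 2D coordinates, then cluster those coordinates.

    `reducer` needs a fit_transform() method and `clusterer` needs a
    fit_predict() method, so any compatible implementation can be swapped in.
    """
    coords = reducer.fit_transform(vectors)   # (n_samples, 2) map coordinates
    labels = clusterer.fit_predict(coords)    # one cluster id per point
    return coords, labels


def default_stages(min_cluster_size=15, random_state=42):
    """Build UMAP and HDBSCAN stages (illustrative parameter values only)."""
    # Imported lazily so run_pipeline() stays usable with other stage objects.
    import umap      # provided by the umap-learn package
    import hdbscan   # provided by the hdbscan package

    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=random_state)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return reducer, clusterer


# Usage, given `vectors` as an (n_samples, n_dims) array of embeddings:
#   reducer, clusterer = default_stages()
#   coords, labels = run_pipeline(vectors, reducer, clusterer)
```

Because the stages are passed in as objects, a different UMAP implementation with the same `fit_transform` interface could be substituted without touching `run_pipeline`.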

Note that although the Tutte Institute folks are the ones who did all the original work, they did release their work as permissively licensed open source, and so others have since contributed. For example, the original implementation of UMAP is distributed as the package umap-learn. Subsequently, as part of their RAPIDS initiative, NVIDIA banged out an implementation of UMAP that runs on their hardware, even claiming speed-ups as high as 300x over umap-learn (300 times faster, not 300 percent faster!). That code is licensed under the Apache 2.0 license so, of course, the King Tutte Scrolls include implementations which use NVIDIA's version. This is not the only instance where non-Tutte Institute code is used herein. There is also TorchDR, and the list goes on. As long as the code is high quality and permissively licensed (Apache, MIT, BSD, etc.), it is fodder for the King Tutte Scrolls.

Use cases

These King Tutte Scrolls have multiple envisioned use cases.

Self-study

A solo human coder can use the King Tutte Scrolls for self-study. The content (code and docs previously written by others) has been curated and commentary has been included.

This repo will include as a submodule the Jupyter notebooks of a related repo, the King Tutte Datamapping Codex. Most of those notebooks are not original work from Connoiter. They have been modified from the originals as needed to make them work out-of-the-box on Google's Colab.

Experimenting with and modifying the code on Colab is a good place to start. A pre-rendered build of the King Tutte Datamapping Codex Jupyter Book is hosted on connoiter.com.

Workshops

One of the deliverables of this repo is a Jupyter Book (v2) known as the King Tutte Pipelines Jupyter Book. For conference workshop contexts, it includes Jupyter notebooks guaranteed to run on Colab. Most of that code was previously written elsewhere, but it has been tweaked as needed to ensure it all works out of the box on Google's Colab service. The same goes for the documentation: most was written elsewhere and, as needed, simply converted from reStructuredText (for Sphinx) to MyST (for Jupyter Book 2).

For workshops, a Colab-based set-up:

  • is free
  • includes optional free GPUs, which King Tutte tools know how to use
  • involves no set-up hassles beyond asking for a GPU runtime
  • scales to large workshops

So, a workshop can quickly get started, and after the workshop is completed, attendees can continue to hack on their own versions of the notebooks they were working on at the workshop, which should help diffuse these datamap innovations.

Agent Skill

Simon Willison's definition of an agent is: "An LLM agent runs tools in a loop to achieve a goal." That is the definition used herein. In October of 2025, Anthropic defined Agent Skills, which are a type of tool for agents to use.

An agent can immediately make use of this repo simply by telling it to check out connoiter.com. It should find, via the llms.txt file on connoiter.com, the content of this repo built as an Agent Skill available as a sub-tree of HTTP Collections on connoiter.com.

The same Agent Skill just mentioned is also available packaged as ZIP files. See connoiter.com for pre-built Agent Skill distros. For more on how to build and deploy the same Agent Skill, see the Building section later in this document.

Tut juxt Tutte

“Humor can be dissected, as a frog can, but the thing dies in the process” ― E.B. White

The Tutte Institute

Tutte's Cambridge undergraduate portrait

The Tutte Institute for Mathematics and Computing (TIMC) is a Canadian government-funded research institute, named after W. T. Tutte. Bill Tutte was a British code breaker and mathematician. During WW2, he worked alongside the likes of Alan Turing, cracking the Nazis' encrypted comms at Bletchley Park. Later he became a Canadian citizen.

Notably, Tutte worked to crack the German High Command communications code known by the Allies as “Fish,” an even tougher nut to crack than the more well-known Enigma machine. Feel free to nerd out on tools he developed such as the Δ and double-delta methods, wherein the hacker digitally simulated mechanical wheel-based crypto communication boxes such as the Enigma machine. It can be thought of as an in silico picking of physical locks by tuning in (via XOR filters on bit streams) to their encrypted digital broadcasts.

The reason why this work was more impressive than cracking Enigma is that during the war the Allies never captured a Lorenz machine (which generated Fish code). In contrast, for Enigma they did acquire a machine, including its code books, which proved quite handy for reverse engineering it. Without a captured machine, all one has to go on is intercepted ciphertext broadcasts. :(
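For the curious, the differencing (Δ) idea mentioned above can be shown with a toy sketch. It assumes a single stream of bits and made-up values, whereas the real attack worked statistically on multi-channel teleprinter traffic. The point is only that XOR-ing each bit with its neighbor commutes with XOR encryption, so the Δ of the ciphertext equals the Δ of the plaintext XOR the Δ of the key. The function names are hypothetical.

```python
def xor_streams(a, b):
    """XOR two equal-length bit streams element by element."""
    return [x ^ y for x, y in zip(a, b)]


def delta(bits):
    """XOR each bit with its successor; the result is one bit shorter."""
    return [bits[i] ^ bits[i + 1] for i in range(len(bits) - 1)]


# Made-up example streams, for illustration only.
plain = [1, 0, 0, 1, 1, 0, 1, 0]
key = [0, 1, 1, 1, 0, 0, 1, 1]
cipher = xor_streams(plain, key)

# Differencing the ciphertext gives the same stream as differencing
# the plaintext and the key separately and then XOR-ing the results.
assert delta(cipher) == xor_streams(delta(plain), delta(key))
```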

Perhaps the work of his that is closest to this datamap technology is his 1963 paper, How to Draw a Graph. But that is just a tangential curiosity, since these King Tutte pipelines use UMAP, which is a force-directed method (a.k.a. spring embedder), whereas Bill used different methods to plot a 2D map of datapoints.

King Tut

In ancient Egypt during the late Eighteenth Dynasty, King Tutankhamun, also known as King Tut (not Tutte), was an Egyptian pharaoh who ruled around 1332 to 1323 BC. King Tut has been immortalized in song, as can be seen on YouTube.

King Tut on SNL, 1978

King Tutte

Note, although the core software and documentation in this repo were developed by the Tutte Institute, the term "King Tutte" and this repo are ideas out of the start-up, Connoiter. There is no relation between the two organizations.

Blame Connoiter for the lighthearted nature of this project. In contrast, the Tutte Institute is a serious Canadian government research institute whose technology was partially developed for use by Five Eyes nations for cyberdefense purposes.

As to the spelling of "datamap" versus the (currently) more common "data map," that too was a decision made by Connoiter. If "dataset" and "datapoint" can be single words, why not "datamap"?

Repo structure

This repo is an assemblage of multiple repos and variants thereof. It is a corpus for use when building datamapping software. Separate repos are included as git submodules if possible.

Very little of the content herein is original work by Connoiter. The value is mostly in having a single source to go to when datamapping. This value is most clearly demonstrated by the packaging of this information into an Agent Skill (BTW, "Claude Skill" is the name that

  • External as-is repos as submodules
    • KTS makes no changes to the content of the repos
    • Goal is simply to one-stop-shop agglomerate relevant info
  • For Colab-based workshops, some Jupyter notebooks from the (permissively licensed) external as-is repos have been copied and modified, and packaged as a Jupyter Book (whose Jupyter notebooks just so happen to run out-of-the-box on Colab). In an ideal world, there would be only one version, but in Colab Google has done some "embrace and extend" work that requires separate variants of the notebooks. C'est la vie.

Building

Or equivalently use gitingest; connoiter.com simply hosts the pre-rendered results of running this repo through gitingest.

One convenient deploy option is to run individual Jupyter notebooks on Colab. See the list of Colab-ready notebooks in this repo.

Bonus: it would be trivial to run this repo through gitingest and feed the markdown-based output to a transformer-based LLM programming tool, agentic or otherwise.

With this repo in its context window, such a tool should be able to crank out Python code to make datamaps, which are web UI (HTML, JS, etc.) widgets that run live inside notebooks (Jupyter and Marimo) or can be exported as standalone static HTML web apps that can be run in browsers later.
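As a hedged sketch of the kind of Python such a tool might produce, the snippet below exports a datamap to a standalone HTML file. It assumes the `datamapplot` and `numpy` packages are installed and that 2D coordinates and integer cluster ids already exist (for example, from a UMAP plus HDBSCAN run). The helper names, the placeholder "Cluster N" labels, and the output filename are hypothetical.

```python
def cluster_ids_to_names(cluster_ids, noise_label="Unlabelled"):
    """Turn integer cluster ids into display names; negative ids mean noise."""
    return [noise_label if int(c) < 0 else f"Cluster {int(c)}" for c in cluster_ids]


def export_datamap(coords, cluster_ids, out_path="datamap.html", hover_text=None):
    """Write an interactive datamap to a standalone HTML file and return its path."""
    # Imported lazily so the naming helper above works without these packages.
    import numpy as np
    import datamapplot

    names = np.asarray(cluster_ids_to_names(cluster_ids))
    plot = datamapplot.create_interactive_plot(
        np.asarray(coords),
        names,
        hover_text=hover_text,      # optional per-point strings shown on hover
        noise_label="Unlabelled",   # must match the name used for noise points
    )
    plot.save(out_path)
    return out_path


# Usage, given `coords` (n_samples x 2) and `labels` from a clustering step:
#   export_datamap(coords, labels, "my_datamap.html")
```

In a notebook, the object returned by `create_interactive_plot` can also be displayed inline instead of saved.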

For convenience, a pre-built run of this repo through gitingest is available at:

https://connoiter.com/king_tutte_scrolls

Or just tell your coding LLM to check out connoiter.com, and it will find the King Tutte Scrolls via the site's llms.txt file.
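To make the llms.txt discovery step concrete, here is a minimal sketch of how a tool might pull links out of an llms.txt file. It assumes the common llms.txt layout of markdown "## " section headings followed by bullet lists of links. The sample text is hypothetical and is not the actual content of connoiter.com's llms.txt; only the URL is taken from this document.

```python
import re

# Matches bullet lines such as "- [Title](https://example.com): optional note"
LINK_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Return {section heading: [(title, url, note), ...]} from llms.txt text."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                note = (match["note"] or "").strip()
                sections[current].append((match["title"], match["url"], note))
    return sections


# Hypothetical sample, for illustration only.
SAMPLE = """# Connoiter

> Hypothetical summary line.

## Agent Skills

- [King Tutte Scrolls](https://connoiter.com/king_tutte_scrolls): Datamap how-to corpus
"""

links = parse_llms_txt(SAMPLE)
```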

Contributing

PRs welcome! We use GitHub Flow for coordinating. Fork and PR. Follow GitHub's instructions.

Finally, the academic types such as the researchers at the Tutte Institute live and die by citations so giving thanks via a citation is the best bang for buck way of contributing. See specific repos for how to cite. For example:

About

A datamap SDK containing permissively licensed code and docs, packaged as an Agent Skill
