eyeseast/llm-documentcloud

llm-documentcloud

LLM integrations for DocumentCloud

Installation

Install this plugin in the same environment as LLM.

llm install llm-documentcloud

Set your DocumentCloud credentials as environment variables, for example in your shell profile file:

export DC_USERNAME=""
export DC_PASSWORD=""

Usage

Use the dc: fragment to load documents hosted on DocumentCloud.

# run a basic prompt
llm -f dc:71072 'Summarize this document'

# extract tabular data
llm -f dc:25507045 'Extract the tables in this document as CSV'

Documents can be fetched by ID alone, by ID and slug, or by full URL. The following are equivalent:

llm -f dc:25507045 'Extract the tables in this document as CSV'
llm -f dc:25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico 'Extract the tables in this document as CSV'
llm -f dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/ 'Extract the tables in this document as CSV'

In each case, a DocumentCloud API client will fetch the document's full text and store it as a fragment for llm.
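
The equivalence of the three forms can be illustrated with a minimal sketch, assuming the numeric ID is the only part needed to identify a document. The function name `document_id` is hypothetical and this is not the plugin's actual code; it takes whatever follows the `dc:` prefix.

```python
import re
from urllib.parse import urlsplit


def document_id(argument: str) -> int:
    """Return the numeric ID from an ID, an ID-slug, or a full URL.

    Hypothetical helper for illustration only, not the plugin's parser.
    """
    # For a full URL, only the path matters; otherwise use the argument as-is.
    path = urlsplit(argument).path if "://" in argument else argument
    segments = [part for part in path.split("/") if part]
    if not segments:
        raise ValueError(f"No document ID found in {argument!r}")
    # The last path segment is either "ID" or "ID-slug".
    match = re.match(r"(\d+)(?:-|$)", segments[-1])
    if match is None:
        raise ValueError(f"No document ID found in {argument!r}")
    return int(match.group(1))
```

Under that assumption, all three arguments shown above reduce to the same ID, 25507045.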

Using file attachments instead of text

DocumentCloud stores each document in several forms: the original PDF file, its extracted text, and an image of each page. You can feed any of these into llm by adding a mode parameter to the URL:

# use the original PDF as an attachment
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=pdf'

# use each page image as an attachment
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=images'

# this is the same as mode=images, since "grid" is the mode name used on the DocumentCloud frontend
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=grid'

# these are all equivalent and will extract full text
llm -f dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=document'
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=text'
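
The mode aliases above can be summarized as a small lookup. This is a sketch under stated assumptions: `MODE_ALIASES` and `resolve_mode` are hypothetical names, the canonical labels are chosen for illustration, and rejecting unknown modes is an assumption rather than documented plugin behavior.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical alias table built from the modes listed above.
MODE_ALIASES = {
    "document": "text",  # full text (the default)
    "text": "text",
    "images": "images",  # one image attachment per page
    "grid": "images",    # name used on the DocumentCloud frontend
    "pdf": "pdf",        # original PDF as an attachment
}


def resolve_mode(argument: str) -> str:
    """Map the ?mode= value of a dc: argument to a canonical mode."""
    query = parse_qs(urlsplit(argument).query)
    # With no mode parameter, fall back to full text.
    requested = query.get("mode", ["document"])[0]
    if requested not in MODE_ALIASES:
        # Assumption: unknown modes are an error in this sketch.
        raise ValueError(f"Unknown mode: {requested!r}")
    return MODE_ALIASES[requested]
```

A bare ID such as `25507045` has no query string, so it resolves to the text mode in this sketch.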

Getting specific pages

Sometimes you only want one page. DocumentCloud can link to specific pages, and those URLs can be used here:

# extract text, but only for page 2
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=document#document/p2'

Note that pages are 1-indexed. You can also get images:

# attach the image for page 2
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=images#document/p2'

There isn't a way to get a single page out of a PDF, so with mode=pdf the page is set to None and any page reference in the URL is ignored.
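
The page rules can be sketched as follows, assuming the page number always appears in the URL fragment as `document/pN`, as in the examples above. `resolve_page` is a hypothetical name and this is not the plugin's actual implementation.

```python
import re
from urllib.parse import parse_qs, urlsplit


def resolve_page(argument: str):
    """Return the 1-indexed page number from a dc: URL, or None.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(argument)
    mode = parse_qs(parts.query).get("mode", ["document"])[0]
    # A single page cannot be taken from a PDF, so the page is dropped.
    if mode == "pdf":
        return None
    # Assumed fragment shape: "document/p2" means page 2.
    match = re.fullmatch(r"document/p(\d+)", parts.fragment)
    return int(match.group(1)) if match else None
```

A URL without a page fragment yields None here, which this sketch treats as meaning the whole document.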

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment using uv:

cd llm-documentcloud
uv sync

To install both the runtime and test dependencies, include the test extra:

uv sync --extra test

To run the tests:

uv run pytest
