Full-history deduplication with encrypted content #18

@djdarcy

Problem

teeclip currently deduplicates only against the most recent entry:

# history.py:181-186
last = conn.execute(
    "SELECT hash FROM clips ORDER BY id DESC LIMIT 1"
).fetchone()
if last and last["hash"] == content_hash:
    return None

This means if a user copies "hello", then "world", then "hello" again, three entries are stored — the duplicate "hello" is not caught because it's not consecutive. An idx_clips_hash index exists on the hash column but no query uses it.

For unencrypted content, full-history dedup is straightforward — WHERE hash = ? with the SHA-256 digest. But with encryption enabled, the hash column stores HMAC-SHA-256 keyed with the encryption key. This creates two complications:

  1. Key rotation: If the encryption key changes (user runs --decrypt then --encrypt with a different key, or switches auth methods), all existing HMACs become incomparable to new ones. A WHERE hash = ? lookup would find no matches even for identical plaintext.

  2. No way to verify without decryption: Two entries with different HMACs cannot be confirmed as the same plaintext unless both are decrypted. With a single key, HMAC comparison avoids decryption, but it only works within one key epoch.

The result is that encrypted history accumulates duplicates that plaintext history would not.
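
To make the key-epoch problem concrete, here is a minimal sketch using only the standard library. The helper names plain_hash and keyed_hash and the placeholder keys are hypothetical, not teeclip's actual functions; the only assumption taken from this issue is that unencrypted clips use SHA-256 and encrypted clips use HMAC-SHA-256 keyed with the encryption key.

```python
import hashlib
import hmac


def plain_hash(data: bytes) -> str:
    """SHA-256 hex digest, as described for unencrypted clips."""
    return hashlib.sha256(data).hexdigest()


def keyed_hash(key: bytes, data: bytes) -> str:
    """HMAC-SHA-256 hex digest, as described for encrypted clips."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


# Placeholder key material (hypothetical); real keys come from the auth method.
old_key = b"k" * 32  # first key epoch
new_key = b"n" * 32  # after rotation

# Same plaintext, same key: digests match, so WHERE hash = ? finds the row.
same_epoch_match = keyed_hash(old_key, b"hello") == keyed_hash(old_key, b"hello")

# Same plaintext, different key: digests differ, so the lookup finds nothing.
cross_epoch_match = keyed_hash(old_key, b"hello") == keyed_hash(new_key, b"hello")

print(same_epoch_match, cross_epoch_match)  # True False
```

The same reasoning explains why a plain SHA-256 digest and an HMAC of the same plaintext never match each other either.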

Proposed solution

Implement full-history deduplication that works correctly across both encrypted and unencrypted content, with awareness of key epochs.

Behavior for unencrypted clips (simple case):

SELECT id FROM clips WHERE hash = ? AND encrypted = 0 LIMIT 1

If a match exists, skip the insert (or optionally update the timestamp to "bump" the existing entry).

Behavior for encrypted clips (HMAC-keyed):

Within the same key epoch, HMAC comparison works:

SELECT id FROM clips WHERE hash = ? AND encrypted = 1 LIMIT 1

Same plaintext + same key = same HMAC, so this catches duplicates without decryption.
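
The two lookups above could share one helper. This is only a sketch against an assumed minimal schema (the real clips table has more columns), and find_duplicate is a hypothetical name, not an existing function in history.py.

```python
import sqlite3


def find_duplicate(conn, content_hash, encrypted):
    """Return the id of a clip with this hash and encryption state, or None."""
    row = conn.execute(
        "SELECT id FROM clips WHERE hash = ? AND encrypted = ? LIMIT 1",
        (content_hash, 1 if encrypted else 0),
    ).fetchone()
    return row["id"] if row else None


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Assumed schema for illustration only.
conn.execute(
    "CREATE TABLE clips (id INTEGER PRIMARY KEY, hash TEXT, encrypted INTEGER)"
)
conn.execute("CREATE INDEX idx_clips_hash ON clips(hash)")
conn.executemany(
    "INSERT INTO clips (hash, encrypted) VALUES (?, ?)",
    [("aaa", 0), ("bbb", 0), ("ccc", 1)],
)

plain_hit = find_duplicate(conn, "aaa", encrypted=False)
wrong_state = find_duplicate(conn, "aaa", encrypted=True)
encrypted_hit = find_duplicate(conn, "ccc", encrypted=True)
print(plain_hit, wrong_state, encrypted_hit)  # 1 None 3
```

Filtering on the encrypted column keeps a SHA-256 digest from ever being compared against an HMAC.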

Key rotation scenario:

When encrypt_history() or decrypt_history() runs, all hashes are recomputed with the new key (or reverted to SHA-256). Any HMAC computed under the old key is stale after rotation, but the recomputed and newly written hashes are consistent going forward. No special handling is needed: dedup works naturally within each epoch, and cross-epoch duplicates are an acceptable edge case (the alternative is decrypting everything to compare, which defeats the purpose).

Deduplication strategy options

Option A: Most-recent only (current behavior)

  • Skip insert if hash matches the single most recent entry
  • Index idx_clips_hash is unused
  • Simple, but allows non-consecutive duplicates to accumulate

Option B: Full-history exact match

existing = conn.execute(
    "SELECT id FROM clips WHERE hash = ? LIMIT 1",
    (content_hash,)
).fetchone()
if existing:
    return None  # or bump timestamp

  • Uses idx_clips_hash for fast lookup
  • Catches all duplicates within a key epoch
  • Decision needed: skip silently, or "bump" the existing entry's timestamp?

Option C: Full-history with bump

existing = conn.execute(
    "SELECT id FROM clips WHERE hash = ? LIMIT 1",
    (content_hash,)
).fetchone()
if existing:
    conn.execute(
        "UPDATE clips SET timestamp = ? WHERE id = ?",
        (timestamp, existing["id"])
    )
    conn.commit()
    return None

  • Same as B but updates the timestamp so the entry floats to the top of --list
  • Better UX: re-copying something makes it "recent" again without creating a duplicate
  • Slightly more complex
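
For reference, a self-contained sketch of the bump behavior, run against an in-memory database. The schema, the save_with_bump name, and the use of plain SHA-256 are assumptions for illustration; it also assumes --list orders by timestamp, which would need to be confirmed against the real query.

```python
import hashlib
import sqlite3
import time


def save_with_bump(conn, content, timestamp=None):
    """Insert a clip, or bump the timestamp of an identical existing clip.

    Returns the new row id, or None when an existing entry was bumped.
    """
    timestamp = time.time() if timestamp is None else timestamp
    content_hash = hashlib.sha256(content).hexdigest()
    existing = conn.execute(
        "SELECT id FROM clips WHERE hash = ? LIMIT 1", (content_hash,)
    ).fetchone()
    if existing:
        # Duplicate anywhere in history: refresh its timestamp, add no row.
        conn.execute(
            "UPDATE clips SET timestamp = ? WHERE id = ?",
            (timestamp, existing["id"]),
        )
        conn.commit()
        return None
    cur = conn.execute(
        "INSERT INTO clips (hash, content, timestamp) VALUES (?, ?, ?)",
        (content_hash, content, timestamp),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Assumed schema for illustration only.
conn.execute(
    "CREATE TABLE clips (id INTEGER PRIMARY KEY, hash TEXT, content BLOB, timestamp REAL)"
)
conn.execute("CREATE INDEX idx_clips_hash ON clips(hash)")

# Copy "hello", then "world", then "hello" again.
for ts, text in enumerate([b"hello", b"world", b"hello"], start=1):
    save_with_bump(conn, text, timestamp=float(ts))

rows = conn.execute(
    "SELECT content, timestamp FROM clips ORDER BY timestamp DESC"
).fetchall()
listing = [(bytes(r["content"]), r["timestamp"]) for r in rows]
print(listing)  # [(b'hello', 3.0), (b'world', 2.0)]
```

Only two rows are stored, and the re-copied "hello" sorts first when ordering by timestamp.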

Option D: Configurable dedup scope

[history]
dedup = "last"      # "last" (current), "all", or "none"

  • Lets users choose their preference
  • "none" is useful for clipboard audit trails where every copy matters
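
One possible shape for the dispatch on that setting, as a sketch only. is_duplicate and VALID_SCOPES are hypothetical names, the schema is reduced to the columns the check needs, and how the value is read from the config file is left out.

```python
import sqlite3

VALID_SCOPES = ("last", "all", "none")


def is_duplicate(conn, content_hash, scope="last"):
    """Decide whether a save should be treated as a duplicate for this scope."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown dedup scope: {scope!r}")
    if scope == "none":
        # Audit-trail mode: every copy is stored.
        return False
    if scope == "last":
        # Current behavior: compare against the most recent entry only.
        row = conn.execute(
            "SELECT hash FROM clips ORDER BY id DESC LIMIT 1"
        ).fetchone()
        return row is not None and row[0] == content_hash
    # scope == "all": full-history lookup, served by the index on hash.
    row = conn.execute(
        "SELECT 1 FROM clips WHERE hash = ? LIMIT 1", (content_hash,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
# Assumed schema for illustration only.
conn.execute("CREATE TABLE clips (id INTEGER PRIMARY KEY, hash TEXT)")
conn.execute("CREATE INDEX idx_clips_hash ON clips(hash)")
conn.executemany(
    "INSERT INTO clips (hash) VALUES (?)", [("h-hello",), ("h-world",)]
)

# "hello" was copied earlier, but "world" is the most recent entry.
results = {scope: is_duplicate(conn, "h-hello", scope) for scope in VALID_SCOPES}
print(results)  # {'last': False, 'all': True, 'none': False}
```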

Design considerations

  • HMAC consistency within key epoch: As long as the encryption key doesn't change between saves, HMAC-SHA-256 is deterministic — same plaintext produces the same hash. Full-history dedup works without decryption.
  • Cross-epoch duplicates: After key rotation, old HMACs are incomparable to new ones. Accepting this is better than requiring full decryption for dedup.
  • Bump vs skip: "Bump" (Option C) is more useful — if you re-copy a URL you copied yesterday, you probably want it at the top of --list without a duplicate entry. But some users may want strict chronological history.
  • Index justification: idx_clips_hash currently exists but is unused. Full-history dedup would give it a purpose: the lookup becomes an O(log n) index search rather than the O(n) table scan that an unindexed WHERE hash = ? would need.
  • Performance: An indexed lookup on hash is fast. Even with 10,000 entries, it adds negligible overhead to save().
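
Whether the dedup query actually uses idx_clips_hash can be checked with SQLite's EXPLAIN QUERY PLAN. The sketch below uses an assumed minimal schema; the exact wording of the plan text varies between SQLite versions, so it only checks that the index name appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema for illustration only.
conn.execute(
    "CREATE TABLE clips (id INTEGER PRIMARY KEY, hash TEXT, encrypted INTEGER)"
)
conn.execute("CREATE INDEX idx_clips_hash ON clips(hash)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM clips WHERE hash = ? LIMIT 1",
    ("abc",),
).fetchall()

# The human-readable detail is the last column of each plan row.
plan_text = " ".join(str(row[-1]) for row in plan)
uses_index = "idx_clips_hash" in plan_text
print(plan_text)
print("uses idx_clips_hash:", uses_index)
```

A check like this could also back the "idx_clips_hash is used by the dedup query" acceptance criterion in a test.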

Acceptance criteria

  • Full-history dedup catches non-consecutive duplicates
  • Dedup works without decryption for encrypted content (HMAC comparison)
  • idx_clips_hash is used by the dedup query
  • Key rotation does not break dedup (new key = new HMACs, dedup works within new epoch)
  • Decision made on skip vs bump behavior
  • Existing consecutive-dedup tests still pass
  • New tests for non-consecutive dedup (both encrypted and unencrypted)
  • --list ordering is correct after bump (if bump is chosen)

Related issues

Analysis

See notes/cli/2026-02-17__20-47-29__both_encrypted-metadata-architecture.md for the column-by-column encryption audit that surfaced this gap.

Labels: enhancement (New feature or request), security (Security, encryption, and data protection)
