Splitting markdown-formatted outlines in an odd way #14

nick-youngblut · 2025-01-01T22:30:29Z

My markdown doc is structured as:

# header1

## header2

Some text

## header2 

Some more text


### Step 0: this is pre-planning step

* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

etc...

My code:

import semchunk

# `text` holds the Markdown document shown above.
chunker = semchunk.chunkerify('gpt-4', chunk_size = 2000)
chunks = chunker(text)

I would expect the chunker to split by headers, when possible; however, the chunks generally END with a header.

An example chunk:

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

...instead of:

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

Any idea why this is happening?

umarbutler · 2025-01-02T01:00:26Z

I’m assuming the text in question has two newlines separating headers from succeeding content? Like this:

# Header

Content.

Instead of like this:

# Header
Content.

If that’s the case, then under the hood semchunk is splitting your text at each occurrence of two newlines into:

[
    "# Header 1", "Content 1.",
    "# Header 2", "Content 2."
]

And then when semchunk goes to rejoin the splits to form new chunks meeting your desired chunk size, you might end up with:

[
    "# Header 1\n\nContent 1.\n\n# Header 2",
    "Content 2."
]

semchunk heuristically leverages the fact that normal English text tends to use newlines and other delimiters like punctuation to indicate varying degrees of semantic separation, but when it comes to Markdown, specialised syntax might take the place of those patterns.
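To make the split-then-rejoin behaviour concrete, here is a minimal sketch of the idea. It is not semchunk's actual implementation: the `naive_chunk` function is hypothetical, it measures chunk size in characters rather than tokens, and it only splits on blank lines instead of falling back through finer delimiters.

```python
def naive_chunk(text: str, chunk_size: int) -> list[str]:
    """Split on blank lines, then greedily rejoin splits up to chunk_size characters.

    Hypothetical simplification for illustration only; semchunk itself counts
    tokens and uses a richer set of splitters.
    """
    splits = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for split in splits:
        # Try to append the next split to the chunk being built.
        candidate = split if not current else current + "\n\n" + split
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The chunk is full: emit it and start a new one with this split.
            chunks.append(current)
            current = split
    if current:
        chunks.append(current)
    return chunks


text = "# Header 1\n\nContent 1.\n\n# Header 2\n\nContent 2."
chunks = naive_chunk(text, chunk_size=34)

for chunk in chunks:
    print(repr(chunk))
```

With a budget of 34 characters, the first three splits fit exactly, so the second header is pulled into the first chunk and separated from its own content, which is the pattern described above.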

I myself have run into this problem with Markdown. There's an easy solution, however.

Before passing your text to semchunk, you can preprocess it with this code:

import re

# Remove empty lines after Markdown headings.
text = re.sub(r'(^#+[^\n]+\n)\n', r'\1', text, flags = re.MULTILINE)

With that code, your original text ends up looking like this:

# header1
## header2
Some text

## header2 
Some more text


### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step
etc...

Which then produces much nicer chunks:

import semchunk

chunker = semchunk.chunkerify('gpt-4', chunk_size = 100)
chunks = chunker(text)

for chunk in chunks:
    print(chunk)
    print('-'*80)

# header1
## header2
Some text

## header2 
Some more text
--------------------------------------------------------------------------------
### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list
--------------------------------------------------------------------------------
### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step
etc...
--------------------------------------------------------------------------------

Given that I can see an opportunity to improve Markdown chunking even further by introducing some new specialised rules, I'm going to leave this issue open for now and work on adding an extra markdown argument that can be used to invoke those rules 😊

nick-youngblut · 2025-01-02T16:05:09Z

Thanks! I'll give text = re.sub(r'(^#+[^\n]+\n)\n+', r'\1', text, flags = re.MULTILINE) a try.
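Note that this variant ends in `\n+` rather than `\n`, so it removes every blank line after a heading instead of just one. A small sketch of the difference, using a made-up sample string rather than the real document:

```python
import re

# Made-up sample: the first heading is followed by two blank lines, the second by one.
sample = "## header2\n\n\nSome text\n\n### Step 1\n\nBody"

# Original pattern: removes exactly one blank line after each heading.
one_blank_line = re.sub(r'(^#+[^\n]+\n)\n', r'\1', sample, flags=re.MULTILINE)

# Variant with `\n+`: removes all consecutive blank lines after each heading.
all_blank_lines = re.sub(r'(^#+[^\n]+\n)\n+', r'\1', sample, flags=re.MULTILINE)

print(repr(one_blank_line))
print(repr(all_blank_lines))
```

If headings are only ever followed by a single blank line, both patterns give the same result; the `\n+` form matters only when several blank lines follow a heading.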

nick-youngblut changed the title ~~Splitting markdown in an odd way~~ Splitting markdown-formatted outlines in an odd way Jan 1, 2025
