Splitting markdown-formatted outlines in an odd way #14

Open
nick-youngblut opened this issue Jan 1, 2025 · 2 comments

@nick-youngblut

My markdown doc is structured as:

# header1

## header2

Some text

## header2 

Some more text


### Step 0: this is pre-planning step

* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

etc...

My code:

import semchunk
chunker = semchunk.chunkerify('gpt-4', chunk_size = 2000)
chunker(text)

I would expect the chunker to split by headers, when possible; however, the chunks generally END with a header.

An example chunk:

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

...instead of:

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

Any idea why this is happening?

nick-youngblut changed the title from "Splitting markdown in an odd way" to "Splitting markdown-formatted outlines in an odd way" on Jan 1, 2025

@umarbutler
Collaborator

umarbutler commented Jan 2, 2025

I’m assuming the text in question has two newlines separating headers from succeeding content? Like this:

# Header

Content.

Instead of like this:

# Header
Content.

If that’s the case, then what is happening under the hood is that semchunk splits your text at every occurrence of two newlines, giving:

[
    "# Header 1", "Content 1.", "# Header 2", "Content 2."
]

And then when semchunk goes to rejoin the splits to form new chunks meeting your desired chunk size, you might end up with:

[
    "# Header 1\n\nContent 1.\n\n# Header 2", "Content 2."
]

semchunk heuristically leverages the fact that normal English text tends to use newlines and other delimiters like punctuation to indicate varying degrees of semantic separation, but when it comes to Markdown, specialised syntax might take the place of those patterns.
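
To make the split-then-rejoin behaviour concrete, here is a toy sketch of it. This is not semchunk's actual implementation: `toy_chunk` is a hypothetical helper that counts whitespace-separated words instead of real tokens and merges splits greedily, but it reproduces the "chunk ends with a header" effect on the example above.

```python
def toy_chunk(text, chunk_size):
    """Toy illustration only; NOT how semchunk is implemented.

    Splits on blank lines, then greedily rejoins the pieces while the
    running chunk stays within chunk_size whitespace-separated words.
    """
    splits = text.split("\n\n")
    chunks, current = [], []

    for split in splits:
        candidate = "\n\n".join(current + [split])
        if current and len(candidate.split()) > chunk_size:
            # Adding this split would overflow, so close the current chunk.
            chunks.append("\n\n".join(current))
            current = [split]
        else:
            current.append(split)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "# Header 1\n\nContent 1.\n\n# Header 2\n\nContent 2."
chunks = toy_chunk(text, chunk_size=8)

for chunk in chunks:
    print(repr(chunk))
```

Because "# Header 2" is just another split as far as the merger is concerned, it gets packed into the first chunk whenever there is room for it, and its content is left stranded in the next one.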

I myself have run into this problem with Markdown. There's an easy solution, however.

Before passing your text to semchunk, you can preprocess it with this code:

import re

# Remove empty lines after Markdown headings.
text = re.sub(r'(^#+[^\n]+\n)\n', r'\1', text, flags = re.MULTILINE)

With that code, your original text ends up looking like this:

# header1
## header2
Some text

## header2 
Some more text


### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step
etc...

Which then produces much nicer chunks:

import semchunk

chunker = semchunk.chunkerify('gpt-4', chunk_size = 100)
chunks = chunker(text)

for chunk in chunks:
    print(chunk)
    print('-'*80)

This prints:

# header1
## header2
Some text

## header2 
Some more text
--------------------------------------------------------------------------------
### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list
--------------------------------------------------------------------------------
### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step
etc...
--------------------------------------------------------------------------------
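
If you want to check your own chunks for the problem described in this issue, a small sanity check along these lines might help. `ends_with_heading` is a hypothetical helper, not part of semchunk, and it assumes ATX-style headings (one to six `#` characters followed by whitespace) starting at the beginning of a line.

```python
import re

# Assumption: headings are ATX-style, e.g. "### Step 1", with no leading indent.
HEADING = re.compile(r'^#{1,6}\s')


def ends_with_heading(chunk: str) -> bool:
    """Return True if the last non-empty line of the chunk is a heading."""
    lines = [line for line in chunk.splitlines() if line.strip()]
    return bool(lines) and HEADING.match(lines[-1]) is not None


# A chunk shaped like the one reported at the top of this issue.
bad = "▶ the top line\n    1. numbered list\n\n### Step 2: the second step"
# A chunk shaped like the expected result.
good = "### Step 1: the first actual step\n▶ the top line\n    1. numbered list"

print(ends_with_heading(bad))
print(ends_with_heading(good))
```

Running something like `[c for c in chunks if ends_with_heading(c)]` over the chunker's output would then list any chunks that still have a dangling header.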

Given that I can see an opportunity to improve Markdown chunking even further by introducing some new specialised rules, I'm going to leave this issue open for now and work on adding an extra markdown argument that can be used to invoke those rules 😊

@nick-youngblut
Author

Thanks! I'll give text = re.sub(r'(^#+[^\n]+\n)\n+', r'\1', text, flags = re.MULTILINE) a try.
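
For anyone comparing the two patterns in this thread: the only difference is the trailing `\n` versus `\n+`. As far as I can tell, the first removes exactly one empty line after a heading, while the second removes every consecutive empty line after it. A quick sketch on a made-up sample string:

```python
import re

# Made-up sample: two empty lines after the first heading, one after the second.
sample = "# Header\n\n\nContent.\n\n## Sub\n\nMore."

# Pattern from the maintainer's comment: strips a single empty line.
one = re.sub(r'(^#+[^\n]+\n)\n', r'\1', sample, flags=re.MULTILINE)

# Variant with `\n+`: strips all consecutive empty lines after a heading.
many = re.sub(r'(^#+[^\n]+\n)\n+', r'\1', sample, flags=re.MULTILINE)

print(repr(one))
print(repr(many))
```

Note that both patterns match any line starting with `#`, so `#` comment lines inside fenced code blocks in the Markdown would be affected too.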
