Splitting markdown-formatted outlines in an odd way #14
Comments
I'm assuming the text in question has two newlines separating headers from the content that follows? Like this:
# Header

Content.
Instead of like this:
# Header
Content.
If that's the case, then what is happening under the hood is that the text is first split at the double newlines into:
[
"# Header 1", "Content 1.",
"# Header 2", "Content 2."
]
And then, when those splits are merged back together up to the chunk size, you end up with:
[
"# Header 1\n\nContent 1.\n\n# Header 2",
"Content 2."
]
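To make the split-then-merge behaviour described above concrete, here is a minimal sketch. It is not semchunk's actual implementation: `naive_chunk` is a hypothetical helper that splits on blank lines and greedily merges the pieces, and it measures chunk size in characters rather than tokens purely to keep the example self-contained.

```python
def naive_chunk(text: str, chunk_size: int) -> list[str]:
    """Illustrative only: split on blank lines, then greedily merge
    the pieces back together while they fit within chunk_size characters."""
    splits = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for piece in splits:
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + "\n\n" + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "# Header 1\n\nContent 1.\n\n# Header 2\n\nContent 2."
chunks = naive_chunk(text, chunk_size=35)
for chunk in chunks:
    print(repr(chunk))
```

Because a heading and its content are separate pieces, the greedy merge has no reason to keep them together, so the second heading fills the remaining room in the first chunk and its content is left on its own.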
I myself have run into this problem with Markdown. There's an easy solution, however. Before passing your text to semchunk, you can preprocess it with this code:
import re
# Remove empty lines after Markdown headings.
text = re.sub(r'(^#+[^\n]+\n)\n', r'\1', text, flags = re.MULTILINE)
With that code, your original text ends up looking like this:
# header1
## header2
Some text
## header2
Some more text
### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
▶ the next line
▶ the next line
▶ the next line
1. numbered list
1. numbered list
### Step 1: the first actual step
▶ the top line
▶ the next line
▶ the next line
▶ the next line
1. numbered list
1. numbered list
### Step 2: the second step
etc...
Which then produces much nicer chunks:
import semchunk
chunker = semchunk.chunkerify('gpt-4', chunk_size = 100)
chunks = chunker(text)
for chunk in chunks:
    print(chunk)
    print('-'*80)
# header1
## header2
Some text
## header2
Some more text
--------------------------------------------------------------------------------
### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
▶ the next line
▶ the next line
▶ the next line
1. numbered list
1. numbered list
--------------------------------------------------------------------------------
### Step 1: the first actual step
▶ the top line
▶ the next line
▶ the next line
▶ the next line
1. numbered list
1. numbered list
### Step 2: the second step
etc...
--------------------------------------------------------------------------------
Given that I can see an opportunity to improve Markdown chunking even further by introducing some new specialised rules, I'm going to leave this issue open for now and work on adding an extra |
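The preprocessing step suggested earlier in this comment can be checked on a small sample. The sketch below assumes a made-up `sample` string and uses only the standard `re` module. It shows that, once the blank line after each heading is removed, splitting on double newlines no longer separates a heading from its content.

```python
import re

# Hypothetical sample: each heading is followed by a blank line.
sample = "# Header 1\n\nContent 1.\n\n# Header 2\n\nContent 2."

# Remove the empty line after each Markdown heading (same regex as above).
cleaned = re.sub(r'(^#+[^\n]+\n)\n', r'\1', sample, flags=re.MULTILINE)

# Splitting on blank lines now yields heading-plus-content pieces.
pieces = cleaned.split("\n\n")
for piece in pieces:
    print(repr(piece))
```

Note that the pattern removes only a single empty line after each heading. If a document has several blank lines after a heading, the remainder would still act as a split point.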
Thanks! I'll give |
My markdown doc is structured as:
My code:
I would expect the chunker to split by headers, when possible; however, the chunks generally END with a header.
An example chunk:
▶ the top line
▶ the next line
▶ the next line
▶ the next line
1. numbered list
1. numbered list
### Step 2: the second step
...instead of:
### Step 1: the first actual step
▶ the top line
▶ the next line
▶ the next line
▶ the next line
1. numbered list
1. numbered list
Any idea why this is happening?