🐛 fix: fix invalid utf8 character #5732

Merged
merged 6 commits into from
Feb 4, 2025

Conversation

arvinxx
Contributor

@arvinxx arvinxx commented Feb 4, 2025

💻 Change Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 👷 build
  • ⚡️ perf
  • 📝 docs
  • 🔨 chore

🔀 Description of Change

📝 Additional Information

invalid byte sequence for encoding "UTF8": 0x00

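This error occurs because the data contains a byte sequence that PostgreSQL rejects as invalid for UTF-8 text, specifically NULL bytes (\u0000). PostgreSQL does not allow NULL bytes in UTF-8 encoded text data.

The content field contains binary data (starting with \u0002\u0000\u0000\u0002) that is not valid UTF-8 text.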
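To confirm that a given value would trigger this error before it reaches the database, a quick check along the following lines may help. This is only a minimal sketch: `findNullByte` and the sample string are hypothetical and are not part of this PR.

```typescript
// Hypothetical helper (not from this PR): return the index of the first
// NULL character in a string, or -1 if there is none.
function findNullByte(text: string): number {
  return text.indexOf('\u0000');
}

// Sample value shaped like the binary prefix described above.
const sample = '\u0002\u0000\u0000\u0002rest of payload';
const nullIndex = findNullByte(sample);

if (nullIndex !== -1) {
  // PostgreSQL would reject this value with:
  // invalid byte sequence for encoding "UTF8": 0x00
  console.warn(`NULL byte found at index ${nullIndex}`);
}
```

Logging the index makes it easier to trace which part of the extracted content carries the binary data.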

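There are several ways to address this:

  1. Clean the data by removing the illegal characters:
function cleanInvalidUTF8(str: string): string {
  // Remove NULL bytes, which PostgreSQL rejects in text values
  return str.replace(/\u0000/g, '')
           // Also remove other control characters, keeping tab, line feed, and carriage return
           .replace(/[\u0001-\u0008\u000B-\u000C\u000E-\u001F]/g, '');
}

// Usage:
const cleanedContent = cleanInvalidUTF8(data.content);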

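Recommendations:

  1. Check the data source to confirm where this binary data comes from and what it is used for
  2. If the data is erroneous, clean it before importing
  3. Consider adding validation at the application layer so that only valid UTF-8 text reaches the database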
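As a rough illustration of the third recommendation, the cleanup could be applied to every string in a record before it is written. The sketch below is an assumption about how that might look, not the contents of `src/utils/sanitizeUTF8.ts` from this PR; the `Json` type, `INVALID_CHARS`, and `sanitizeDeep` are hypothetical names.

```typescript
// Hypothetical JSON-like value type used only for this sketch.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// NULL plus the other control characters removed in the description above;
// tab (\u0009), line feed (\u000A), and carriage return (\u000D) are kept.
const INVALID_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g;

// Recursively strip the characters above from every string in a value.
function sanitizeDeep(value: Json): Json {
  if (typeof value === 'string') return value.replace(INVALID_CHARS, '');
  if (Array.isArray(value)) return value.map((item) => sanitizeDeep(item));
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = sanitizeDeep(inner);
    }
    return out;
  }
  // Numbers, booleans, and null pass through unchanged.
  return value;
}

// Example: a record whose content starts with the binary prefix seen in the error.
const record = sanitizeDeep({
  content: '\u0002\u0000\u0000\u0002extracted text\n',
  tags: ['a\u0000b'],
  pageCount: 3,
});
```

Sanitizing at one choke point before the insert avoids having to remember the cleanup at each call site, at the cost of silently dropping characters, so logging what was removed may be worth adding.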

vercel bot commented Feb 4, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
lobe-chat-database ✅ Ready (Inspect) Visit Preview Feb 4, 2025 1:59pm
lobe-chat-preview ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 4, 2025 1:59pm

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 4, 2025
@lobehubbot
Member

👍 @arvinxx

Thank you for raising your pull request and contributing to our Community
Please make sure you have followed our contributing guidelines. We will review it as soon as possible.
If you encounter any problems, please feel free to connect with us.

@dosubot dosubot bot added the 📝 Documentation Improvements or additions to documentation | 文档问题 label Feb 4, 2025
Contributor

gru-agent bot commented Feb 4, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail 736220d ✅ Finished Cancelled by auto rebase

Files

File Pull Request
src/libs/langchain/loaders/pdf/index.ts 🛑 Cancelled (Job is canceled by user)
src/utils/sanitizeUTF8.ts 🛑 Cancelled (Job is canceled by user)

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

Contributor

gru-agent bot commented Feb 4, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail 9bcafed ✅ Finished Cancelled by auto rebase

Files

File Pull Request
src/database/repositories/dataImporter/index.ts 🛑 Cancelled (Job is canceled by user)
src/libs/langchain/loaders/pdf/index.ts 🛑 Cancelled (Job is canceled by user)
src/server/routers/async/file.ts 🛑 Cancelled (Job is canceled by user)
src/utils/sanitizeUTF8.ts 🛑 Cancelled (Job is canceled by user)

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Feb 4, 2025
Contributor

gru-agent bot commented Feb 4, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail 133c83f ✅ Finished Cancelled by auto rebase

Files

File Pull Request
src/database/repositories/dataImporter/index.ts 🛑 Cancelled (Job is canceled by user)
src/libs/langchain/loaders/pdf/index.ts 🛑 Cancelled (Job is canceled by user)
src/server/routers/async/file.ts 🛑 Cancelled (Job is canceled by user)
src/utils/sanitizeUTF8.ts 🛑 Cancelled (Job is canceled by user)

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

codecov bot commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.09%. Comparing base (d3d26d7) to head (1ce4b8b).
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##             main    #5732     +/-   ##
=========================================
  Coverage   92.08%   92.09%             
=========================================
  Files         647      648      +1     
  Lines       57897    57913     +16     
  Branches     2712     4273   +1561     
=========================================
+ Hits        53317    53333     +16     
  Misses       4580     4580             
Flag Coverage Δ
app 92.09% <100.00%> (+<0.01%) ⬆️
server 98.01% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@arvinxx arvinxx changed the title from docs/auth to 🐛 fix: fix invalid utf8 character Feb 4, 2025
Contributor

gru-agent bot commented Feb 4, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail 1ce4b8b ✅ Finished

Files

File Pull Request
src/database/repositories/dataImporter/index.ts ❌ Failure (I failed to write the unit tests for the file.)
src/libs/langchain/loaders/pdf/index.ts 🟢 Open #5738
src/server/routers/async/file.ts 🔴 Closed #5737
src/utils/sanitizeUTF8.ts 🚫 Skipped (There's no need to update the test code)

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

@arvinxx arvinxx merged commit 2905cb5 into main Feb 4, 2025
9 of 11 checks passed
@arvinxx arvinxx deleted the docs/auth branch February 4, 2025 13:30
@lobehubbot
Member

❤️ Great PR @arvinxx ❤️

The growth of the project is inseparable from user feedback and contributions. Thanks for your contribution! If you are interested in the LobeHub developer community, please join our Discord and then DM @arvinxx or @canisminor1990. They will invite you to our private developer channel, where we discuss Lobe Chat development and share AI news from around the world.

@vercel vercel bot temporarily deployed to Preview – lobe-chat-database February 4, 2025 13:33 Inactive
github-actions bot pushed a commit that referenced this pull request Feb 4, 2025
### [Version&nbsp;1.50.4](v1.50.3...v1.50.4)
<sup>Released on **2025-02-04**</sup>

#### 🐛 Bug Fixes

- **misc**: Fix invalid utf8 character.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### What's fixed

* **misc**: Fix invalid utf8 character, closes [#5732](#5732) ([2905cb5](2905cb5))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>
@lobehubbot
Member

🎉 This PR is included in version 1.50.4 🎉

The release is available on:

Your semantic-release bot 📦🚀

@vercel vercel bot temporarily deployed to Preview – lobe-chat-preview February 4, 2025 13:59 Inactive
github-actions bot pushed a commit to bentwnghk/lobe-chat that referenced this pull request Feb 5, 2025
### [Version&nbsp;1.92.3](v1.92.2...v1.92.3)
<sup>Released on **2025-02-05**</sup>

#### 🐛 Bug Fixes

- **misc**: Fix invalid utf8 character.

#### 💄 Styles

- **misc**: Add/Update Aliyun Cloud Models, update GitHub Models, update model locale.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### What's fixed

* **misc**: Fix invalid utf8 character, closes [lobehub#5732](https://github.com/bentwnghk/lobe-chat/issues/5732) ([2905cb5](2905cb5))

#### Styles

* **misc**: Add/Update Aliyun Cloud Models, closes [lobehub#5613](https://github.com/bentwnghk/lobe-chat/issues/5613) ([95cd822](95cd822))
* **misc**: Update GitHub Models, closes [lobehub#5683](https://github.com/bentwnghk/lobe-chat/issues/5683) ([ed4e048](ed4e048))
* **misc**: Update model locale, closes [lobehub#5731](https://github.com/bentwnghk/lobe-chat/issues/5731) ([d3d26d7](d3d26d7))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>
Labels
📝 Documentation Improvements or additions to documentation | 文档问题 released size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Chunking failed [Bug] Chunking fails for some uploaded PDF files
2 participants