Force utf8 encoding on Windows
jncraton committed Jan 23, 2025
1 parent c84b018 commit fc9474b
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions languagemodels/preprocess.py
@@ -18,10 +18,10 @@ def get_html_paragraphs(src: str):
 3. Convert any newly merged text element with at least `min_length`
 characters to a paragraph in the output text.
->>> get_html_paragraphs(open("test/wp.html").read())
+>>> get_html_paragraphs(open("test/wp.html", encoding="utf-8").read())
 'Bolu Province (Turkish: Bolu ili) is a province...'
->>> get_html_paragraphs(open("test/npr.html").read())
+>>> get_html_paragraphs(open("test/npr.html", encoding="utf-8").read())
 "First, the good news. Netflix reported a record ..."
 """
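
Note on the change: without an `encoding` argument, Python's `open()` decodes text with the platform's locale encoding, which on Windows is often a legacy code page rather than UTF-8. The sketch below illustrates the likely failure mode the commit guards against. It is not taken from the repository: the sample string, the temporary file, and the choice of `cp1252` as the stand-in Windows default are all assumptions made for illustration.

```python
import os
import tempfile

# Sample text with a non-ASCII character (an assumption standing in for
# the contents of an HTML fixture such as test/wp.html).
text = "Bolu ili, Türkiye"

# Save the text as UTF-8 bytes in a temporary file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as f:
    f.write(text)
    path = f.name

try:
    # Simulate a Windows locale default; cp1252 is assumed here.
    with open(path, encoding="cp1252") as f:
        legacy = f.read()

    # Explicit UTF-8, as in the updated doctests.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()
finally:
    os.remove(path)

print(explicit == text)  # True: explicit UTF-8 round-trips the text
print(legacy == text)    # False: the legacy code page mis-decodes "ü"
```

Passing `encoding="utf-8"` makes the doctests read the same characters on every platform, independent of the locale settings of the machine running them.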
