Merge branch 'master' into dependabot/pip/ipython-8.10.0
dogweather authored Dec 25, 2024
2 parents fd6511d + b55c9ac commit 9c4338f
Showing 3 changed files with 51 additions and 26 deletions.
36 changes: 28 additions & 8 deletions README.md
@@ -2,18 +2,19 @@



# The new Dale-Chall readability formula
# Calculate the grade level of a text passage

I wrote this by ordering a copy of _Readability Revisited: The new Dale-Chall readability formula_. I used the book to code the library from scratch.
**Easily and accurately calculate a text's readability.**


**Installation:**

## Installation:

```bash
$ pip install new-dale-chall-readability
```

**Let's try it out:**
## Let's try it out:

```bash
$ ipython
@@ -30,23 +31,42 @@ In [2]: text = (

In [3]: reading_level(text)
Out[3]: '7-8'
```

_So it's grade 7–8 reading level._

```python
In [4]: cloze_score(text)
Out[4]: 36.91
```

## What's a "cloze score" and "reading level"?
_And yep, the 36.91 cloze score says it's moderately difficult._

So how is this useful? Well, here's one way:

<img width="458" alt="Screenshot 2023-06-14 at 4 31 18 PM" src="https://github.com/public-law/new-dale-chall-readability/assets/150670/8fb35c94-881a-4654-b48c-9d73ed914737">

My legal dictionary orders entries like [amicus curiae](https://www.public.law/dictionary/entries/amicus-curiae) from simplest to most complex. I think it helps with comprehension and learning. I coded the numeric cloze score as the sort key.
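The sort described above can be sketched as follows. This is a minimal illustration, not the dictionary's actual code: `simplest_first` and `toy_score` are hypothetical names, and `toy_score` is a crude stand-in so the snippet runs without the library installed. In real use, `cloze_score` would be passed as the scoring function. Because a higher cloze score means more readable text, the sort is descending.

```python
from collections.abc import Callable, Iterable


def simplest_first(
    texts: Iterable[str], score: Callable[[str], float]
) -> list[str]:
    """Order texts from most readable to least readable.

    `score` is any function returning a number where higher means easier,
    which is how the README describes the cloze score.
    """
    return sorted(texts, key=score, reverse=True)


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer: shorter average word length scores higher.

    This is NOT the Dale-Chall formula; it only makes the example runnable.
    """
    words = text.split()
    return -sum(len(w) for w in words) / len(words)


definitions = [
    "Jurisdictional prerequisite",
    "Friend of the court",
]
ordered = simplest_first(definitions, toy_score)
print(ordered)
```

Passing the scorer in as a parameter keeps the ordering logic independent of any one readability metric.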

**Cloze** is a deletion test invented by Taylor (1953). The **36.91** score, above, means that roughly 37% of the words could be deleted and the passage could still be understood. So, a
higher cloze score is more readable. They "range from 58 and above for the easiest passages to 10-15 and below for the most difficult" (Chall & Dale, p. 75).


## What's "reading level" and "cloze score"?

**Reading level** is the grade level of the material, in years of education. The scale is from
**1** to **16+**.

**Cloze** is a deletion test invented by Taylor (1953). The `36.91` score, above, means that roughly 37% of the words could be deleted and the passage could still be understood. So, **a
higher cloze score is more readable**. They "range from 58 and above for the easiest passages to 10-15 and below for the most difficult" (Chall & Dale, p. 75).

See [the integration test file](https://github.com/public-law/new-dale-chall-readability/blob/master/tests/integration_test.py) for text samples from the book, along with their scores.

## Why yet another readability library?

Before creating this, I tried really hard to find a readability library that gave correct results, and also seemed to be a good algorithm. I realized I really like Dale-Chall. But I found show-stopping bugs in the existing libraries that cause them to give wrong answers.

There are a ton of low-effort blog posts about Dale-Chall: they all seem to have different ideas about how it works. So I wrote this by first ordering a copy of _Readability Revisited: The new Dale-Chall readability formula_. Then I used the book to code the library from scratch. My goal was to create the best library I could for analyzing text. It needs to start with giving correct results. I did my best to rigorously design and test the code. And secondly, it needs to be modern Python code that's super easy to use. So, no objects to instantiate and no odd module naming. Just a couple of functions which can be called.


## Why yet another Dale-Chall readability library?

It's 2022 and there are probably a half-dozen implementations on PyPI.
So why create another one?
14 changes: 9 additions & 5 deletions new_dale_chall_readability/utils.py
@@ -1,7 +1,13 @@
import re
import warnings

from bs4 import BeautifulSoup
from .easy_words import EASY_WORDS as _EASY_WORDS

# Ignore MarkupResemblesLocatorWarning and other user warnings
# because this is library code.
warnings.filterwarnings("ignore", category=UserWarning, module="bs4")


def pct_unfamiliar_words(text: str) -> float:
    words = _words(text)
@@ -26,8 +32,6 @@ def _words(in_text: str) -> tuple[str, ...]:
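Review note: the module-level `warnings.filterwarnings` call in this hunk installs its filter for the whole process as soon as the module is imported. As a hedged alternative sketch (not what this commit does), the standard library's `warnings.catch_warnings` context manager can limit suppression to a single call. `noisy_parse` is a hypothetical stand-in for a parser call that emits a `UserWarning`; it is not part of this library or of BeautifulSoup.

```python
import warnings


def noisy_parse(markup: str) -> str:
    """Hypothetical stand-in for a parser call that warns on odd input."""
    warnings.warn("input looks like a locator, not markup", UserWarning)
    return markup.strip()


def quiet_parse(markup: str) -> str:
    """Call noisy_parse with UserWarning suppressed for this call only."""
    with warnings.catch_warnings():
        # The filter added here is discarded when the block exits.
        warnings.simplefilter("ignore", category=UserWarning)
        return noisy_parse(markup)


# Record warnings to show that only the unscoped call emits one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    quiet_result = quiet_parse("  report.html  ")
    count_after_quiet = len(caught)
    noisy_parse("report.html")
    count_after_noisy = len(caught)

print(quiet_result, count_after_quiet, count_after_noisy)
```

The trade-off is a little overhead per call in exchange for leaving the caller's own warning configuration untouched.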


def _is_unfamiliar(word: str) -> bool:
    match word:
        case number if re.match(r"\d+$", number):
            return False
        case _:
            return word not in _EASY_WORDS
    if word.isdigit():  # Faster and simpler check for pure numbers
        return False
    return word not in _EASY_WORDS
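Review note: `str.isdigit()` and `re.match(r"\d+$", ...)` agree on ordinary numbers but differ on a few edge cases. The sketch below compares the two checks in isolation; the helper names are hypothetical. Whether such tokens can ever reach `_is_unfamiliar` depends on `_words`, which is not shown in this diff, so treat this as background rather than a reported bug.

```python
import re


def is_number_regex(word: str) -> bool:
    """The check used before this commit."""
    return re.match(r"\d+$", word) is not None


def is_number_isdigit(word: str) -> bool:
    """The check used after this commit."""
    return word.isdigit()


# Ordinary cases: both checks agree.
agree = ["1953", "37%", "", "cloze"]

# Edge cases: a superscript digit, and a number with a trailing newline.
differ = ["\u00b2", "42\n"]

for sample in agree + differ:
    print(repr(sample), is_number_regex(sample), is_number_isdigit(sample))
```

The superscript case differs because `\d` matches only decimal digits while `isdigit()` also accepts other digit characters; the newline case differs because `$` can match just before a trailing newline.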
27 changes: 14 additions & 13 deletions pyproject.toml
@@ -1,22 +1,23 @@
[tool.poetry]
name = "new-dale-chall-readability"
version = "1.0.12"
version = "1.0.13"
description = "Implements the New Dale-Chall readability formula. Its output is tested against samples from the original publication."
authors = ["Robb Shecter <robb@public.law>"]
license = "MIT"
homepage = "https://github.com/public-law/new-dale-chall-readability"
repository = "https://github.com/public-law/new-dale-chall-readability"
homepage = "https://github.com/public-law/readability"
repository = "https://github.com/public-law/readability"
keywords = ["nlp", "readability", "dale-chall"]
classifiers = [
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Topic :: Text Processing :: Linguistic",
"Typing :: Typed"]
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Topic :: Text Processing :: Linguistic",
"Typing :: Typed",
]
readme = "README.md"


@@ -56,7 +57,7 @@ reportUnusedImport = "warning"
[tool.pytest.ini_options]
minversion = "7.1"
pythonpath = "."
python_files = ["*_test.py",]
python_files = ["*_test.py"]
python_classes = ["Test", "Describe"]
python_functions = ["test_", "it_", "and_", "but_", "they_"]
addopts = "-q --no-header --doctest-modules"