scrapling-schema

Schema-driven HTML extractor. Define extraction specs in Python (with full IDE type hints) or YAML, and get structured JSON out.

Install

pip install scrapling-schema

Requirements

Python >= 3.10
scrapling >= 0.4
PyYAML >= 6.0

Python API

Python type spec (recommended)

from scrapling_schema import Schema, Field, Options, Clear, RegexSub

spec = Schema(
    options=Options(clear=Clear(remove_tags=["script", "style"])),
    fields={
        "products": Field(
            css=".product",
            type="array<object>",
            fields={
                "sku":   Field(css="SELF", type="string", attr="data-sku"),
                "name":  Field(css=".name", type="string"),
                "url":   Field(css="a.link", type="string", attr="href"),
                "price": Field(css=".price", type="number", transform=[
                    RegexSub(pattern=r"[^0-9.]+"),
                ]),
                "tags":  Field(css=".tags li", type="array<string>"),
            },
        )
    },
)

result = spec.extract(html)
json_schema = spec.json_schema(title="Products")

Boolean fields (`type: "boolean"`)

Boolean output is derived from type, not a transform. The extractor coerces common truthy/falsey values:

truthy: "true", "t", "yes", "y", "on", "1" (case-insensitive, surrounding whitespace ignored)
falsey: "false", "f", "no", "n", "off", "0"
numbers: 1 → True, 0 → False (other numbers become None)

Python example:

from scrapling_schema import Schema, Field

html = "<span class='in-stock'> yes </span>"
spec = Schema(fields={"in_stock": Field(css=".in-stock", type="boolean")})

data = spec.extract(html)
assert data["in_stock"] is True

If you want invalid/missing values to fail fast, set nullable=False:

from scrapling_schema import Schema, Field, ValidationError

html = "<span class='in-stock'> maybe </span>"
spec = Schema(fields={"in_stock": Field(css=".in-stock", type="boolean", nullable=False)})

try:
    spec.extract(html)
except ValidationError:
    pass

YAML spec

options:
  clear:
    remove_tags: ["script", "style"]

fields:
  products:
    css: ".product"
    type: "array<object>"
    fields:
      sku:
        css: "SELF"
        type: "string"
        attr: "data-sku"
      name:
        css: ".name"
        type: "string"
      price:
        css: ".price"
        type: "number"
        transform:
          - regex_sub: { pattern: "[^0-9.]+", repl: "" }
  in_stock:
    css: ".in-stock"
    type: "boolean"

from scrapling_schema import extract_from_yaml

result = extract_from_yaml(html, yaml_spec)

CLI

scrapling-schema --spec spec.yml --html-file page.html
scrapling-schema --spec spec.yml --schema

Field reference

param	type	description
`css`	`str`	CSS selector. Use `"SELF"` to select the context node itself
`attr`	`str`	Extract an attribute value (or special `"innerHTML"`)
`type`	`str`	Output type: `"string"
`nullable`	`bool`	If `false`, missing values raise `ValidationError`
`defaultValue`	`any`	Fallback value used when the extracted value is empty
`fields`	`dict`	Nested fields (for `object` / `array<object>`)
`transform`	`list`	Transform pipeline (see below)
`callback`	`callable`	Field-level post-processing hook (Python API only)
`outputSchema`	`dict`	Override JSON Schema for this field (useful when `callback` changes the output type/shape)
`required`	`bool`	Raise `ValidationError` if value is empty

Notes:

type is required for every field.
Arrays must use type: "array<...>" (no items: and no list:).
attr supports special values:
- "innerHTML": extract HTML string from the selected node.
- "ownText": extract direct text for the selected node (excludes descendant text).

Transform reference

transform	shorthand	description
`RegexSub(pattern, repl)`	—	Regex substitution
`Split(delimiter)`	—	Split string into array items (requires `type:"array<...>"`)

Notes:

String outputs are stripped automatically (no transform needed).
Use field-level defaultValue for fallbacks (defaults are not supported inside transforms).

When to use `transform` vs `callback`

Both are meant for post-processing, but they work at different levels and have different ergonomics.

Use `transform` for value-centric pipelines

Good fit when you want a predictable, reusable pipeline on a single extracted value (e.g., regex cleanup, split).

Order of operations (scalar fields):

Extract raw string
Apply transform pipeline
Apply type coercion (number/integer/boolean)
Apply callback (if any)

Example: remove currency symbols before coercing to number:

from scrapling_schema import Schema, Field, RegexSub

spec = Schema(
    fields={
        "price": Field(
            css=".price",
            type="number",
            transform=[RegexSub(pattern=r"[^0-9.]+", repl="")],
        )
    }
)
data = spec.extract(html)

Use `callback` for whole-field logic (filtering, sorting, aggregation)

callback receives the final extracted value for the field:

scalar field → the scalar value (str|int|float|bool|None)
array<...> field → the whole list
object field → the whole dict

This is a better fit for list-level operations or aggregations.

When `callback` changes the output type/shape

callback is an arbitrary Python function, so the library cannot reliably infer a JSON Schema for its return value. If your callback changes the type/shape (e.g., list → object, object → string), set outputSchema on the field to keep spec.json_schema() in sync with the actual output.

Example: filter a list of objects (keep only items you care about):

from scrapling_schema.types import Schema, Field

def keep_only_a(items: list[dict]) -> list[dict]:
    return [item for item in items if "A" in item["name"]]

spec = Schema(
    fields={
        "products": Field(
            css=".item",
            type="array<object>",
            callback=keep_only_a,
            fields={
                "name": Field(css=".name", type="string"),
            },
        )
    }
)
data = spec.extract(html)

`array<object>` special case: `transform` is per-item

For type: "array<object>", transform is applied to each extracted object (each list element). If a transform returns None, the item is dropped from the list.

from scrapling_schema import extract

def drop_product_a(item: dict) -> dict | None:
    return None if item.get("name") == "Product A" else item

spec = {
    "fields": {
        "products": {
            "css": ".item",
            "type": "array<object>",
            "transform": [drop_product_a],
            "fields": {"name": {"css": ".name", "type": "string"}},
        }
    }
}
data = extract(html, spec)

YAML note

YAML specs support only the built-in transform steps (e.g., regex_sub, split). Python callables (transform: [my_fn] / callback: my_fn) are only supported via the Python API (typed Schema/Field or a Python dict spec), not via YAML text.

Testing

Install the dev dependencies (in a virtualenv) and run the test suite:

python -m pip install -e ".[dev]"
python -m pytest

Run a single test file:

python -m pytest tests/test_extractor.py

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
bin		bin
example		example
src/scrapling_schema		src/scrapling_schema
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
RELEASING.md		RELEASING.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scrapling-schema

Install

Requirements

Python API

Python type spec (recommended)

Boolean fields (`type: "boolean"`)

YAML spec

CLI

Field reference

Transform reference

When to use `transform` vs `callback`

Use `transform` for value-centric pipelines

Use `callback` for whole-field logic (filtering, sorting, aggregation)

When `callback` changes the output type/shape

`array<object>` special case: `transform` is per-item

YAML note

Testing

License

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

scrapling-schema

Install

Requirements

Python API

Python type spec (recommended)

Boolean fields (type: "boolean")

YAML spec

CLI

Field reference

Transform reference

When to use transform vs callback

Use transform for value-centric pipelines

Use callback for whole-field logic (filtering, sorting, aggregation)

When callback changes the output type/shape

array<object> special case: transform is per-item

YAML note

Testing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Boolean fields (`type: "boolean"`)

When to use `transform` vs `callback`

Use `transform` for value-centric pipelines

Use `callback` for whole-field logic (filtering, sorting, aggregation)

When `callback` changes the output type/shape

`array<object>` special case: `transform` is per-item

Packages