88 commits
28ca07e
Expand explicit dataclasses as precursor for re-enabling parquet and …
Aug 14, 2025
08b6215
Expose dataclasses
Aug 14, 2025
4fd87d7
Add IOFactory for dataclass to dict conversions, DatasetJoinableSpec …
Aug 14, 2025
f975f09
Expose IOFactory and DatasetJoinableSpec
Aug 14, 2025
f147ac1
Thread DatasetSpecs through preprocess function for handling, so user…
Aug 14, 2025
a927854
Update src/coffea/dataset_tools/__init__.py
NJManganelli Aug 14, 2025
c073b51
Update src/coffea/dataset_tools/__init__.py
NJManganelli Aug 14, 2025
60db77b
Update src/coffea/dataset_tools/preprocess.py
NJManganelli Aug 14, 2025
922602a
Error when filespec_to_dict receives an incompatible UprootFileSpec o…
Aug 14, 2025
f519f0b
Rearrange dataclass inheritance and expand iofactory method for conve…
Aug 15, 2025
6485b57
Add tests for dataclasses and iofactory, modified from suggested set …
Aug 15, 2025
bc93fb3
precommit fixes
Aug 15, 2025
e599197
Ensure simple filename:object_path dictionaries are converted to Coff…
Aug 16, 2025
2fc17d3
Add preprocess test to iofactory suite
Aug 16, 2025
e253bab
appease pre-commit
Aug 16, 2025
5fae595
Rewrite dataclasses as pydantic models with mostly self-validation an…
Aug 18, 2025
8043c2f
Partial tests of pydantic dataclasses, will need streamlining of IOFa…
Aug 18, 2025
48274b8
Remove dataclass-based types, import from coffea/dataset_tools/filesp…
Aug 18, 2025
e22120c
Updae imports to filespec.py for pydantic classes
Aug 18, 2025
41624e6
Appease pre-commit
Aug 18, 2025
84d8c4e
fixup filespec import
Aug 18, 2025
41f435c
fix pylance errors and remove unnecessary Optional models
Aug 18, 2025
8951e59
Remove iofactory.py
Aug 18, 2025
fb801fa
Remove DatasetSpecOptional from apply_processor.py
Aug 18, 2025
7b36503
Start cleaning up filespec.py: remove DatasetSpecOptional, tests embe…
Aug 18, 2025
2fda9c6
Let copilot convert some comprehensive tests from the original draft …
Aug 18, 2025
2e08862
Add json serialization roundtripping, and clean up more copilot cruft
Aug 18, 2025
fbacdbb
Add more assertions for the non-trivial DatasetSpec with mixed concre…
Aug 18, 2025
be5d7e1
Separate identify_file_format from IOFactory, more cleanup, add 'join…
Aug 18, 2025
6e4f686
appease pre-commit
Aug 18, 2025
895049a
pre-commit and parametrizing some copilot tests to cleanup, add a few…
Aug 18, 2025
a69591f
Remove Joinable spec
Aug 18, 2025
4628c9f
is None fix
Aug 18, 2025
7a38300
Fix promotion logic
Aug 18, 2025
7ef4792
promotion tests for CoffeaFileDict, DatasetSpec, FilesetSpec
Aug 18, 2025
d095991
Need a deepcopy on some of these filespec.py classes
Aug 19, 2025
dc4e94e
Make slice_files and filter_files compatible with pydantic models
Aug 19, 2025
96956a0
Add pydantic models to filter_files and slice_files tests
Aug 19, 2025
166f7bb
remove debug statement
Aug 19, 2025
9750ea9
Handle FileSpec in chunk slicing logic
Aug 19, 2025
1ce5cd2
Tests for pydantic filespec in slice_files, slice_chunks methods
Aug 19, 2025
0711253
failed_files support for filespec
Aug 19, 2025
18abc3d
test for failed_files with filespec
Aug 19, 2025
4358e44
simplest support for apply_to_fileset|dataset for pydantic models
Aug 19, 2025
364b9db
Test apply_to_fileset on FilesetSpec
Aug 19, 2025
d43760a
preprocess pydantic model
Aug 19, 2025
123bcc9
parametrize dict and FilesetSpec inputs for test
Aug 19, 2025
efc8bfb
appease pre-commit overlords
Aug 19, 2025
b8ba107
Add pydantic as hard dependency
Aug 19, 2025
fa98172
Fixup computed format field on CoffeaFileDict
Aug 19, 2025
6336ab9
Remove debug print
Aug 19, 2025
3315090
Remove unnecessary identify_format from IOFactory (standalone functio…
Aug 19, 2025
164b06d
Fixup identify_format tests
Aug 19, 2025
e97f987
Add filespec.ipynb examples after cleaning them up from copilot
Aug 19, 2025
69c236d
Add filespec.ipynb to the index for docs
Aug 19, 2025
2e7fe6f
pre-commit updates
Aug 19, 2025
6893e06
Fallback to import Self from typing_extensions prior to python 3.11
Aug 19, 2025
1fabc7f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 19, 2025
d05d888
Update to explicit Union for pydantic Models for python 3.9
Aug 20, 2025
08ad634
Until python 3.9 EOL (October) need this small package to make pydant…
Aug 20, 2025
6d8a76d
it comes in any color you like, so long as it's black
Aug 20, 2025
7956a4e
Protect against mixed (non-canonical) FilesetSpec in preprocess
Aug 20, 2025
bc7567f
Add tests for empty and mixed FilesetSpec in preprocess function
Aug 20, 2025
456862c
Add .pq identification and permit searching for indicator in middle o…
Aug 29, 2025
eff15fc
Factorize the form and format functions to share code between validat…
Aug 29, 2025
23ee888
Document IOFactory, showing the still-useful componenets and making a…
Aug 29, 2025
0ec2939
Merge branch 'master' into parquet-precursor-pydantic-datafactory
lgray Sep 1, 2025
d622338
Merge branch 'master' into parquet-precursor-pydantic-datafactory
ikrommyd Sep 8, 2025
b34f4c7
Enforce pydantic class over dictionary
Sep 24, 2025
ad2daa5
UprootFileSpec -> ROOTFileSpec
Sep 25, 2025
03b8d18
Factorize GenericFileSpec, switch to raise exception on format, pre-c…
Sep 29, 2025
b0a2bee
Split CoffeaFileDict into InputFiles and PreprocessedFiles, the latte…
Sep 30, 2025
ab0f265
Update test from CoffeaFileDict to InputFiles and PreprocessedFiles
Sep 30, 2025
80ffea4
precommit updates
Sep 30, 2025
71ba0ff
Update filespec.ipynb to InputFiles and PreprocessedFiles
Sep 30, 2025
76ed528
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 30, 2025
c4393d2
Merge branch 'master' into parquet-precursor-pydantic-datafactory
lgray Sep 30, 2025
7c21cff
Merge branch 'master' into parquet-precursor-pydantic-datafactory
lgray Oct 2, 2025
c52e944
Merge branch 'master' into parquet-precursor-pydantic-datafactory
lgray Oct 10, 2025
d3b2576
Merge branch 'master' into parquet-precursor-pydantic-datafactory
ikrommyd Oct 15, 2025
9c0162d
Merge branch 'master' into parquet-precursor-pydantic-datafactory
NJManganelli Oct 24, 2025
8466b0f
Make form a computed field, and compressed_form the original compress…
Oct 28, 2025
2f2beaf
propagate form + compressed_form through preprocess
Oct 28, 2025
3f09b66
propagate form + compressed_form through apply_processor
Oct 28, 2025
ebeddfd
propagate the form and compressed_form changes through tests
Oct 28, 2025
05d4e1b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 28, 2025
5ed4532
Remove single StepPair as option
Oct 28, 2025
0bad523
propagate single StepPair removal to tests
Oct 28, 2025
2 changes: 1 addition & 1 deletion docs/source/examples.rst
@@ -5,7 +5,7 @@ The following pages are rendered jupyter notebooks that provide an overview and
Each notebook builds on the previous one so it is recommended to go through them in order.

.. toctree::

notebooks/filespec.ipynb
notebooks/nanoevents.ipynb
notebooks/applying_corrections.ipynb
notebooks/packedselection.ipynb
3,619 changes: 3,619 additions & 0 deletions docs/source/notebooks/filespec.ipynb

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -58,6 +58,8 @@ dependencies = [
"mplhep>=0.1.18",
"packaging",
"pandas",
"pydantic",
"eval_type_backport; python_version < '3.10'", #TODO: remove after python 3.9 EOL (filespec.py type Unions)
"hist>=2",
"cachetools",
"requests",
24 changes: 24 additions & 0 deletions src/coffea/dataset_tools/__init__.py
@@ -1,4 +1,17 @@
from coffea.dataset_tools.apply_processor import apply_to_dataset, apply_to_fileset
from coffea.dataset_tools.filespec import (
CoffeaParquetFileSpec,
CoffeaParquetFileSpecOptional,
CoffeaROOTFileSpec,
CoffeaROOTFileSpecOptional,
DatasetSpec,
FilesetSpec,
InputFiles,
IOFactory,
ParquetFileSpec,
PreprocessedFiles,
ROOTFileSpec,
)
from coffea.dataset_tools.manipulations import (
filter_files,
get_failed_steps_for_dataset,
@@ -23,4 +36,15 @@
"slice_files",
"get_failed_steps_for_dataset",
"get_failed_steps_for_fileset",
"ROOTFileSpec",
"ParquetFileSpec",
"CoffeaROOTFileSpec",
"CoffeaROOTFileSpecOptional",
"CoffeaParquetFileSpec",
"CoffeaParquetFileSpecOptional",
"InputFiles",
"PreprocessedFiles",
"DatasetSpec",
"FilesetSpec",
"IOFactory",
]
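The classes exported above are pydantic models, whose central feature in this PR is round-tripping between plain dicts and validated objects via `model_validate`/`model_dump`. A minimal stdlib-only sketch of that pattern (`FileEntry`/`DatasetFiles` are illustrative stand-ins, not coffea's actual classes, which inherit these methods from pydantic):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for coffea's pydantic filespec models; the real
# classes get model_validate/model_dump from pydantic.BaseModel.

@dataclass
class FileEntry:
    object_path: str
    steps: Optional[list] = None

@dataclass
class DatasetFiles:
    files: dict  # filename -> FileEntry

    @classmethod
    def model_validate(cls, data):
        # Accept the plain {"files": {filename: {"object_path": ...}}} shape,
        # or pass through entries that are already FileEntry instances.
        files = {
            fname: spec if isinstance(spec, FileEntry) else FileEntry(**spec)
            for fname, spec in data["files"].items()
        }
        return cls(files=files)

    def model_dump(self):
        # Convert back to the plain-dict form consumed by downstream tools.
        return {
            "files": {
                fname: {"object_path": e.object_path, "steps": e.steps}
                for fname, e in self.files.items()
            }
        }

spec = DatasetFiles.model_validate(
    {"files": {"nano.root": {"object_path": "Events"}}}
)
print(spec.files["nano.root"].object_path)  # Events
```

The round trip `model_validate(spec.model_dump())` is lossless here, which is what the PR's JSON-serialization round-trip tests exercise for the real models.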
28 changes: 13 additions & 15 deletions src/coffea/dataset_tools/apply_processor.py
@@ -4,19 +4,15 @@
from collections.abc import Hashable
from typing import Any, Callable, Union

import awkward
import dask.base
import dask_awkward

from coffea.dataset_tools.preprocess import (
from coffea.dataset_tools.filespec import (
DatasetSpec,
DatasetSpecOptional,
FilesetSpec,
FilesetSpecOptional,
)
from coffea.nanoevents import BaseSchema, NanoAODSchema, NanoEventsFactory
from coffea.processor import ProcessorABC
from coffea.util import decompress_form

DaskOutputBaseType = Union[
dask.base.DaskMethodsMixin,
@@ -34,7 +30,7 @@

def apply_to_dataset(
data_manipulation: ProcessorABC | GenericHEPAnalysis,
dataset: DatasetSpec | DatasetSpecOptional,
dataset: DatasetSpec | dict,
schemaclass: BaseSchema = NanoAODSchema,
metadata: dict[Hashable, Any] = {},
uproot_options: dict[str, Any] = {},
@@ -46,7 +42,7 @@ def apply_to_dataset(
----------
data_manipulation : ProcessorABC or GenericHEPAnalysis
The user analysis code to run on the input dataset
dataset: DatasetSpec | DatasetSpecOptional
dataset: DatasetSpec | dict
The data to be acted upon by the data manipulation passed in.
schemaclass: BaseSchema, default NanoAODSchema
The nanoevents schema to interpret the input dataset with.
@@ -62,12 +58,12 @@
report : dask_awkward.Array, optional
The file access report for running the analysis on the input dataset. Needs to be computed in simultaneously with the analysis to be accurate.
"""
maybe_base_form = dataset.get("form", None)
if maybe_base_form is not None:
maybe_base_form = awkward.forms.from_json(decompress_form(maybe_base_form))
files = dataset["files"]
if isinstance(dataset, dict):
dataset = DatasetSpec.model_validate(dataset)
maybe_base_form = dataset.form
files = dataset.files
events = NanoEventsFactory.from_root(
files,
files.model_dump(),
[Review comment, Member]: NanoEventsFactory.from_root could accept type InputFiles. This could be for a future PR since this changeset focuses on dataset_tools only. If so, let's make an issue
metadata=metadata,
schemaclass=schemaclass,
known_base_form=maybe_base_form,
@@ -94,7 +90,7 @@

def apply_to_fileset(
data_manipulation: ProcessorABC | GenericHEPAnalysis,
fileset: FilesetSpec | FilesetSpecOptional,
fileset: FilesetSpec | dict,
schemaclass: BaseSchema = NanoAODSchema,
uproot_options: dict[str, Any] = {},
) -> dict[str, DaskOutputType] | tuple[dict[str, DaskOutputType], dask_awkward.Array]:
@@ -105,7 +101,7 @@ def apply_to_fileset(
----------
data_manipulation : ProcessorABC or GenericHEPAnalysis
The user analysis code to run on the input dataset
fileset: FilesetSpec | FilesetSpecOptional
fileset: FilesetSpec
The data to be acted upon by the data manipulation passed in. Metadata within the fileset should be dask-serializable.
schemaclass: BaseSchema, default NanoAODSchema
The nanoevents schema to interpret the input dataset with.
@@ -119,10 +115,12 @@
report : dask_awkward.Array, optional
The file access report for running the analysis on the input dataset. Needs to be computed in simultaneously with the analysis to be accurate.
"""
if isinstance(fileset, dict):
fileset = FilesetSpec.model_validate(fileset)
out = {}
report = {}
for name, dataset in fileset.items():
metadata = copy.deepcopy(dataset.get("metadata", {}))
metadata = copy.deepcopy(dataset.metadata)
if metadata is None:
metadata = {}
metadata.setdefault("dataset", name)
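The diff applies the same normalization idiom in both entry points: accept either a plain dict or the validated model, and convert at the boundary so the function body only ever sees the model type. A hedged stdlib-only sketch of that idiom (these names are stand-ins, not coffea's API):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FilesetSpec:  # stand-in for coffea's pydantic FilesetSpec
    datasets: dict

    @classmethod
    def model_validate(cls, data: dict) -> "FilesetSpec":
        return cls(datasets=dict(data))

    def items(self):
        return self.datasets.items()

def apply_to_fileset(fileset) -> dict:
    # Normalize once at the boundary, as the PR does with
    # isinstance(fileset, dict) -> FilesetSpec.model_validate(fileset),
    # so the loop below only handles the validated model.
    if isinstance(fileset, dict):
        fileset = FilesetSpec.model_validate(fileset)
    out = {}
    for name, dataset in fileset.items():
        out[name] = len(dataset.get("files", {}))
    return out

# Both call styles yield the same result:
raw = {"ttbar": {"files": {"a.root": "Events", "b.root": "Events"}}}
assert apply_to_fileset(raw) == apply_to_fileset(FilesetSpec.model_validate(raw))
```

Normalizing at the entry point keeps the dict-based call signature working for existing users while letting the rest of the code rely on model attributes (`dataset.metadata`, `dataset.form`) instead of `dict.get` lookups.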