Commit f7c6d07

mthrokci authored and committed
Update docs
0 parents  commit f7c6d07

File tree

488 files changed, +133819 −0 lines changed


.nojekyll

Whitespace-only changes.

index.html

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
<!DOCTYPE html>
<html>
<head>
<title>SPDL</title>
</head>
<body>
<p>Redirecting to the main document.</p>
<script>
window.location.href = "./main/index.html";
</script>
</html>

main/_sources/api.rst.txt

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
API Reference
=============

.. autosummary::
   :toctree: generated
   :template: _custom_autosummary_module.rst
   :caption: API Reference
   :recursive:

   spdl.io
   spdl.pipeline
   spdl.dataloader
   spdl.source
   spdl.utils

main/_sources/best_practice.rst.txt

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
Best Practices
==============

Avoid creating intermediate tensors
-----------------------------------

For efficient and performant data processing, it is advised not to create an intermediate Tensor for each individual media object (such as a single image), but instead to create a batch Tensor directly.

We recommend decoding individual frames, then using :py:func:`spdl.io.convert_frames` to create a batch Tensor directly, without creating intermediate Tensors.

If you are decoding a batch of images and you have a pre-determined set of images that should go together into the same batch, use :py:func:`spdl.io.load_image_batch` (or its async variant :py:func:`spdl.io.async_load_image_batch`).

Otherwise, demux, decode and pre-process multiple media, then combine them with :py:func:`spdl.io.convert_frames` (or :py:func:`spdl.io.async_convert_frames`). For example, the following functions implement decoding and tensor creation separately.

.. code-block::

    import spdl.io
    from spdl.io import ImageFrames
    from torch import Tensor

    def decode_image(src: str) -> ImageFrames:
        packets = spdl.io.demux_image(src)
        return spdl.io.decode_packets(packets)

    def batchify(frames: list[ImageFrames]) -> Tensor:
        buffer = spdl.io.convert_frames(frames)
        return spdl.io.to_torch(buffer)

They can be combined in a :py:class:`~spdl.pipeline.Pipeline`, which automatically discards items that fail to process (for example, due to invalid data) and keeps the batch size consistent by filling the batch with other successfully processed items.

.. code-block::

    from spdl.pipeline import PipelineBuilder

    pipeline = (
        PipelineBuilder()
        .add_source(...)
        .pipe(decode_image, concurrency=...)
        .aggregate(32)
        .pipe(batchify)
        .add_sink(3)
        .build(num_threads=...)
    )

.. seealso::

    :py:mod:`multi_thread_preprocessing`
Make Dataset class composable
-----------------------------

If you are publishing a dataset and providing an implementation of a `Dataset` class, we recommend making it composable.

That is, in addition to the conventional ``Dataset`` class that returns Tensors, make the components of the ``Dataset`` implementation available by breaking the implementation down into

* an iterator (or map) interface that returns paths instead of Tensors, and
* a helper function that loads the source path into a Tensor.

For example, the interface of a ``Dataset`` for image classification might look like the following.

.. code-block::

    class Dataset:
        def __getitem__(self, key: int) -> tuple[Tensor, int]:
            ...

We recommend separating the source and the processing, and making both part of the public interface. (Also, as described above, we recommend not converting each item into a ``Tensor``, for performance reasons.)

.. code-block::

    class Source:
        def __getitem__(self, key: int) -> tuple[str, int]:
            ...

    def load(data: tuple[str, int]) -> tuple[ImageFrames, int]:
        ...

And if the processing is composed of stages with different bounding factors, split them further into primitive functions.

.. code-block::

    def download(src: tuple[str, int]) -> tuple[bytes, int]:
        ...

    def decode_and_preprocess(data: tuple[bytes, int]) -> tuple[ImageFrames, int]:
        ...

Then the original ``Dataset`` can be implemented as a composition.

.. code-block::

    class Dataset:
        def __init__(self, ...):
            self._src = Source(...)

        def __getitem__(self, key: int) -> tuple[Tensor, int]:
            metadata = self._src[key]
            item = download(metadata)
            frames, cls = decode_and_preprocess(item)
            tensor = spdl.io.to_torch(frames)
            return tensor, cls

Such decomposition makes the dataset compatible with SPDL's Pipeline, and allows users to run it more efficiently.

.. code-block::

    pipeline = (
        PipelineBuilder()
        .add_source(Source(...))
        .pipe(download, concurrency=8)
        .pipe(decode_and_preprocess, concurrency=4)
        ...
        .build(...)
    )
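As a framework-free illustration of this decomposition (every name below is a toy stand-in, not the real SPDL or dataset API), the public primitives compose back into the conventional monolithic ``__getitem__``:

```python
class Source:
    """Map-style interface that returns (path, label) pairs, not Tensors."""

    def __init__(self, items: list[tuple[str, int]]) -> None:
        self._items = items

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, key: int) -> tuple[str, int]:
        return self._items[key]


def download(src: tuple[str, int]) -> tuple[bytes, int]:
    # Stand-in for a network fetch.
    path, label = src
    return f"raw bytes of {path}".encode(), label


def decode_and_preprocess(data: tuple[bytes, int]) -> tuple[str, int]:
    # Stand-in for decoding + transforming; a real one would return frames.
    raw, label = data
    return raw.decode().upper(), label


class Dataset:
    """The conventional interface, recovered by composing the primitives."""

    def __init__(self, items: list[tuple[str, int]]) -> None:
        self._src = Source(items)

    def __getitem__(self, key: int) -> tuple[str, int]:
        return decode_and_preprocess(download(self._src[key]))


dataset = Dataset([("a.jpg", 0), ("b.jpg", 1)])
```

Because ``Source``, ``download`` and ``decode_and_preprocess`` are public, users can feed them to a pipeline stage by stage instead of calling ``Dataset.__getitem__`` serially.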

main/_sources/cpp.rst.txt

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
API References (C++)
====================

.. toctree::
   :caption: API References (C++)

   Class List <generated/libspdl/libspdl_class>
   File List <generated/libspdl/libspdl_file>
   API <generated/libspdl/libspdl_api>

main/_sources/examples.rst.txt

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
Examples
========

.. autosummary::
   :toctree: generated
   :template: _custom_autosummary_example.rst
   :caption: Examples
   :recursive:

   image_dataloading
   video_dataloading
   imagenet_classification
   multi_thread_preprocessing

main/_sources/faq.rst.txt

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
Frequently Asked Questions
==========================

How to work around the GIL?
---------------------------

In Python, the GIL (Global Interpreter Lock) practically prevents the use of multi-threading. However, extension modules written in low-level languages, such as C, C++ and Rust, can release the GIL when executing operations that do not interact with the Python interpreter.

Many libraries used for data loading release the GIL. To name a few:

- Pillow
- OpenCV
- Decord
- tiktoken

Typically, the bottleneck of model training is in loading and pre-processing the media data. So even though there are still parts of the pipeline that are constrained by the GIL, by taking advantage of pre-processing functions that release the GIL, we can achieve high throughput.
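As a minimal stdlib sketch of this effect (``hashlib`` is just an illustrative stand-in for a decoding routine): ``hashlib`` releases the GIL while hashing large buffers, so mapping it over a thread pool lets the work overlap across cores.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(data: bytes) -> str:
    # hashlib releases the GIL while hashing large buffers,
    # so several threads can make progress in parallel.
    return hashlib.sha256(data).hexdigest()

blobs = [bytes([i]) * 1_000_000 for i in range(8)]

# The threads are not serialized by the GIL during the hashing itself.
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest, blobs))
```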
What if a function does not release the GIL?
--------------------------------------------

In case you need to use a function that takes a long time to execute (e.g. network utilities) but does not release the GIL, you can delegate the stage to a sub-process.

The :py:meth:`spdl.pipeline.PipelineBuilder.pipe` method takes an optional ``executor`` argument. The default behavior of the ``Pipeline`` is to use the thread pool shared among all stages. If you pass an instance of :py:class:`concurrent.futures.ProcessPoolExecutor` instead, that stage will execute the function in a sub-process.

.. code-block::

    executor = ProcessPoolExecutor(max_workers=num_processes)

    pipeline = (
        PipelineBuilder()
        .add_source(src)
        .pipe(stage1, executor=executor, concurrency=num_processes)
        .pipe(stage2, ...)
        .pipe(stage3, ...)
        .add_sink(1)
        .build()
    )

This will build a pipeline like the following.

.. include:: ./plots/faq_subprocess_chart.txt

.. note::

   Along with the function arguments and return values, the function itself is also serialized and passed to the sub-process. Therefore, the function to be executed must be a plain function. Closures and class methods cannot be passed.
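A quick way to see this restriction (a stdlib sketch, independent of SPDL): module-level functions pickle by reference to their name, while locally defined closures cannot be pickled at all.

```python
import pickle

def plain(x):
    # Module-level function: pickled by its qualified name, so a worker
    # process can re-import and call it.
    return x + 1

def make_closure():
    n = 1
    def inner(x):  # local function: not importable by name
        return x + n
    return inner

pickled_ok = pickle.dumps(plain)  # succeeds

try:
    pickle.dumps(make_closure())
    closure_ok = True
except Exception:  # AttributeError / PicklingError, depending on version
    closure_ok = False
```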
.. tip::

   If you need to perform one-time initialization in the sub-process, you can use the ``initializer`` and ``initargs`` arguments.

   The values passed as ``initializer`` and ``initargs`` must be picklable. If the object you need does not support pickling, you can instead pass its constructor arguments and store the resulting object in the global scope. See also https://stackoverflow.com/a/68783184/3670924.

   Example

   .. code-block::

       def init_resource(*args):
           global rsc
           rsc = ResourceClass(*args)

       def process_with_resource(item):
           global rsc
           return rsc.process(item)

       executor = ProcessPoolExecutor(
           max_workers=4,
           mp_context=None,
           initializer=init_resource,
           initargs=(...),
       )

       pipeline = (
           PipelineBuilder()
           .add_source()
           .pipe(
               process_with_resource,
               executor=executor,
               concurrency=4,
           )
           .add_sink(3)
           .build()
       )
Which functions hold the GIL?
-----------------------------

The following is a list of functions that, as far as we are aware, hold the GIL. It is advised to use them with a ``ProcessPoolExecutor``, or to avoid using them in SPDL.

* `np.load <https://github.com/numpy/numpy/blob/maintenance/2.1.x/numpy/lib/_npyio_impl.py#L312-L500>`_
Why Async I/O?
--------------

When training a model with a large amount of data, the data are retrieved from remote locations. Network utilities often provide APIs based on Async I/O.

Async I/O makes it easy to build a complex data pre-processing pipeline and execute it while automatically parallelizing parts of the pipeline, achieving high throughput.

Synchronous operations that release the GIL can be converted to async operations easily by running them in a thread pool. So by converting the synchronous pre-processing functions that release the GIL into asynchronous operations, the entire data pre-processing pipeline can be executed in an async event loop. The event loop handles the scheduling of the data processing functions and executes them concurrently.