Data point discontinuous in sampling with `--load_fast` (aka Rust board) implementation #6796

way-zer · 2024-03-18T02:38:44Z

Environment information (required)

Diagnostics

Diagnostics output

--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=11, micro=8, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='zw6f6', release='5.4.0-167-generic', version='#184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
WARNING: no installation among: ['tb-nightly', 'tensorboard', 'tensorflow-tensorboard']
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.16.2'

--- check: tensorflow_python_version
Traceback (most recent call last):
  File "/root/rl4net/examples/diagnose_tensorboard.py", line 511, in main
    suggestions.extend(check())
                       ^^^^^^^
  File "/root/rl4net/examples/diagnose_tensorboard.py", line 81, in wrapper
    result = fn()
             ^^^^
  File "/root/rl4net/examples/diagnose_tensorboard.py", line 267, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

--- check: tensorboard_data_server_version
INFO: data server binary: '/root/micromamba/envs/rl4net/lib/python3.11/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.0'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/root/micromamba/envs/rl4net/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'zw6f6'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=62270220, st_dev=2097307, st_nlink=2, st_uid=0, st_gid=0, st_size=4096, st_atime=1710727568, st_mtime=1710727551, st_ctime=1710727551)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/root/micromamba/envs/rl4net/lib/python3.11/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py @ file:///home/conda/feedstock_root/build_artifacts/absl-py_1705494584803/work
aim==3.18.1
aim-ui==3.18.1
aimrecords==0.0.7
aimrocks==0.4.0
aiofiles==23.2.1
aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1707669768135/work
aiosignal @ file:///home/conda/feedstock_root/build_artifacts/aiosignal_1667935791922/work
alembic==1.13.1
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1708355285029/work
ase @ file:///home/conda/feedstock_root/build_artifacts/ase_1638384343806/work
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1704011227531/work
base58==2.0.1
black @ file:///home/conda/feedstock_root/build_artifacts/black-recipe_1708248203050/work
blinker @ file:///home/conda/feedstock_root/build_artifacts/blinker_1698890160476/work
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1695989787169/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
cachetools==5.3.3
captum==0.7.0
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1707022139797/work/certifi
cffi @ file:///croot/cffi_1700254295673/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1698833585322/work
click @ file:///home/conda/feedstock_root/build_artifacts/click_1692311806742/work
cloudpickle==3.0.0
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1666700638685/work
contourpy @ file:///home/conda/feedstock_root/build_artifacts/contourpy_1699041375599/work
cryptography==42.0.5
cycler @ file:///home/conda/feedstock_root/build_artifacts/cycler_1696677705766/work
docutils @ file:///home/conda/feedstock_root/build_artifacts/docutils_1701882599793/work
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
fastapi==0.110.0
filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1698714947081/work
Flask @ file:///home/conda/feedstock_root/build_artifacts/flask_1707043907952/work
fonttools @ file:///home/conda/feedstock_root/build_artifacts/fonttools_1708049097969/work
frozenlist @ file:///home/conda/feedstock_root/build_artifacts/frozenlist_1702645450877/work
fsspec==2024.3.0
gmpy2 @ file:///home/conda/feedstock_root/build_artifacts/gmpy2_1666808665953/work
google-auth @ file:///opt/conda/conda-bld/google-auth_1646735974934/work
google-auth-oauthlib @ file:///work/ci_py311_2/google-auth-oauthlib_1679340681059/work
greenlet==3.0.3
grpcio @ file:///home/conda/feedstock_root/build_artifacts/grpc-split_1700258025969/work
h11 @ file:///home/conda/feedstock_root/build_artifacts/h11_1664132893548/work
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1702471424890/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1701026962277/work
imagecodecs @ file:///home/conda/feedstock_root/build_artifacts/imagecodecs_1704019718039/work
imageio @ file:///home/conda/feedstock_root/build_artifacts/imageio_1707730027807/work
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1703269254275/work
isodate @ file:///home/conda/feedstock_root/build_artifacts/isodate_1639582763789/work
itsdangerous @ file:///home/conda/feedstock_root/build_artifacts/itsdangerous_1648147185463/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1704966972576/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1691577114857/work
kiwisolver @ file:///home/conda/feedstock_root/build_artifacts/kiwisolver_1695379920604/work
lazy_loader @ file:///home/conda/feedstock_root/build_artifacts/lazy_loader_1692295373316/work
lightning-utilities @ file:///home/conda/feedstock_root/build_artifacts/lightning-utilities_1705619433111/work
llvmlite==0.42.0
Mako==1.3.2
marimo @ file:///home/conda/feedstock_root/build_artifacts/marimo_1710189166977/work
Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1704908347571/work
MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1706899926732/work
matplotlib @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-suite_1708026439111/work
mpmath @ file:///home/conda/feedstock_root/build_artifacts/mpmath_1678228039184/work
multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1707040702345/work
munkres==1.1.4
mypy-extensions @ file:///home/conda/feedstock_root/build_artifacts/mypy_extensions_1675543315189/work
networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1698504735452/work
numba @ file:///home/conda/feedstock_root/build_artifacts/numba_1707024788644/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1707225376651/work/dist/numpy-1.26.4-cp311-cp311-linux_x86_64.whl#sha256=d08e1c9e5833ae7780563812aa73e2497db1ee3bd5510d3becb8aa18aa2d0c7c
oauthlib @ file:///croot/oauthlib_1679489621486/work
opt-einsum @ file:///home/conda/feedstock_root/build_artifacts/opt_einsum_1696448916724/work
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
pandas @ file:///home/conda/feedstock_root/build_artifacts/pandas_1708708634263/work
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathspec @ file:///home/conda/feedstock_root/build_artifacts/pathspec_1702249949303/work
patsy @ file:///home/conda/feedstock_root/build_artifacts/patsy_1704469236901/work
pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1704252032614/work
pip==24.0
platformdirs @ file:///croot/platformdirs_1692205439124/work
protobuf==4.24.4
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1705722403006/work
pyasn1 @ file:///Users/ktietz/demo/mc3/conda-bld/pyasn1_1629708007385/work
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.6.3
pydantic_core==2.16.3
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700607939962/work
PyJWT @ file:///work/ci_py311/pyjwt_1676827385359/work
pymdown-extensions @ file:///home/conda/feedstock_root/build_artifacts/pymdown-extensions_1703982974286/work
pynndescent @ file:///home/conda/feedstock_root/build_artifacts/pynndescent_1700514549498/work
pyOpenSSL @ file:///croot/pyopenssl_1708380408460/work
pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1690737849915/work
pyrallis==0.3.1
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1706886791323/work
PyWavelets @ file:///home/conda/feedstock_root/build_artifacts/pywavelets_1695567566807/work
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1695373611984/work
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1701783162530/work
rdflib @ file:///home/conda/feedstock_root/build_artifacts/rdflib-split_1690986372614/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1684774241324/work
requests-oauthlib==1.3.0
RestrictedPython==7.0
-e git+egg=rl4net
rsa @ file:///tmp/build/80754af9/rsa_1614366226499/work
ruff @ file:///home/conda/feedstock_root/build_artifacts/ruff_1709955894551/work
scikit-image @ file:///home/conda/feedstock_root/build_artifacts/scikit-image_1697028611470/work/dist/scikit_image-0.22.0-cp311-cp311-linux_x86_64.whl#sha256=53d8b95f752df47007e9e71dd1c9805b9334e1e4791cf48e3762abb922636f04
scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1708073809211/work
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy-split_1706041487672/work/dist/scipy-1.12.0-cp311-cp311-linux_x86_64.whl#sha256=c4f0d8ecd4373069a033d0ee818c2fe5959c8828937fa46deb00a478190f703a
setuptools==69.1.1
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1708952932303/work
SQLAlchemy==2.0.28
starlette==0.36.3
statsmodels @ file:///home/conda/feedstock_root/build_artifacts/statsmodels_1702575375433/work
sympy @ file:///home/conda/feedstock_root/build_artifacts/sympy_1684180540116/work
tensorboard @ file:///home/conda/feedstock_root/build_artifacts/tensorboard_1708285739699/work/tensorboard-2.16.2-py3-none-any.whl#sha256=9f2b4e7dad86667615c0e5cd072f1ea8403fc032a299f0072d6f74855775cc45
tensorboard-data-server @ file:///home/conda/feedstock_root/build_artifacts/tensorboard-data-server_1695425375375/work/tensorboard_data_server-0.7.0-py3-none-manylinux2014_x86_64.whl#sha256=4a87e32f17958007f01c1acb90cf7aab5877e41b1a929e3a016020697c37b53d
tensorboardX @ file:///tmp/build/80754af9/tensorboardx_1621440489103/work
tensordict==0.3.1
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1707930541534/work
tifffile @ file:///home/conda/feedstock_root/build_artifacts/tifffile_1707824820518/work
tomlkit @ file:///home/conda/feedstock_root/build_artifacts/tomlkit_1709043728182/work
torch==2.2.1
torch-scatter @ file:///usr/share/miniconda/envs/test/conda-bld/pytorch-scatter_1706804494952/work
torch_geometric @ file:///home/conda/feedstock_root/build_artifacts/pytorch_geometric_1708619951869/work
torchmetrics @ file:///home/conda/feedstock_root/build_artifacts/torchmetrics_1701462872995/work
torchrl==0.3.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1708363099148/work
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1707598593068/work
trimesh @ file:///home/conda/feedstock_root/build_artifacts/trimesh_1709252138892/work
triton==2.2.0
typing-inspect==0.9.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1708904622550/work
tzdata @ file:///home/conda/feedstock_root/build_artifacts/python-tzdata_1707747584337/work
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1708239446578/work
uvicorn @ file:///home/conda/feedstock_root/build_artifacts/uvicorn-split_1707597428881/work
websockets @ file:///home/conda/feedstock_root/build_artifacts/websockets_1697914680106/work
Werkzeug @ file:///home/conda/feedstock_root/build_artifacts/werkzeug_1698235201373/work
wheel==0.42.0
yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1705508295175/work
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work

For browser-related issues, please additionally specify:

Browser type and version (e.g., Chrome 64.0.3282.140): Microsoft Edge 122.0.2365.92
Screenshot, if it’s a visual issue:

Issue description

There is a significant gap in the sampled data points when viewing scalars in TensorBoard.
Using EventAccumulator, I verified that the data file is complete. Passing --samples_per_plugin=scalars=10000 also works around the problem, but loading is slow.
Data file: events.out.tfevents.zip
Steps to reproduce: open the tfevents file with TensorBoard.
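
For anyone who wants to repeat the completeness check, here is a minimal sketch of how it can be done with EventAccumulator. The helper names `largest_step_gap` and `load_scalar_steps` are made up for illustration, and the log directory in the usage comment is a placeholder; only the tag `env/delay` comes from this report.

```python
def largest_step_gap(steps):
    """Return (gap, start, end) for the widest gap between consecutive steps."""
    ordered = sorted(steps)
    best = (0, None, None)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > best[0]:
            best = (cur - prev, prev, cur)
    return best


def load_scalar_steps(logdir, tag):
    """Read every step logged for `tag`, bypassing the reservoir limit."""
    # Imported lazily so the gap helper above works without TensorBoard.
    from tensorboard.backend.event_processing.event_accumulator import (
        EventAccumulator,
    )

    # A size guidance of 0 asks the accumulator to keep all events.
    acc = EventAccumulator(logdir, size_guidance={"scalars": 0})
    acc.Reload()
    return [event.step for event in acc.Scalars(tag)]


# Hypothetical usage; the path is a placeholder:
# steps = load_scalar_steps("runs/exp1", "env/delay")
# print(len(steps), largest_step_gap(steps))
```

If the widest gap in the raw file is small while the chart shows a long flat stretch, the gap is introduced by sampling rather than by the logged data.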


arcra · 2024-03-29T18:51:54Z

Correct, data is sampled, and the behavior can be overridden by the flag that you mentioned. This is working as intended.

The sampling algorithm has a few attributes that influenced the design choice:

While the data is still being written, it is not known how big the population will be.
Generally, users are interested in seeing the last logged value.
The implementation is deterministic, so you would always see the same results.

Due to this, we use a reservoir sampling implementation that keeps the last value. You can find it here. Unfortunately, as the population size grows larger than the sample size, it is likely that the algorithm will just keep replacing the latest read value.

It was interesting to think about this. I came up with an implementation that attempts to keep a fairer, more representative sample, with a trade-off in memory usage. I think we would need to put more thought into this before submitting the change to the actual implementation, but you're welcome to fork our repo and change the implementation in the meantime, if you'd like.

Changing the code to something like this:

    def AddItem(self, item, f=lambda x: x):
        """Add an item to the ReservoirBucket, replacing an old item if
        necessary.

        If the bucket has reached capacity, then an old item will be replaced
        with probability (_max_size/_num_items_seen).

        It is expected that the "add" operations will be far more frequent than
        the "read" operations. Therefore, the list keeps track of insertion
        order as a tuple (index, val), the replacement is done in-place at a
        random position in the items list; and when the elements are read via
        the Items() method, the list is sorted by insertion order using the
        first value in the tuple.

        This means, insertion is O(1) (at the cost of using more memory, but
        still O(k)). Reading the items should be O(n*log(n)).

        Args:
          item: The item to add to the bucket.
          f: A function to transform item before addition, if it will be kept in
            the reservoir.
        """
        with self._mutex:
            self._num_items_seen += 1
            # The count of num_items_seen serves as an index for data read, so
            # we can insert efficiently and only return the data in the order it
            # was read when the items are read.
            new_item = (self._num_items_seen, f(item))
            self._latest_seen = new_item
            if self._items_len < self._max_size or self._max_size == 0:
                self.items.append(new_item)
                self._items_len += 1
            else:
                # Attempts to make the sampling unlikely to entirely replace the
                # previously seen values. As the population grows larger, it
                # becomes less likely that a value will be replaced.
                sample_ratio = self._max_size / float(self._num_items_seen)
                if self._random.random() < sample_ratio:
                    r = self._random.randint(0, self._max_size - 1)
                    # replace item without sorting, for efficient writing.
                    self.items[r] = new_item

    def Items(self):
        """Get all the items in the bucket.

        If self.always_keep_last is true, it will replace the last element in
        the sample with the last element seen.

        Calling this method has O(n*log(n)) runtime complexity, but reads are
        less frequent than writes, which are O(1) with this implementation, and
        it keeps a somewhat more representative sample.

        Perhaps some optimizations can be done to avoid recalculating the list
        when nothing has changed.
        """
        with self._mutex:
            sorted_list = sorted(self.items, key=lambda x: x[0])
            if self.always_keep_last and sorted_list:
                sorted_list[-1] = self._latest_seen
            return [x[1] for x in sorted_list]
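
The two methods above depend on attributes defined elsewhere in the class, so they do not run on their own. Below is a minimal sketch of a stand-in class for trying the idea in isolation. The class name `SketchBucket`, its constructor, and the fixed seed are assumptions made for this illustration and are not part of TensorBoard.

```python
import random
import threading


class SketchBucket:
    """Stand-in bucket (hypothetical name) for trying the proposal in isolation."""

    def __init__(self, max_size, seed=0, always_keep_last=True):
        self._max_size = max_size
        self._random = random.Random(seed)  # fixed seed keeps runs repeatable
        self._mutex = threading.Lock()
        self.always_keep_last = always_keep_last
        self.items = []  # (insertion index, value) pairs
        self._num_items_seen = 0
        self._latest_seen = None

    def AddItem(self, item, f=lambda x: x):
        with self._mutex:
            self._num_items_seen += 1
            new_item = (self._num_items_seen, f(item))
            self._latest_seen = new_item
            if len(self.items) < self._max_size or self._max_size == 0:
                self.items.append(new_item)
            elif self._random.random() < self._max_size / self._num_items_seen:
                # Overwrite a uniformly chosen slot; ordering is restored on read.
                self.items[self._random.randrange(self._max_size)] = new_item

    def Items(self):
        with self._mutex:
            ordered = sorted(self.items, key=lambda pair: pair[0])
            if self.always_keep_last and ordered:
                ordered[-1] = self._latest_seen
            return [value for _, value in ordered]


bucket = SketchBucket(max_size=100)
for step in range(10_000):
    bucket.AddItem(step)
sample = bucket.Items()
gaps = [b - a for a, b in zip(sample, sample[1:])]
print(len(sample), sample[-1], max(gaps))
```

The printed values are the sample size, the last kept step, and the widest gap between kept steps, which gives a quick way to judge how evenly the sample is spread.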

To compare, the view with the --samples_per_plugin=scalars=20000 looks like this:

And the sampled view with this implementation (still using the default sample size) looks like this:

Having said that, here are a few notes to consider:

This implementation only takes effect with the flag --load_fast=false. By default, fast loading is used whenever the tensorboard-data-server package is also installed; whenever load_fast is enabled, the Rust code is used rather than the Python code.
I suppose we could also look at changing the implementation in the Rust code, but we don't have the bandwidth for that at the moment.
This might be the reason why changing the sampling was slow for you. If you don't have that package installed, installing it and then using the sampling flag may not be as slow, and it's a simpler alternative. You can learn a bit about this here (although this guide is for development; generally, if you install that package, loading should be faster by default in many cases).
This implementation hasn't been tested much, nor analyzed for broader use cases.

way-zer · 2024-03-30T04:53:56Z

First of all, thank you for your detailed explanation of sampling.

I would like to add some information.

The issue described above occurs with the Rust implementation (load_fast=true).
With load_fast=false and without samples_per_plugin, sampling works and the sampling interval is almost uniform.
By "slow" I meant viewing the data, that is, loading it in the frontend.
With fewer than 20 experiments, loading the scalar env/delay with --samples_per_plugin=scalars=10000 takes more than 10 seconds.
The main question is why the Rust implementation produces such a long gap in the sampled data.

Judging from this Python code, it is unlikely that the algorithm would keep replacing only the last value and cause such a long gap.

tensorboard/tensorboard/backend/event_processing/reservoir.py

Lines 223 to 226 in cf27fe0

    
           r = self._random.randint(0, self._num_items_seen)
           if r < self._max_size:
               self.items.pop(r)
               self.items.append(f(item))
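
To back this up, here is a small simulation sketch of the replacement logic quoted above. It is a simplified re-implementation, not the TensorBoard code itself: the function name `simulate_python_reservoir` is made up, and the fill phase and the keep-last branch are assumptions based on my reading of the code around the quoted lines.

```python
import random


def simulate_python_reservoir(num_items, max_size, seed=0, always_keep_last=True):
    """Feed steps 0..num_items-1 through a simplified reservoir bucket."""
    rng = random.Random(seed)
    items = []
    for step in range(num_items):
        # `step` also equals the number of items seen before this one.
        if len(items) < max_size or max_size == 0:
            items.append(step)
        else:
            r = rng.randint(0, step)
            if r < max_size:
                # Evict a random kept item and append the new one at the end.
                items.pop(r)
                items.append(step)
            elif always_keep_last:
                # Assumed branch: overwrite the tail so the newest value shows.
                items[-1] = step
    return items


kept = simulate_python_reservoir(num_items=50_000, max_size=500)
gaps = [b - a for a, b in zip(kept, kept[1:])]
print(len(kept), kept[-1], max(gaps))
```

In this sketch the kept steps stay in insertion order and the widest gap stays small compared with the total number of steps, which is consistent with the almost uniform interval I see with load_fast=false.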

arcra · 2024-04-02T01:08:23Z

Ah, you are correct! The Python algorithm should work. It is the Rust implementation that has the issue.

I didn't think of the Rust implementation at first, and then I guess I was trying to fit an explanation of what happened to the Python code that I was looking at.

Anyway, I'll reopen this issue and rename it to emphasize that the issue is with the Rust implementation. Honestly, though, we haven't touched that code in a while and the people who wrote it are no longer working with the team, so it's unlikely that we will pick this up any time soon.

arcra closed this as completed Mar 29, 2024

arcra reopened this Apr 2, 2024

arcra changed the title ~~Data point discontinuous in sampling~~ Data point discontinuous in sampling with --load_fast (aka Rust board) implementation Apr 2, 2024

arcra added the core:rustboard //tensorboard/data/server/... label Apr 2, 2024
