Set rm_files to be a synchronous method #503
Conversation
adlfs/spec.py
Outdated
self.invalidate_cache(self._parent(file))
- sync_wrapper(_rm_files)
+ rm_files = sync_wrapper(_rm_files)
Ah, this is something that I missed as well when we went over the original GitHub feature request... Based on the fsspec specification, there is no shared `rm_files()` interface; there is only `rm_file()`. The original GitHub issue, #497, is also only requesting `rm_file()`.

So it is not appropriate to set a sync wrapper like this, because `rm_files` does not appear to be a shared interface across file systems. Instead, it would probably make sense to add an async `_rm_file` that is a simplified wrapper over the `_rm` implementation to implement the feature request.
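A rough sketch of that direction (hedged; the exact `_rm` signature in adlfs may differ, so treat this as a shape, not final code):

```python
async def _rm_file(self, path: str, **kwargs):
    """Delete a single file (blob); a thin wrapper over the bulk _rm."""
    # Delegate to the existing bulk implementation for exactly one path,
    # with no recursion.
    await self._rm(path, recursive=False, **kwargs)
```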
Even more interesting, it seems there used to be a `_rm_file()` implementation prior to PR #383, and because the underlying `AsyncFileSystem` mirrors methods, I suspect that adlfs may actually have supported `rm_file()` at one point, which would make this a regression. It would be great to confirm whether adlfs ever supported `rm_file()` in a version prior to that PR.
This was the previous implementation of `rm_file` I found, but it looks like it was only ever used by `rm` and was not callable.
(Force-pushed from 4af870c to 0c62e00)
It's looking better. Just had a couple of follow-up suggestions on the direction we take this.
adlfs/spec.py
Outdated
if p != "":
    await self._rm_files(container_name, [p.rstrip(delimiter)])
else:
    await self._rmdir(container_name)
So, looking more at `rm_file`, it seems its sole purpose is to delete a single file; it does not support deleting a directory. This appears to be the intention based on other implementations' versions of `rm_file` as well:

- `s3fs` - only deletes a single object, which would be a blob in adlfs.
- `local` - uses `os.remove()`, which only handles removing files, not directories.

I think it would make sense to stick with this contract to be consistent, especially since `rm_file` was never actually exposed publicly.
adlfs/spec.py
Outdated
sync_wrapper(_rm_files)

async def _rm_file(
    self, path: typing.Union[str, typing.List[str]], delimiter: str = "/", **kwargs
And on a similar note, I'm not sure we should expose a `delimiter` parameter on this method: there is no recursive behavior in this public contract, so we can always just use `/` where we need splitting logic. Not supporting a `delimiter` also seems consistent with the other implementations I linked.
It's looking better. Just had some more suggestions on how to approach this.
adlfs/spec.py
Outdated
except Exception as e:
    raise RuntimeError("Failed to remove %s for %s", path, e) from e

rm_file = sync_wrapper(_rm_file)
Looking into fsspec core more, I don't think we need to explicitly set a sync wrapper here. Specifically, there is some dynamic method setting done here that takes the private `_rm_file` and converts it to a public synchronous `rm_file`. So we should actually be able to remove this line.
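A minimal standalone illustration of that mirroring behavior (a sketch, not adlfs code; `DemoFS` is a hypothetical class used only to demonstrate fsspec's `AsyncFileSystem`):

```python
import fsspec.asyn

class DemoFS(fsspec.asyn.AsyncFileSystem):
    # Defining only the private async method...
    async def _rm_file(self, path, **kwargs):
        print(f"would delete {path}")

fs = DemoFS()
# ...is enough: on instantiation, fsspec mirrors async `_rm_file`
# into a public synchronous `rm_file`.
fs.rm_file("container/blob.bin")  # prints "would delete container/blob.bin"
```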
adlfs/spec.py
Outdated
pass
except Exception as e:
    raise RuntimeError("Failed to remove %s for %s", path, e) from e
It looks like we should also invalidate the `self.dircache`, similar to what adlfs does here. Specifically, a cache is maintained for previously retrieved blobs, so if we do not invalidate it, we can get unexpected behavior where adlfs returns the cached info even though the blob is deleted. For example:
fs = get_fs()
upload(fs)
# ls method populates the cache
fs.ls(f"{CONTAINER_NAME}/")
fs.rm_file(f"{CONTAINER_NAME}/small.bin")
# Even though the blob is deleted, the cached version will be returned, which is not correct
print("Still cached:", fs.info(f"{CONTAINER_NAME}/small.bin"))
We should make sure to add a test or assertion confirming that the cache is invalidated.
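A rough shape for such a test (a sketch; the `fs` fixture, `upload` helper, and `CONTAINER_NAME` constant are assumptions carried over from the example above):

```python
import pytest

def test_rm_file_invalidates_cache(fs):
    upload(fs)                          # hypothetical helper from the example
    path = f"{CONTAINER_NAME}/small.bin"
    fs.ls(f"{CONTAINER_NAME}/")         # populates the dircache
    fs.rm_file(path)
    with pytest.raises(FileNotFoundError):
        fs.info(path)                   # must not be served from a stale cache
```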
adlfs/spec.py
Outdated
except ResourceNotFoundError:
    pass
except FileNotFoundError:
    pass
except Exception as e:
    raise RuntimeError("Failed to remove %s for %s", path, e) from e
For the `rm_file()` method, I don't think we should be swallowing exceptions: if a single requested file deletion fails, the expectation is that the error is propagated. This matches other fsspec implementations (e.g. `local` and `s3fs`), which raise any exceptions encountered.

I realize this is just copying what `_rm()` does, but `_rm()` is designed for bulk deletions, so it may be omitting errors to avoid failing fast and to generally present an interface that does not throw for individual parts of a bulk delete.
adlfs/spec.py
Outdated
path: str
    File to delete.
"""
container_name, p, _ = self.split_path(path)
@martindurant, question for you when you have the chance...

When using a `version_aware` file system, should `rm_file` delete versioned objects when the version is specified in the path (e.g. `data/root/a/file.txt?versionid=<some-version-id>`)? Azure Blob is similar to S3 in that, to delete a version of a blob, a version id must be provided to explicitly delete that version snapshot. It did not look like `s3fs` does this in its `rm_file` implementation, but I figured I'd ask.
@anjaliratnam-msft I'm thinking we go ahead and support version ids that are supplied as part of the path for `rm_file`. Specifically, I'm leaning this way because:

- It is straightforward to plumb in. We just need to save the `version_id` from `split_path` and provide it to `delete_blob`, similar to how it's done in `get_file()` (as sketched below).
- The version id is retrieved from the path for many of the operations that act on a single path (e.g. `info()`, `open()`).
- If we don't support it, there is no real way to delete specific versions, and users would have to fall back to the SDK directly instead of the adlfs interfaces.

We should also make sure we add a test for deleting versioned blobs.
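A rough sketch of that plumbing (hedged; the three-element `split_path` return follows the diff above, but the exact client usage is an assumption):

```python
async def _rm_file(self, path: str, **kwargs):
    # Keep the version id from split_path instead of discarding it.
    container_name, p, version_id = self.split_path(path)
    cc = self.service_client.get_container_client(container=container_name)
    # azure-storage-blob's delete_blob accepts version_id as a keyword;
    # when it is None, the current version of the blob is deleted.
    await cc.delete_blob(p, version_id=version_id)
    self.invalidate_cache(self._parent(path))
```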
adlfs/tests/test_spec.py
Outdated
path = "data/top_file.txt"

fs.rm_file(path)
For the happy case, we should consider explicitly uploading the file to ensure that it existed at some point. Otherwise, it is difficult to tell whether we are actually deleting the file in the first place, especially since we are currently swallowing file-not-found exceptions.
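Something along these lines (a sketch; the `fs` fixture and the upload pattern mirror the existing tests in this PR):

```python
def test_rm_file(fs):
    path = "data/top_file.txt"
    # Upload first so a successful delete is distinguishable from
    # "the file never existed".
    with fs.open(path, "wb") as f:
        f.write(b"test content")
    assert fs.exists(path)
    fs.rm_file(path)
    assert not fs.exists(path)
```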
(Force-pushed from dd18564 to b8d09f3)
I think we are close. Just a few more comments.
adlfs/spec.py
Outdated
except Exception as e:
    raise RuntimeError("Failed to remove %s for %s", path, e) from e
I'm not sure we should be throwing `RuntimeError`s for every exception. Mainly, it is a fairly general exception that is hard to catch and act on programmatically. Instead, we should only remap exceptions to known built-in exceptions that we want to handle.

Looking through the adlfs codebase, a common pattern is to map `ResourceNotFoundError` to the builtin `FileNotFoundError`, while all other possible Azure client errors are left untouched. I think it makes sense to follow this pattern and reraise as `FileNotFoundError`, similar to here.

We should also make sure to add a test that it reraises as `FileNotFoundError` when the blob does not exist.
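A sketch of that remapping (hedged; only the `ResourceNotFoundError` branch is remapped, and the surrounding method body is an assumption based on the diff below):

```python
from azure.core.exceptions import ResourceNotFoundError

async def _rm_file(self, path: str, **kwargs):
    container_name, p, _ = self.split_path(path)
    cc = self.service_client.get_container_client(container=container_name)
    try:
        await cc.delete_blob(p)
    except ResourceNotFoundError as e:
        # Follow the adlfs convention of surfacing missing blobs as the
        # builtin FileNotFoundError; other Azure errors propagate as-is.
        raise FileNotFoundError(path) from e
    self.invalidate_cache(self._parent(path))
```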
    await cc.delete_blob(p)
except Exception as e:
    raise RuntimeError("Failed to remove %s for %s", path, e) from e
self.invalidate_cache(self._parent(path))
Poking more through the adlfs codebase, I'm thinking we will also need to invalidate the `path` itself. While some of the other methods don't seem to handle this (arguably something to circle back to eventually), we can otherwise run into inconsistencies where the path still resides in the dircache. For example, if `ls()` is run on the file, it still shows up as present:
fs = get_fs()
upload(fs)
print(fs.ls(f"{CONTAINER_NAME}/small.bin"))
fs.rm_file(f"{CONTAINER_NAME}/small.bin")
print("Still cached:", fs.ls(f"{CONTAINER_NAME}/small.bin")) # Should throw FileNotFoundError
We should make sure to add a test case for this as well.
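A test along the lines of the example above (a sketch; the `fs` fixture, `upload` helper, and `CONTAINER_NAME` constant are assumptions):

```python
import pytest

def test_rm_file_invalidates_own_path(fs):
    upload(fs)                           # hypothetical helper from above
    path = f"{CONTAINER_NAME}/small.bin"
    fs.ls(path)                          # caches the entry itself
    fs.rm_file(path)
    with pytest.raises(FileNotFoundError):
        fs.ls(path)                      # a stale cache must not answer
```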
Looks good! Just had a couple of small comments on the tests and we should be set.
assert fs.exists(path)
fs.rm_file(path)
with pytest.raises(FileNotFoundError):
Let's also add a separate test case where we call `rm_file()` on a blob that does not exist. Mainly, we want to assert that we throw that new `FileNotFoundError`.
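For example (a sketch; the test name and path are hypothetical):

```python
import pytest

def test_rm_file_does_not_exist(fs):
    with pytest.raises(FileNotFoundError):
        fs.rm_file("data/does_not_exist.txt")
```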
adlfs/tests/test_spec.py
Outdated
path = "data/test_file.txt?versionid=latest" | ||
with fs.open(path, "wb") as f: | ||
f.write(b"test content") |
So, in general, version IDs are set by the service and are not something we can define upfront. Furthermore, it looks like versioning is not supported by Azurite: Azure/Azurite#665

For testing that we respect versioning, I'm thinking we can follow the patterns of other test cases like this one. Specifically, we can patch the underlying `delete_blob` method and just make sure it is passed the expected version id; that should suffice (we don't need to check whether the blob exists afterwards, etc., since we already cover that in the `test_rm_file()` case). We should also be able to use the `DEFAULT_VERSION_ID` constant from that test.
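Roughly (a sketch; the patch target and call-argument access are assumptions to be adapted to the existing versioning tests, and this assumes a `version_aware=True` `fs` fixture):

```python
from unittest import mock

def test_rm_file_version_id(fs):
    path = f"data/file.txt?versionid={DEFAULT_VERSION_ID}"
    # Patch the async delete_blob so no real request is made; mock.patch
    # substitutes an AsyncMock for async methods on Python 3.8+.
    with mock.patch(
        "azure.storage.blob.aio.ContainerClient.delete_blob"
    ) as delete_blob:
        fs.rm_file(path)
    delete_blob.assert_called_once()
    # The version id parsed from the path should be forwarded.
    assert delete_blob.call_args.kwargs["version_id"] == DEFAULT_VERSION_ID
```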
adlfs/tests/test_spec.py
Outdated
with fs.open(path, "wb") as f:
    f.write(b"test content")

assert fs.exists(path)
Given we are patching the `delete_blob()` call, we should be able to remove the upload and exists logic, since we are not actually deleting a blob living in Azurite; that will also help shorten the test body.
Looks good! 🚢
@martindurant this should be ready for you to take a look.
This addresses the GitHub issue where `rm_files` is not implemented. It was a simple fix: `sync_wrapper(_rm_files)` just needed to be assigned to `rm_files`. Tests were also added to make sure it works as expected.