Conversation

@bolkedebruin (Author)

OneLake uses (yet) another protocol. This adds support for that protocol through onelake:// but also abfss://...@onelake.dfs.fabric... URLs. Backwards compatibility is there so that traditional URLs are routed to AzureBlobFileSystem.

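For illustration, a minimal sketch of the intended usage under this PR; the workspace, lakehouse, and file names are made up:

```python
import fsspec

# Hypothetical usage under this PR. The onelake:// form and the abfss:// form
# that Fabric hands out should both resolve to the OneLake filesystem, while
# abfss:// URLs pointing at *.dfs.core.windows.net keep routing to
# AzureBlobFileSystem as before.
with fsspec.open(
    "onelake://MyWorkspace/MyLakehouse.Lakehouse/Files/data.csv", "rb"
) as f:
    head = f.read(100)

with fsspec.open(
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/data.csv",
    "rb",
) as f:
    head = f.read(100)
```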
@martindurant (Member)

Could we please add some kind of documentation clarifying what this package now supports, and the relationship between the different types of storage? I, for one, am totally confused, but I'm not well versed on the MS/azure ecosystem.

@bolkedebruin (Author) commented Sep 12, 2025

Sure, I understand that (and honestly I think it's a mess). For reference:

ADLS Gen 1

  • Hierarchical file system, POSIX-compliant
  • To be retired

WebHDFS REST API: https://<account>.azuredatalakestore.net
Data Lake Store filesystem: adl://<account>.azuredatalakestore.net

ADLS Gen 2

  • Based on Blob storage
  • Current
  • Supports multiple endpoints

Blob Service endpoints:

Primary: https://<account>.blob.core.windows.net
Secondary (if geo-redundant): https://<account>-secondary.blob.core.windows.net

Data Lake Storage endpoints:

Primary: https://<account>.dfs.core.windows.net
Secondary: https://<account>-secondary.dfs.core.windows.net

The .dfs.core.windows.net endpoint is specifically for the hierarchical namespace and Data Lake APIs, while .blob.core.windows.net is for traditional blob operations.

OneLake

This is Microsoft's newest approach to data storage, part of the Microsoft Fabric platform:

  • Unified data lake that automatically comes with every Fabric tenant
  • Delta Lake format as the standard, providing ACID transactions

OneLake endpoints:

Primary: https://onelake.dfs.fabric.microsoft.com
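To make the mapping concrete, here is a rough sketch of how these endpoints surface through fsspec/adlfs; treat the parameters as assumptions, in particular account_host, which is the override discussed in issue #486:

```python
import fsspec

# ADLS Gen 1 (legacy): the adl:// protocol, backed by the
# azuredatalakestore.net account.
fs_gen1 = fsspec.filesystem(
    "adl", store_name="<account>",
    tenant_id="...", client_id="...", client_secret="...",
)

# ADLS Gen 2: abfs://<container>@<account>.dfs.core.windows.net/<path>;
# adlfs serves this via AzureBlobFileSystem, which talks to the blob endpoint.
fs_gen2 = fsspec.filesystem("abfs", account_name="<account>", anon=False)

# OneLake: the same abfss:// URL shape, but with the fixed OneLake host
# standing in for the storage account (account_host is an assumption here,
# based on the workaround discussed in issue #486).
fs_onelake = fsspec.filesystem(
    "abfs",
    account_name="onelake",
    account_host="onelake.dfs.fabric.microsoft.com",
)
```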

@martindurant (Member)

We could do with this info on the docs.

Also: what happened with onedrive (i.e., the one that people have on their personal machines for syncing)? I think one of the conversations in here said that should be possible too.

@bolkedebruin (Author)

will do

@bolkedebruin (Author) commented Sep 12, 2025

> We could do with this info on the docs.
>
> Also: what happened with onedrive (i.e., the one that people have on their personal machines for syncing)? I think one of the conversations in here said that should be possible too.

OneDrive / Teams / SharePoint are supported by msgraph(fs). Again, a completely different protocol.

@martindurant (Member)

Ah, this one. Sorry, my memory...

@bolkedebruin (Author)

No worries. The MS ecosystem for this is a mess.

@kyleknap (Collaborator)

Hi @bolkedebruin. Thanks for diving into this! One question I have: do we need an entirely new fsspec filesystem for this? Does supporting OneLake just through AzureBlobFileSystem suffice? From the conversation on GitHub issue #486, it seems like OneLake works with just a bit of extra configuration on top of the current AzureBlobFileSystem.

In general, I'd prefer we hold off adding any new protocols and filesystems to adlfs as:

  1. I agree with both your and @martindurant's sentiment that it is hard to keep track of the different storage options, so adding a new protocol/filesystem would likely add to the problem.
  2. The recommendation from the OneLake team is to use the current Azure Storage SDKs and Tools to access it: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api. So, supporting OneLake through the existing abfs:// protocol and AzureBlobFileSystem interfaces would be consistent with this recommendation. That being said, if OneLake came out with their own SDKs and own official onelake:// protocol, we should reconsider this stance.

I'm wondering if instead of adding a new protocol, we could:

  1. Document in the readme how to connect to OneLake using the current abfs:// protocol and filesystem
  2. Explore how to automatically handle OneLake URIs (e.g., abfs://workspace@onelake.dfs.fabric.microsoft.com/lakehouse/file) so they don't require all of the additional configuration (a sketch follows below).

I like that these would be smaller changes that would still improve the end experience but not prevent us from building out an official onelake:// protocol and filesystem in the future if needed. Thoughts?
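For concreteness, a sketch of what point 2 would save users from typing today; the account_host keyword and the names are assumptions based on the workaround in issue #486:

```python
import fsspec
from azure.identity import DefaultAzureCredential

# Today (sketch): the OneLake host must be spelled out as extra configuration
# alongside the path; workspace and lakehouse names are made up.
of = fsspec.open(
    "abfs://MyWorkspace/MyLakehouse.Lakehouse/Files/data.csv",
    account_name="onelake",
    account_host="onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)

# The proposal: derive that configuration from a full URI such as
# abfs://workspace@onelake.dfs.fabric.microsoft.com/lakehouse/file
# so the keyword arguments above become unnecessary.
```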

@bolkedebruin (Author) commented Sep 12, 2025

> One question I have: do we need an entirely new fsspec filesystem for this? Does supporting OneLake just through AzureBlobFileSystem suffice? [...] I like that these would be smaller changes that would still improve the end experience but not prevent us from building out an official onelake:// protocol and filesystem in the future if needed. Thoughts?

Unfortunately, OneLake doesn't fully work with the existing AzureBlobFileSystem despite initial appearances. While basic operations may seem to work, several core filesystem operations like iterdir and ls fail with "endpoint does not support this operation" errors. This is a fundamental limitation (afaik) that makes the existing AzureBlobFileSystem insufficient for OneLake support.

The root issue is that, according to Microsoft's documentation, OneLake implements a subset of the Azure Data Lake Storage Gen2 API, not the full Blob Storage API that AzureBlobFileSystem relies on.

Why a separate filesystem is necessary:

  • API Compatibility: OneLake requires the newer Azure Storage SDK features that handle these API differences gracefully.
  • Future-proofing: As OneLake evolves, having a dedicated implementation allows us to adapt to OneLake-specific features and limitations; OneLake seems more file-oriented.

While I've used onelake:// for clearer distinction, the official protocol is, I think, abfss:// with the OneLake-specific endpoint (https://onelake.dfs.fabric.microsoft.com) - at least, that is what Fabric gives back.

But as you seemingly have a Microsoftie active here, @anjaliratnam-msft might want to chime in. Oh, and I see you are one yourself as well :-P

BTW: I am happy to be proven wrong. One less thing to maintain!
P.S. I just used a different endpoint (onelake.blob.fabric.microsoft.com) and that seems to give better results. So maybe we are lucky :-)
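If anyone wants to reproduce the comparison, a sketch of the two configurations I tried; the account_host keyword and the workspace name are assumptions:

```python
from adlfs import AzureBlobFileSystem
from azure.identity import DefaultAzureCredential

cred = DefaultAzureCredential()

# Against the dfs host, listing operations (ls, iterdir) fail with
# "endpoint does not support this operation" in my testing.
fs_dfs = AzureBlobFileSystem(
    account_name="onelake",
    account_host="onelake.dfs.fabric.microsoft.com",
    credential=cred,
)

# The blob-style host appears to behave better.
fs_blob = AzureBlobFileSystem(
    account_name="onelake",
    account_host="onelake.blob.fabric.microsoft.com",
    credential=cred,
)
fs_blob.ls("MyWorkspace")  # workspace name is made up
```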

@kyleknap (Collaborator)

@bolkedebruin thanks for adding more details. I have not had the chance yet to try OneLake against adlfs, so I'll probably spend some cycles trying it out to understand what the current gaps are and weigh the options.

In general, the ADLS Gen 2 SDK is built on top of the Blob SDK and uses it for many of its method calls, so the compatibility gap should be limited to methods in the ADLS Gen 2 SDK that do not simply call out to the Blob SDK. One thing I'd like to explore is whether we can call out to the Data Lake-specific APIs only when needed, while still working under AzureBlobFileSystem. If we can get a pattern like this working, there would be advantages beyond just getting OneLake working: today, even when an ADLS Gen 2 endpoint is provided, adlfs always uses the blob endpoint and cannot take advantage of any of the ADLS Gen 2 APIs (e.g., recursive deletes, renames).
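A rough sketch of that hybrid pattern (not adlfs's current implementation; account and paths are placeholders): keep the blob client for most operations, but call the Data Lake SDK where the blob endpoint has no native equivalent.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Reach for the ADLS Gen 2 (dfs) endpoint only for operations the blob
# endpoint cannot do in one call: server-side rename and recursive delete.
dls = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs_client = dls.get_file_system_client("my-container")

# Atomic server-side rename: one API call instead of copy-and-delete per blob.
fs_client.get_directory_client("old/path").rename_directory(
    new_name="my-container/new/path"  # new_name takes "<filesystem>/<path>"
)

# Recursive delete of a whole directory tree in a single request.
fs_client.get_directory_client("tmp/scratch").delete_directory()
```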
