feat: add onelake support #513
Could we please add some kind of documentation clarifying what this package now supports, and the relationship between the different types of storage? I, for one, am totally confused, but I'm not well versed in the MS/Azure ecosystem. |
Sure, I understand that (and honestly I think it's a mess). For reference:
ADLS Gen 1
WebHDFS REST API:
ADLS Gen 2
Blob Service endpoints: Primary:
Data Lake Storage endpoints: Primary:
The .dfs.core.windows.net endpoint is specifically for the hierarchical namespace and Data Lake APIs, while .blob.core.windows.net is for traditional blob operations.
OneLake
This is Microsoft's newest approach to data storage, part of the Microsoft Fabric platform:
OneLake endpoints: Primary: |
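To make the endpoint distinction above concrete, here is a minimal sketch that labels a URL by its host suffix. It only covers the hosts mentioned in this thread (.blob.core.windows.net, .dfs.core.windows.net, and onelake.dfs.fabric.microsoft.com); the function name `classify_endpoint` and the returned labels are hypothetical and not part of this package.

```python
from urllib.parse import urlsplit

# Host suffixes mentioned in this conversation. The labels on the right are
# illustrative names chosen for this sketch, not identifiers from any library.
_ENDPOINT_KINDS = (
    ("onelake.dfs.fabric.microsoft.com", "onelake"),
    (".dfs.core.windows.net", "adls-gen2-dfs"),
    (".blob.core.windows.net", "blob"),
)


def classify_endpoint(url: str) -> str:
    """Return an illustrative label for the storage endpoint behind `url`."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix, kind in _ENDPOINT_KINDS:
        if host.endswith(suffix):
            return kind
    return "unknown"
```

For example, `classify_endpoint("https://myaccount.dfs.core.windows.net/container")` falls into the Data Lake branch, while the same account name on `.blob.core.windows.net` falls into the blob branch.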
We could do with this info in the docs. Also: what happened with OneDrive (i.e., the one that people have on their personal machines for syncing)? I think one of the conversations in here said that should be possible too. |
will do |
OneDrive / Teams / SharePoint are supported by msgraph(fs). Again, a completely different protocol. |
Ah, this one. Sorry, my memory... |
No worries. The MS ecosystem for this is a mess. |
Hi @bolkedebruin. Thanks for diving into this! One question I have: do we need an entirely new fsspec filesystem for this? Does supporting OneLake just through In general, I'd prefer we hold off adding any new protocols and filesystems to
I'm wondering if instead of adding a new protocol, we could:
I like that these would be smaller changes that would still improve the end experience but not prevent us from building out an official |
Unfortunately, OneLake doesn't fully work with the existing AzureBlobFileSystem despite initial appearances. While basic operations may seem to work, several core filesystem operations such as iterdir and ls fail with "endpoint does not support this operation" errors. As far as I know, this is a fundamental limitation that makes AzureBlobFileSystem insufficient for OneLake support. The root issue is that OneLake implements a subset of the Azure Data Lake Storage Gen2 API, not the full Blob Storage API that AzureBlobFileSystem relies on. According to Microsoft's documentation:
Why a separate filesystem is necessary:
While I've used onelake:// for a clearer distinction, I think the official protocol (this is what Fabric gives back) is abfss:// with the OneLake-specific endpoint (https://onelake.dfs.fabric.microsoft.com). But as you seemingly have a Microsoftie active here... @anjaliratnam-msft might want to chime in. Oh, and I see you are one yourself as well :-P BTW: I am happy to be proven wrong. One less thing to maintain! |
@bolkedebruin thanks for adding more details. I have not had the chance yet to try OneLake against In general, the ADLS Gen 2 SDK is built on top of the Blob SDK and uses the Blob SDK for many of its method calls. So, the compatibility gap should be limited to methods in the ADLS Gen 2 SDK that do not just call out to the Blob SDK. One thing I'd like to explore is whether we can call out to the Data Lake-specific APIs only when needed, all while still working under the |
OneLake uses (yet) another protocol. This adds support for that protocol through onelake:// but also abfss://...@onelake.dfs.fabric... URLs. Backwards compatibility is preserved: traditional URLs are still routed to AzureBlobFileSystem.
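The routing described above could look roughly like the following sketch. It is not the code from this PR: `pick_filesystem` and the class name `OneLakeFileSystem` are hypothetical, the host check assumes the OneLake endpoint mentioned earlier in this thread (onelake.dfs.fabric.microsoft.com), and the function returns a class name as a string rather than instantiating anything.

```python
from urllib.parse import urlsplit

# Assumption: the OneLake host quoted earlier in this conversation.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def pick_filesystem(url: str) -> str:
    """Return the name of the filesystem class a URL would be routed to.

    'OneLakeFileSystem' is a hypothetical name used for illustration;
    'AzureBlobFileSystem' is the existing class named in this thread.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme == "onelake":
        return "OneLakeFileSystem"
    if scheme in ("abfs", "abfss"):
        # For abfss://name@host/path, urlsplit exposes the part after '@' as hostname.
        host = (parts.hostname or "").lower()
        if host == ONELAKE_HOST:
            return "OneLakeFileSystem"
        # Everything else keeps its existing behaviour.
        return "AzureBlobFileSystem"
    raise ValueError(f"unsupported protocol: {scheme!r}")
```

With this, `abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse/Files` would go to the OneLake implementation, while `abfss://container@account.dfs.core.windows.net/path` would stay on AzureBlobFileSystem.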