Data Lake: Iterating over result of get_paths method on FileSystemClient raises HTTP error #35617

Open
ShivnarenSrinivasan opened this issue May 14, 2024 · 3 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention This issue is responsible by Azure service team. Storage Storage Service (Queues, Blobs, Files)

Comments

@ShivnarenSrinivasan

Describe the bug
Calling the get_paths method and iterating over the result raises an HttpResponseError.

HttpResponseError: (InvalidQueryParameterValue) Value for one of the query parameters specified in the request URI is invalid.

To Reproduce
Steps to reproduce the behavior:

import os
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import ClientSecretCredential

ACCOUNT_NAME = os.getenv('ACCOUNT_NAME')

credential = ClientSecretCredential(os.getenv('TENANT_ID'), os.getenv('CLIENT_ID'), os.getenv('CLIENT_SECRET'))
service = DataLakeServiceClient(account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net", credential=credential)
# this connection works for creating, deleting, and modifying files and directories

filesystem = service.get_file_system_client('data')
for path in filesystem.get_paths('tmp'):
    print(path.name)
    # raises http exception instead

Expected behavior
The iterator object returned by get_paths should yield valid filesystem files/directories.

Screenshots
Error raised upon iteration:
image

Additional context
I do not believe this is a direct bug in the SDK, as I was able to replicate this issue while calling the underlying REST API directly. However, I am hoping for some insight into the overall process.
In any case, if the method does not work as documented, perhaps some changes are necessary.

@github-actions github-actions bot added the Client, customer-reported, needs-team-attention, question, Service Attention, and Storage labels May 14, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.

@weirongw23-msft weirongw23-msft self-assigned this May 14, 2024
@vincenttran-msft vincenttran-msft self-assigned this May 14, 2024
@vincenttran-msft
Member

Hi @ShivnarenSrinivasan, thanks for the inquiry! After taking a closer look at your RequestId, we were able to reproduce the error you are facing. The sample snippet provided above does not actually hit this issue, so here is a separate code example that should help explain the root cause.

image
Here is an example screenshot from my Azure Portal.

  • filesystemlevel is created using the "+ Container" option, so it is fundamentally different from the hierarchical structures beneath it (e.g. folders and files). This is your actual file system.
    image
  • firstdirectorylevel and everything nested beneath it are created using the "Upload" or "Add Directory" option. These are therefore not file systems; they are files or directories.
    image

With that being said, your RequestId shows that you are passing a directory path to the get_file_system_client API, which expects only the file system name.

For example, your code that reproduces the failure in this example would look like: service.get_file_system_client('filesystemlevel/firstdirectorylevel')

Whereas the correct code snippet would look like:
service.get_file_system_client('filesystemlevel')

Then, if your goal is to drill down to the paths in tmp, you would pass:
filesystem.get_paths('firstdirectorylevel/seconddirectorylevel/tmp')

In short, the root cause of the issue is that you were specifying more than just the file system when getting a file system client. Hopefully this example makes sense and unblocks your workflow. If not, please do not hesitate to reach out again!
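
To make the split concrete, here is a minimal sketch of a helper that separates a combined path into the file system name and the directory path. The name split_file_system_path is hypothetical and not part of the SDK. The commented-out calls at the end are the same ones used earlier in this thread and assume a DataLakeServiceClient named service.

```python
from typing import Tuple


def split_file_system_path(full_path: str) -> Tuple[str, str]:
    """Split 'container/dir/subdir' into ('container', 'dir/subdir').

    Hypothetical helper for illustration only; it is not part of
    azure-storage-file-datalake.
    """
    # Ignore leading/trailing slashes so '/data/tmp/' behaves like 'data/tmp'.
    trimmed = full_path.strip("/")
    if not trimmed:
        raise ValueError("expected at least a file system (container) name")
    # Only the first segment names the file system; the rest is a directory path.
    file_system, _, directory = trimmed.partition("/")
    return file_system, directory


file_system, directory = split_file_system_path(
    "filesystemlevel/firstdirectorylevel/seconddirectorylevel/tmp"
)
print(file_system)  # filesystemlevel
print(directory)    # firstdirectorylevel/seconddirectorylevel/tmp

# With a DataLakeServiceClient named `service`, as in the report above:
# filesystem = service.get_file_system_client(file_system)
# for path in filesystem.get_paths(directory or None):
#     print(path.name)
```

Passing `directory or None` lists from the root of the file system when no directory segment is present.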

Thanks!

@ShivnarenSrinivasan
Author

ShivnarenSrinivasan commented May 15, 2024

Thanks a lot, @vincenttran-msft -- in an attempt to simplify the code I was working with, I seem to have left out the most critical detail. My apologies.
One thing to add: I am part of an organization where I do not have admin privileges, and the directory I was trying to access was merely provisioned for me, so I was unaware of the container/directory distinction.

This explanation is very helpful, and after making the changes, I'm good to go.
The issue I raised is certainly resolved, but this does feel like a "gotcha" for an uninitiated user (especially since the HttpResponseError message is so generic).
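
As a rough sketch of how to get more out of the generic error, the helper below pulls the status and error code off a caught exception. It assumes the exception exposes status_code and error_code attributes, which I believe HttpResponseError does for storage errors. getattr is used so the helper still works if either is missing. The names describe_http_error and StandInError are hypothetical; StandInError only imitates the error from this report so the snippet runs without an Azure account.

```python
def describe_http_error(exc: Exception) -> str:
    """Summarize an HTTP error using whatever attributes it carries."""
    # Assumption: the exception may expose `status_code` and `error_code`;
    # fall back to None when an attribute is absent.
    status = getattr(exc, "status_code", None)
    code = getattr(exc, "error_code", None)
    return f"status={status} error_code={code} message={exc}"


class StandInError(Exception):
    """Hypothetical stand-in for the SDK error, for demonstration only."""

    def __init__(self, message: str, status_code: int, error_code: str):
        super().__init__(message)
        self.status_code = status_code
        self.error_code = error_code


try:
    raise StandInError("query parameter invalid", 400, "InvalidQueryParameterValue")
except StandInError as exc:
    print(describe_http_error(exc))

# Against the real service the same helper would be used like this:
# from azure.core.exceptions import HttpResponseError
# try:
#     for path in filesystem.get_paths('tmp'):
#         print(path.name)
# except HttpResponseError as exc:
#     print(describe_http_error(exc))
```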

Docs

I am not well versed in the terminology yet, but I couldn't find any specification of what a file system represents, or of its restrictions (i.e. that it should be a container), in the docs (which I believe is the README in the relevant git directory). Would it be worth adding some? Happy to submit a PR; it could go in the README or in the docstring of the get_file_system_client method itself.

Runtime Check

Further, I don't know if this is a correct assumption, but if containers cannot be nested, then the only valid argument to the get_file_system_client call would be a root-level name.
Would it be appropriate to add a runtime check to ensure that only a single path segment (i.e. the top level) is passed, rather than what I did?
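
A minimal sketch of what such a check could look like, assuming a file system name can never contain a path separator. The function name check_file_system_name is hypothetical, and I am not claiming the SDK performs or would accept this validation.

```python
def check_file_system_name(name: str) -> str:
    """Return `name` unchanged if it is a single top-level segment.

    Hypothetical check for illustration; not part of the SDK.
    """
    if not name:
        raise ValueError("file system name must not be empty")
    # A separator means the caller passed a path, not a container name.
    if "/" in name or "\\" in name:
        raise ValueError(
            f"{name!r} looks like a path; pass only the file system "
            "(container) name and give the directory to get_paths instead"
        )
    return name


print(check_file_system_name("data"))  # data

try:
    check_file_system_name("data/tmp")
except ValueError as exc:
    print(exc)
```

Failing early this way would surface the mistake before any request is sent, instead of as an InvalidQueryParameterValue error on the first iteration.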


3 participants