PageIterator skipping a page when browsing list_objects_v2 with Delimiter #3119

Open
dboyadzhiev opened this issue Feb 14, 2024 · 5 comments
Assignees
Labels
bug This issue is a confirmed bug. p2 This is a standard priority issue s3

Comments

@dboyadzhiev

dboyadzhiev commented Feb 14, 2024

Describe the bug

Paginating S3 list_objects_v2 skips a page when CommonPrefixes (i.e. Delimiter) is used together with StartingToken.

Use case:
Our API provides a list of S3 "folders" and supports pagination. It is a wrapper over our internal S3 bucket and forwards the information. The first response of the API returns a list of common prefixes and the next token provided by the PageIterator. The second request uses this token to continue the listing.

Expected Behavior

Using the paginator.paginate() method with the Delimiter parameter and not setting StartingToken should return all pages starting from the first one and its next token.
Using it again but this time with a given StartingToken (the first page next token) should return all pages starting from the second one and its next token.

Current Behavior

When paginator.paginate() is called with StartingToken, the second page comes back with an empty CommonPrefixes list, while the third page has a valid CommonPrefixes list.

Reproduction Steps

You need a bucket with date partitions and files in them.

S3://by_bucket/2023-01-01/file1.json
S3://by_bucket/2023-01-01/file2.json
S3://by_bucket/2023-01-02/file1.json
S3://by_bucket/2023-01-02/file2.json
...
S3://by_bucket/2023-12-01/file1.json
S3://by_bucket/2023-12-01/file2.json
import boto3

BUCKET_NAME = ""
PREFIX = ""
token = None

s3_client = boto3.client("s3")

def request_page(token):
    paginator = s3_client.get_paginator('list_objects_v2')
    return paginator.paginate(
        Bucket=BUCKET_NAME,
        Delimiter='/',
        Prefix=PREFIX,
        PaginationConfig={'PageSize': 5, 'StartingToken': token}
    )

# simulate multiple requests to an API
steps = 0

# First request 
# print page 1 prefixes
# keep the token for page 2
print("Request 1")
for page in request_page(token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

# Second request 
# print page 2 prefixes
# keep the token for page 3
print("Request 2")
for page in request_page(next_token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

Output:

> Request 1
> S3://by_bucket/2023-01-01
> S3://by_bucket/2023-01-02
> S3://by_bucket/2023-01-03
> S3://by_bucket/2023-01-04
> S3://by_bucket/2023-01-05
> done in step: 1
>
> Request 2
> []
> S3://by_bucket/2023-01-11
> S3://by_bucket/2023-01-12
> S3://by_bucket/2023-01-13
> S3://by_bucket/2023-01-14
> S3://by_bucket/2023-01-15
> done in step: 3

Possible Solution

No response

Additional Information/Context

I traced the issue to PageIterator.__iter__() (.venv/lib/python3.11/site-packages/botocore/paginate.py)

            if first_request:
                # The first request is handled differently.  We could
                # possibly have a resume/starting token that tells us where
                # to index into the retrieved page.
                if self._starting_token is not None:
                    starting_truncation = self._handle_first_request(
                        parsed, primary_result_key, starting_truncation
                    )
                first_request = False
                self._record_non_aggregate_key_values(parsed)

The primary_result_key is initialized a few lines earlier as self.result_keys[0], and result_keys come from the paginator definition in venv/lib/python3.11/site-packages/botocore/data/s3/2006-03-01/paginators-1.json

"ListObjectsV2": {
      "more_results": "IsTruncated",
      "limit_key": "MaxKeys",
      "output_token": "NextContinuationToken",
      "input_token": "ContinuationToken",
      "result_key": [
        "Contents",
        "CommonPrefixes"
      ]
    },

Here the first result_key entry (the primary one) is Contents, which is missing from the parsed S3 response body when every key is rolled up into CommonPrefixes.
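
To make the symptom easier to picture, here is a toy model of it, assuming that the first page fetched after a StartingToken keeps only the primary result key and resets the remaining result keys. The function name blank_secondary_keys and its logic are written for this discussion and are not botocore code; the real handling in paginate.py may differ in its details.

```python
def blank_secondary_keys(parsed, result_keys, start_index=0):
    """Hypothetical model of how a resumed first page might be rewritten."""
    page = dict(parsed)
    primary, *secondary = result_keys

    # The primary key is sliced from the resume index; if it is absent
    # (no "Contents" in the response), it ends up as None.
    data = page.get(primary)
    page[primary] = data[start_index:] if isinstance(data, list) else None

    # Assumption: every other result key is emptied on the resumed page.
    for key in secondary:
        if isinstance(page.get(key), list):
            page[key] = []
    return page


resumed = blank_secondary_keys(
    {"CommonPrefixes": [{"Prefix": "2023-01-06/"}], "NextContinuationToken": "t2"},
    ["Contents", "CommonPrefixes"],
)
print(resumed["CommonPrefixes"])
```

Under this model the prefixes that S3 returned for the second page are dropped before the caller sees them, which would match the empty list printed in Request 2.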

SDK version used

1.31.17

Environment details (OS name and version, etc.)

MacOS 14.2.1 (23C71)

@dboyadzhiev dboyadzhiev added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Feb 14, 2024
@RyanFitzSimmonsAK RyanFitzSimmonsAK self-assigned this May 9, 2024
@RyanFitzSimmonsAK RyanFitzSimmonsAK added investigating This issue is being investigated and/or work is in progress to resolve the issue. s3 p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels May 9, 2024
@amberkushwaha

investigating the prolonged fortage in it.

@RyanFitzSimmonsAK
Contributor

Hey @dboyadzhiev, thanks for reaching out and for the detailed reproduction steps. I was able to reproduce this behavior, and will bring it up with the team. I'll provide an update when I know more.

@RyanFitzSimmonsAK RyanFitzSimmonsAK added response-requested Waiting on additional info and feedback. needs-review This issue or pull request needs review from a core team member. and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. response-requested Waiting on additional info and feedback. labels Aug 6, 2024
@RyanFitzSimmonsAK
Contributor

Hi @dboyadzhiev, thanks for your patience. Could you clarify why you have the first call separate from the rest? I was able to get all the common prefixes by using just one loop, and initializing next_token to None. This seems to be what you're trying to achieve, unless I'm misunderstanding the problem.
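
For reference, a single-loop listing along the lines described here might look like the sketch below. The helper name collect_common_prefixes is hypothetical, and the code actually run for this comment may have been different; the helper only assumes that each page is a dict that may contain a CommonPrefixes list.

```python
def collect_common_prefixes(pages):
    """Gather every common prefix from an iterable of list_objects_v2 pages."""
    prefixes = []
    for page in pages:
        # Pages without rolled-up keys may omit "CommonPrefixes" entirely.
        prefixes.extend(cp["Prefix"] for cp in page.get("CommonPrefixes", []))
    return prefixes


# Possible usage with the paginator from the reproduction steps
# (no StartingToken, so pagination always begins at the first page):
#   pages = paginator.paginate(Bucket=BUCKET_NAME, Delimiter='/', Prefix=PREFIX,
#                              PaginationConfig={'PageSize': 5})
#   print(collect_common_prefixes(pages))
```

This walks all pages in one process, so it does not cover the case where a later request has to resume from a stored token.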

@RyanFitzSimmonsAK RyanFitzSimmonsAK added response-requested Waiting on additional info and feedback. and removed needs-review This issue or pull request needs review from a core team member. labels Aug 14, 2024
@dboyadzhiev
Author

We used that logic to implement pagination in our API. The code above simulates two separate requests. Imagine an app that lists 20 files per page: the second request corresponds to clicking the "Next" button.
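
Until the paginator behavior is settled, one possible workaround for this "next page" use case is to skip the paginator and call list_objects_v2 directly, passing the stored token as ContinuationToken. This is a minimal sketch, not a confirmed fix: fetch_prefix_page is a hypothetical helper name, while Bucket, Delimiter, Prefix, MaxKeys and ContinuationToken are the regular list_objects_v2 parameters.

```python
def fetch_prefix_page(client, bucket, prefix="", token=None, page_size=5):
    """Fetch one page of common prefixes plus the token for the next page.

    Hypothetical helper for illustration; `client` is expected to be a
    boto3 S3 client (or any object with a compatible list_objects_v2).
    """
    kwargs = {
        "Bucket": bucket,
        "Delimiter": "/",
        "Prefix": prefix,
        "MaxKeys": page_size,
    }
    # Only send ContinuationToken when resuming from a previous page.
    if token is not None:
        kwargs["ContinuationToken"] = token

    response = client.list_objects_v2(**kwargs)
    prefixes = [cp["Prefix"] for cp in response.get("CommonPrefixes", [])]
    # NextContinuationToken is absent on the last page, so this returns None.
    return prefixes, response.get("NextContinuationToken")
```

Each API request would then call fetch_prefix_page(s3_client, BUCKET_NAME, PREFIX, token) once and hand the returned token back to the caller for the following request.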

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Aug 16, 2024
@amberkushwaha

GitHub actions were not supported by file.

3 participants