PageIterator skipping a page when browsing `list_objects_v2` with Delimiter #3119

dboyadzhiev · 2024-02-14T12:48:55Z

Describe the bug

Paginating S3 list_objects_v2 skips a page when CommonPrefixes (i.e. Delimiter) is used together with StartingToken.

Use case:
Our API provides a list of S3 "folders" and supports pagination. It is a wrapper over our internal S3 bucket and forwards the information. The first response of the API returns a list of common prefixes and the next token provided by the PageIterator. The second request uses this token to continue the listing.

Expected Behavior

Using the paginator.paginate() method with the Delimiter parameter and not setting StartingToken should return all pages starting from the first one and its next token.
Using it again but this time with a given StartingToken (the first page next token) should return all pages starting from the second one and its next token.

Current Behavior

When paginator.paginate() is called with StartingToken, it returns the second page with an empty CommonPrefixes list, and only the third page contains a valid CommonPrefixes list.

Reproduction Steps

You need a bucket with date partitions and files in them.

S3://by_bucket/2023-01-01/file1.json
S3://by_bucket/2023-01-01/file2.json
S3://by_bucket/2023-01-02/file1.json
S3://by_bucket/2023-01-02/file2.json
...
S3://by_bucket/2023-12-01/file1.json
S3://by_bucket/2023-12-01/file2.json

import boto3

BUCKET_NAME = ""  # name of the bucket to list
PREFIX = ""       # optional key prefix to list under
token = None      # no StartingToken for the first request

s3_client = boto3.client("s3")

def request_page(token):
    paginator = s3_client.get_paginator('list_objects_v2')
    return paginator.paginate(
        Bucket=BUCKET_NAME,
        Delimiter='/',
        Prefix=PREFIX,
        PaginationConfig={'PageSize': 5, 'StartingToken': token}
    )

# simulate multiple requests to an API
steps = 0

# First request 
# print page 1 prefixes
# keep the token for page 2
print("Request 1")
for page in request_page(token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

# Second request
# print page 2 prefixes
# keep the token for page 3
print("Request 2")
for page in request_page(next_token):
    steps += 1

    print(page['CommonPrefixes'])
    next_token = page['NextContinuationToken']

    if page['CommonPrefixes']:
        print(f"done in step: {steps}")
        break

Output:

> Request 1
> S3://by_bucket/2023-01-01
> S3://by_bucket/2023-01-02
> S3://by_bucket/2023-01-03
> S3://by_bucket/2023-01-04
> S3://by_bucket/2023-01-05
> done in step: 1
>
> Request 2
> []
> S3://by_bucket/2023-01-11
> S3://by_bucket/2023-01-12
> S3://by_bucket/2023-01-13
> S3://by_bucket/2023-01-14
> S3://by_bucket/2023-01-15
> done in step: 3

Possible Solution

No response

Additional Information/Context

I followed the issue down to PageIterator.__iter__() (.venv/lib/python3.11/site-packages/botocore/paginate.py)

            if first_request:
                # The first request is handled differently.  We could
                # possibly have a resume/starting token that tells us where
                # to index into the retrieved page.
                if self._starting_token is not None:
                    starting_truncation = self._handle_first_request(
                        parsed, primary_result_key, starting_truncation
                    )
                first_request = False
                self._record_non_aggregate_key_values(parsed)

The primary_result_key is initialized a few lines earlier as self.result_keys[0], and result_keys essentially comes from the JSON schema in venv/lib/python3.11/site-packages/botocore/data/s3/2006-03-01/paginators-1.json

"ListObjectsV2": {
      "more_results": "IsTruncated",
      "limit_key": "MaxKeys",
      "output_token": "NextContinuationToken",
      "input_token": "ContinuationToken",
      "result_key": [
        "Contents",
        "CommonPrefixes"
      ]
    },

so the primary result key is Contents, which is absent from the parsed S3 response body.
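One possible reading of this observation, sketched as a simplified model and not as botocore's actual code: if the resume step treats the first result key as primary and resets every other result key on the first resumed page, a delimiter-only response would lose its CommonPrefixes. The function `resume_first_page`, its arguments, and the sample prefixes are hypothetical and exist only for illustration.

```python
def resume_first_page(parsed, result_keys, truncate_amount=0):
    """Hypothetical model: slice the primary key, blank the secondary keys."""
    page = dict(parsed)
    primary, *secondary = result_keys

    # The primary key is sliced at the resume offset (None if it is absent).
    data = page.get(primary)
    page[primary] = data[truncate_amount:] if isinstance(data, list) else None

    # Secondary keys are reset, on the assumption that they were already
    # delivered together with the previous page.
    for key in secondary:
        value = page.get(key)
        page[key] = [] if isinstance(value, list) else None
    return page


# A delimiter-only response: no "Contents", only "CommonPrefixes".
# The prefix values are illustrative.
parsed = {
    "IsTruncated": True,
    "CommonPrefixes": [{"Prefix": "2023-01-06/"}, {"Prefix": "2023-01-07/"}],
}
resumed = resume_first_page(parsed, ["Contents", "CommonPrefixes"])
print(resumed["CommonPrefixes"])  # prints []
```

Under that assumption the first page after a StartingToken would show an empty CommonPrefixes list, which matches the `[]` in the output above.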

SDK version used

1.31.17

Environment details (OS name and version, etc.)

MacOS 14.2.1 (23C71)

amberkushwaha · 2024-08-05T09:30:32Z

investigating the prolonged fortage in it.

RyanFitzSimmonsAK · 2024-08-06T22:34:35Z

Hey @dboyadzhiev, thanks for reaching out and for the detailed reproduction steps. I was able to reproduce this behavior, and will bring it up with the team. I'll provide an update when I know more.

RyanFitzSimmonsAK · 2024-08-14T17:03:49Z

Hi @dboyadzhiev, thanks for your patience. Could you clarify why you have the first call separate from the rest? I was able to get all the common prefixes by using just one loop, and initializing next_token to None. This seems to be what you're trying to achieve, unless I'm misunderstanding the problem.

dboyadzhiev · 2024-08-15T07:58:44Z

We used that logic to implement pagination. With the code above I simulated two separate requests. Imagine you have an app that lists 20 files per page; the second request corresponds to clicking the "next" button.
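For the "next button" use case, one possible workaround is to bypass the paginator and call list_objects_v2 directly, passing the raw NextContinuationToken back as ContinuationToken. This is a minimal sketch, not a fix for the paginator itself; the helper name `list_folder_page` and its parameters are hypothetical.

```python
def list_folder_page(s3_client, bucket, prefix="", page_size=5, token=None):
    """Return (common prefixes, next token) for one page of S3 "folders".

    Hypothetical helper: s3_client is expected to be a boto3 S3 client.
    """
    kwargs = {
        "Bucket": bucket,
        "Prefix": prefix,
        "Delimiter": "/",
        "MaxKeys": page_size,
    }
    # Only send ContinuationToken when resuming from a previous page.
    if token is not None:
        kwargs["ContinuationToken"] = token

    response = s3_client.list_objects_v2(**kwargs)

    # Both keys are optional in the response, so read them defensively.
    prefixes = [p["Prefix"] for p in response.get("CommonPrefixes", [])]
    next_token = response.get("NextContinuationToken")
    return prefixes, next_token


# Usage sketch (requires boto3 and AWS credentials; bucket name is a placeholder):
# import boto3
# s3_client = boto3.client("s3")
# page1, token = list_folder_page(s3_client, "my-bucket")
# page2, token = list_folder_page(s3_client, "my-bucket", token=token)
```

The token returned here is S3's own continuation token, not the paginator's StartingToken, so the two should not be mixed. A next token of None means there are no further pages.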

amberkushwaha · 2024-11-05T10:11:24Z

GitHub actions were not supported by file.

dboyadzhiev added the bug and needs-triage labels Feb 14, 2024

RyanFitzSimmonsAK self-assigned this May 9, 2024

RyanFitzSimmonsAK added the investigating, s3, and p2 labels and removed the needs-triage label May 9, 2024

RyanFitzSimmonsAK added the response-requested label and removed the needs-review label Aug 14, 2024

github-actions bot removed the response-requested label Aug 16, 2024
