HADOOP-18073. Upgrades Head, Copy, Put & Get operations. #5

ahmarsuhail · 2022-10-06T14:03:43Z

Description of PR

This PR updates client config, and the list, put, copy, head and get object operations.

When reviewing, it's probably easier to separate by commits.

Changes for client and config and update list operations - These have already been reviewed by @dannycjones here. That PR has been closed without merging, and will be replaced by this one.
Changes for Copy, Put & Head : Changes made are explained in Overview of changes for GetObjectMetadata, PutObject & CopyObject.pdf.
Changes for Get : Contributed by @passaro

Still TODO

There are some issues that I'm still looking into:

Transfer Manager Issues:
- Something goes wrong when you add the Transfer Manager dependency. In AWSClientConfig, org.apache.http.client.utils.URIBuilder is no longer available, and S3ADelegationTokens fails with a "class file for org.apache.commons.logging.Log not found" error.
- When I try to use progress listener with the TM, I get a log4j error.
- The TM does not return copyObjectResult which has the version ID, which causes an issue in the ChangeTracker.
- Copy with SSE-C does not work with the TM.
Have raised these with the AWS SDK team.
Previously, a call to getObjectMetadata for a base path, i.e. with an empty key, would return some metadata (bucket region, content type). headObject() fails without a key. I'm not sure what to do here; I will comment on the code too.

How was this patch tested?

Tested in eu-west-1 by running mvn -Dparallel-tests -DtestsThreadCount=16 clean verify

The following tests are failing:

| Test Failing | Reason |
|---|---|
| ITestS3AAWSCredentialsProvider.testBadCredentialsConstructor | Fails because SDKV2 throws SdkException, will work once errors are handled and translated properly. |
| ITestAuditAccessChecks.testDirAccessDenied | Fails because list requests are not wired up to auditor yet |
| ITestAuditAccessChecks.testFileAccessDenied | Fails because list requests are not wired up to auditor yet |
| ITestAuditAccessChecks.testMissingPathAccessFNFE | Fails because list requests are not wired up to auditor yet |
| ITestAuditManager.testRequestHandlerBinding | Fails because list requests are not wired up to auditor yet |
| ITestAWSStatisticCollection.testCommonCrawlStatistics | Fails as not tied up to auditor yet, so does not update stats when getObjectMetadata is called |
| ITestAWSStatisticCollection.testLandsatStatistics | Fails as not tied up to auditor yet, so does not update stats when getObjectMetadata is called |
| ITestMarkerTool.testRunWrongBucket | Throws NoSuchBucketException, which is not yet translated |
| ITestS3AConfiguration.testAutomaticProxyPortSelection | Fails because SDKV2 throws SdkException, will work once errors are handled and translated properly. |
| ITestS3AConfiguration.testProxyConnection | Fails because SDKV2 throws SdkException, will work once errors are handled and translated properly. |
| ITestS3AConfiguration.shouldBeAbleToSwitchOnS3PathStyleAccessViaConfigProperty | Fails because errors are not translated yet. It throws Software.S3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Will be fixed during error translation. |
| ITestS3AStorageClass.testCreateAndCopyObjectWithStorageClassGlacier | Copy works ok, but the test fails because it expects AccessDeniedException when it tries to get the object from Glacier, and errors are not translated yet. Will be updated during error translation work. |
| ITestS3ATemporaryCredentials.testInvalidSTSBinding, testSTS | Fails because it throws a 400, which is translated to AWSBadRequestException. Will be fixed during error translation. |
| ITestSessionDelegationInFileystem.testDelegatedFileSystem, ITestXAttrCost.testXAttrRoot | Fails because it tries to do headObject() with an empty key, which now fails; need to look at how to fix. Note: ITestSessionDelegationInFileystem.testDelegatedFileSystem does not fail as I've temporarily removed the headObject call from it |
| ITestS3AEncryptionSSEC | Fails due to SSE-C not working with the TM |
| TestStreamChangeTracker | Fails due to having to comment out some code, as the TM response does not yet have `CopyObjectResult` |
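
For reference, the "error translation" that several of these failures are waiting on could look roughly like the sketch below. It uses only JDK types; `ServiceFailure` is a hypothetical stand-in for an SDK service exception carrying an HTTP status code, and the mapping shown (403 to AccessDeniedException, 404 to FileNotFoundException) is an assumption for illustration, not the translation code planned for this work.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.AccessDeniedException;

final class ErrorTranslationSketch {
  private ErrorTranslationSketch() {
  }

  /** Hypothetical stand-in for an SDK service exception with an HTTP status code. */
  static final class ServiceFailure extends RuntimeException {
    final int statusCode;

    ServiceFailure(int statusCode, String message) {
      super(message);
      this.statusCode = statusCode;
    }
  }

  /** Map an unchecked service failure to the IOException subclass tests expect. */
  static IOException translate(String operation, String path, ServiceFailure e) {
    String text = operation + " on " + path + ": " + e.getMessage();
    IOException ioe;
    switch (e.statusCode) {
      case 403:
        ioe = new AccessDeniedException(path, null, text);
        break;
      case 404:
        ioe = new FileNotFoundException(text);
        break;
      default:
        ioe = new IOException(text);
        break;
    }
    // Keep the original failure attached so the stack trace is not lost.
    ioe.initCause(e);
    return ioe;
  }
}
```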

ahmarsuhail · 2022-10-06T14:38:15Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RequestFactoryImpl.java

@@ -173,6 +192,8 @@ protected String getBucket() {
   * if the encryption secrets contain the information/settings for this.
   * @return an optional set of KMS Key settings
   */
+  // TODO: This method can be removed during getObject work, as the key now comes directly from


This and generateSEECustomerKey() will be removed as part of MPU work.

ahmarsuhail · 2022-10-06T14:39:33Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/HeaderProcessing.java

-    maybeSetHeader(headers, XA_STORAGE_CLASS,
-        md.getReplicationStatus());
+        md.storageClassAsString());
+    // TODO: check this, looks wrong.


Can we remove this?

ahmarsuhail · 2022-10-06T14:40:17Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/ChangeTracker.java

-    }
+    // TODO: Commenting out temporarily, due to the TM not returning copyObjectResult
+    //  in the response.
+//    String newRevisionId = policy.getRevisionId(copyObjectResponse);


This causes TestStreamChangeTracker to fail.

ahmarsuhail · 2022-10-06T14:42:14Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

-              + keyToQualifiedPath(key))
-          .initCause(e);
-    }
+//    } catch (InterruptedException e) {


Really not sure what to do here and for copy, any help would be appreciated.

I assume there will be some kind of CompletionException thrown with a cause of InterruptedException or something similar? Is this the right direction?
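
One possible direction, sketched with JDK types only: if the Transfer Manager hands back a `CompletableFuture` (an assumption here), `join()` reports failures as `CompletionException`, and the cause can be mapped back to the `InterruptedIOException` the V1 code raised. The class and method names below are hypothetical, and this is a sketch rather than a proposal for the final code.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class TransferWaiter {
  private TransferWaiter() {
  }

  /** Block until the transfer future completes, translating failures to IOExceptions. */
  static <T> T waitForCompletion(CompletableFuture<T> future, String operation)
      throws IOException {
    try {
      return future.join();
    } catch (CancellationException e) {
      // join() throws this directly (unwrapped) when the future was cancelled.
      throw (IOException) new InterruptedIOException("Cancelled " + operation).initCause(e);
    } catch (CompletionException e) {
      // join() wraps any other failure; the real reason is the cause.
      Throwable cause = e.getCause() != null ? e.getCause() : e;
      if (cause instanceof InterruptedException) {
        throw (IOException) new InterruptedIOException("Interrupted " + operation)
            .initCause(cause);
      }
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      throw new IOException("Failed " + operation, cause);
    }
  }
}
```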

ahmarsuhail · 2022-10-06T14:48:13Z

hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/impl/ITestXAttrCost.java

@@ -64,6 +64,9 @@ public ITestXAttrCost() {
  @Test
  public void testXAttrRoot() throws Throwable {
    describe("Test xattr on root");
+    // TODO: Previously a call to getObjectMetadata for a base path, ie with an empty key would
+    //  return some metadata. (bucket region, content type). headObject() fails without a key, check
+    // how this can be fixed. 
    Path root = new Path("/");


The issue is that with the V1 client, s3V1.getObjectMetadata("my-bucket", "") worked when probing the root path (so the key was empty) and returned the bucket region and content type. With V2, s3v2.headObject(HeadObjectRequest.builder().bucket("my-bucket").build()) fails because the key must be specified. Instead we'll need to call headBucket() here, but this has other implications, and we will need some special code to handle it. Do we need to support XAttr operations on root dirs?

The other case is in testDelegatedFileSystem, where getObjectMetadata is used as a probe to check endpoint correctness. Can this simply be replaced by headBucket()?

It does seem like we'll have to add checks for if FS root and switch between headObject and headBucket. Appears painful at first but maybe it's better to be more explicit about the operations we're making rather than rely on this behaviour of SDK V1.
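
A minimal sketch of that explicit switch, using JDK types only. The two function parameters are hypothetical stand-ins for the real `headObject` and `headBucket` calls, and `RootAwareProbe` is an invented name, so this only illustrates the shape of the check.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Chooses between an object probe and a bucket probe based on the key. */
final class RootAwareProbe<M> {
  private final Function<String, M> headObject;
  private final Supplier<M> headBucket;

  RootAwareProbe(Function<String, M> headObject, Supplier<M> headBucket) {
    this.headObject = headObject;
    this.headBucket = headBucket;
  }

  /** An empty (or null) key means the filesystem root, which has no object to HEAD. */
  M probe(String key) {
    if (key == null || key.isEmpty()) {
      return headBucket.get();
    }
    return headObject.apply(key);
  }
}
```

The caller would still have to decide what metadata a root probe reports, since a bucket-level response does not carry the same fields as an object-level one.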

Key change: `getObject` now returns a `ResponseInputStream<GetObjectResponse>` rather than a `S3Object`. This makes it simpler to handle the input stream lifetime in various classes such as `S3AInputStream`, `S3ARemoteObject`, or `SDKStreamDrainer`.

passaro · 2022-10-10T11:13:43Z

More test failures:

ITestS3AFileContextStatistics.testStatistics - Mismatch in stats, likely same cause as ITestAuditAccessChecks.
ITestS3AOpenCost.testOpenFileLongerLength - Introduced with the getObject change, but just another issue with exception translation, to be fixed later.

dannycjones

I've done a very quick review of the code, skimming over most of it. I wanted to provide early, general feedback rather than wait ages for in-depth feedback.

There are a few comments on the code itself, but the core feedback is:

- Whenever we are adding methods like "doSomething" (V1) and "doSomethingV2", can we instead use overloading? With this, I hope to reduce the diff for reviewers.
- General thought: if we are changing public methods in classes like S3AFileSystem, can we think about limiting the visibility/scoping? Maybe not public/private but at least the visibility annotations used elsewhere. We are making a breaking change to the interface, so I would argue we are within our rights to break it more by moving it out of S3AFileSystem. S3AFileSystem is way too big and this is a chance to clean up the interface a bit. (Something to discuss with the community rather than address here, perhaps.)
- This PR is huge and difficult to review. How can we make these smaller? Can we try to get some milestones which let us get working code into the feature branch, and review operations one at a time? Maybe we do some setup PRs too, like adding error handling or auditor methods for both V1 and V2, and then we just focus on operations in each PR.

Let's discuss - I think there's great work in here, it's just difficult to review it in its current size.

dannycjones · 2022-10-18T17:39:06Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Invoker.java

      throw (SdkBaseException) caught;
+    } else {
+      throw (AwsServiceException) caught;


Why do we cast this in the first place? Can we not just do throw caught;? (not sure)

dannycjones · 2022-10-18T17:40:24Z

...op-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ProgressableProgressListener.java

-    if (progress != null) {
-      progress.progress();
-    }
+  public void  transferInitiated(TransferListener.Context.TransferInitiated context) {


nit: too many spaces before method name

dannycjones · 2022-10-18T17:41:12Z

...op-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ProgressableProgressListener.java

+
+  @Override
+  public void bytesTransferred(TransferListener.Context.BytesTransferred context) {
+


nit: drop empty line

dannycjones · 2022-10-18T17:41:26Z

...op-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ProgressableProgressListener.java

+    if(progress != null) {
+      progress.progress();
    }


nit: missing space on if statement

dannycjones · 2022-10-18T17:42:16Z

...op-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ProgressableProgressListener.java

-    long delta = upload.getProgress().getBytesTransferred() -
-        lastBytesTransferred;
+  public long uploadCompleted(ObjectTransfer upload) {
+


nit: drop empty line

dannycjones · 2022-10-18T18:38:58Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/ErrorTranslation.java

+  // TODO: This method will be replace isUnkownBucket() during error translation work.
+  /**
+   * Does this exception indicate that the AWS Bucket was unknown.
+   * @param e exception.
+   * @return true if the status code and error code mean that the
+   * remote bucket is unknown.
+   */
+  public static boolean isUnknownBucketV2(AwsServiceException e) {
+    return e.statusCode() == SC_404
+        && AwsErrorCodes.E_NO_SUCH_BUCKET.equals(e.awsErrorDetails().errorCode());
+  }


There are a lot of methods like these where we have the original and then a V2 version.

Can we actually use overloading here instead? i.e. this method would be boolean isUnknownBucket(AwsServiceException e).

If we can do that, we avoid a lot of "if instance of this, cast and use v1. else cast and use v2.". We'd just let the methods be polymorphic and reduce the size of the diff.

We can add "todo" to the method above to say "we will remove after". I thought about @deprecated but I imagine that'll just make Yetus really unhappy.
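
To illustrate the overloading idea with something that compiles on its own: the two exception classes below are hypothetical stand-ins for the V1 and V2 SDK exception types, and the `"NoSuchBucket"` constant is assumed to match `AwsErrorCodes.E_NO_SUCH_BUCKET`. Callers use one method name and the compiler picks the overload from the static type.

```java
final class UnknownBucketChecks {
  static final int SC_404 = 404;
  // Assumed value of AwsErrorCodes.E_NO_SUCH_BUCKET.
  static final String E_NO_SUCH_BUCKET = "NoSuchBucket";

  private UnknownBucketChecks() {
  }

  /** Hypothetical stand-in for the V1 service exception (getter-style accessors). */
  static final class V1ServiceException extends RuntimeException {
    private final int status;
    private final String code;

    V1ServiceException(int status, String code) {
      this.status = status;
      this.code = code;
    }

    int getStatusCode() { return status; }
    String getErrorCode() { return code; }
  }

  /** Hypothetical stand-in for the V2 service exception (fluent-style accessors). */
  static final class V2ServiceException extends RuntimeException {
    private final int status;
    private final String code;

    V2ServiceException(int status, String code) {
      this.status = status;
      this.code = code;
    }

    int statusCode() { return status; }
    String errorCode() { return code; }
  }

  // TODO: remove once all callers pass the V2 exception type.
  static boolean isUnknownBucket(V1ServiceException e) {
    return e.getStatusCode() == SC_404 && E_NO_SUCH_BUCKET.equals(e.getErrorCode());
  }

  static boolean isUnknownBucket(V2ServiceException e) {
    return e.statusCode() == SC_404 && E_NO_SUCH_BUCKET.equals(e.errorCode());
  }
}
```

One caveat: overload resolution happens at compile time, so a call site that only holds a common supertype (for example `Exception`) still needs an instanceof check before calling either overload.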

dannycjones · 2022-10-18T18:43:13Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RequestFactoryImpl.java

@@ -159,7 +168,7 @@ private <T extends AmazonWebServiceRequest> T prepareRequest(T t) {
   */
  // TODO: Currently this is a NOOP, will be completed separately as part of auditor work.
  @Retries.OnceRaw
-  private <T extends AwsRequest> T prepareV2Request(T t) {
+  private <T extends AwsRequest.Builder> T prepareV2Request(T t) {


can we use overloading here? (i genuinely have no idea if we can do it with generics. maybe?)

i.e. <T extends AwsRequest.Builder> T prepareRequest(T t)

again, ideally to avoid needing switches and renames in the code base.
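
On whether generics allow this: two generic methods may share a name as long as their erasures differ, and the erasure of a bounded type parameter is its bound. The sketch below checks that with hypothetical stand-in types (`V1Request` for the V1 request base class, `V2RequestBuilder` for the V2 builder interface); it is not the RequestFactoryImpl code.

```java
final class RequestPreparer {
  private RequestPreparer() {
  }

  /** Hypothetical stand-in for the V1 request base class. */
  abstract static class V1Request {
    String preparedBy;
  }

  /** Hypothetical stand-in for the V2 request builder interface. */
  interface V2RequestBuilder {
    void markPreparedBy(String who);

    String preparedBy();
  }

  static final class V1Head extends V1Request {
  }

  static final class V2HeadBuilder implements V2RequestBuilder {
    private String who;

    @Override
    public void markPreparedBy(String who) { this.who = who; }

    @Override
    public String preparedBy() { return who; }
  }

  // Erases to prepareRequest(V1Request).
  static <T extends V1Request> T prepareRequest(T t) {
    t.preparedBy = "v1";
    return t;
  }

  // Erases to prepareRequest(V2RequestBuilder), so it does not clash with the method above.
  static <T extends V2RequestBuilder> T prepareRequest(T t) {
    t.markPreparedBy("v2");
    return t;
  }
}
```

A call would only become ambiguous for an argument type that both extends the V1 base class and implements the V2 builder interface.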

dannycjones · 2022-10-18T18:46:41Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RequestFactoryImpl.java

+    HeadObjectRequest.Builder headObjectRequestBuilder =
+        HeadObjectRequest.builder().bucket(getBucket()).key(key);


prefer like

HeadObjectRequest.Builder headObjectRequestBuilder = HeadObjectRequest.builder()
    .bucket(getBucket())
    .key(key);

dannycjones · 2022-10-18T18:49:51Z

hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/AbstractAuditingTest.java

+  // TODO: Temporary change as auditor still expects V1 request, will be updated during auditor work.
  protected GetObjectMetadataRequest head() {
-    return manager.beforeExecution(
-        requestFactory.newGetObjectMetadataRequest("/"));
+//    return manager.beforeExecution(
+//        requestFactory.newGetObjectMetadataRequest("/"));
+    return manager.beforeExecution(new GetObjectMetadataRequest("test", "/"));


Can we add no-op implementations to the auditor with a TODO and accept the failing test? That way we can avoid recreating V1 requests here.

dannycjones · 2022-10-18T18:51:40Z

hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/impl/ITestXAttrCost.java

@@ -64,6 +64,9 @@ public ITestXAttrCost() {
  @Test
  public void testXAttrRoot() throws Throwable {
    describe("Test xattr on root");
+    // TODO: Previously a call to getObjectMetadata for a base path, ie with an empty key would
+    //  return some metadata. (bucket region, content type). headObject() fails without a key, check
+    // how this can be fixed. 
    Path root = new Path("/");


It does seem like we'll have to add checks for if FS root and switch between headObject and headBucket. Appears painful at first but maybe it's better to be more explicit about the operations we're making rather than rely on this behaviour of SDK V1.

ahmarsuhail · 2022-10-19T16:28:28Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/AWSClientConfig.java

+   */
+  private static URI buildURI(String host, int port) {
+    try {
+      return new URIBuilder().setHost(host).setPort(port).build();


this needs to be updated to set scheme!
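
For illustration, a sketch of a scheme-aware version. It swaps in `java.net.URI` from the JDK instead of Apache's `URIBuilder` so that it stands alone; the extra `scheme` parameter and the class name are assumptions, not the change that will land.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class ProxyUris {
  private ProxyUris() {
  }

  /** Build scheme://host:port; without a scheme the result would only be //host:port. */
  static URI buildURI(String scheme, String host, int port) {
    try {
      return new URI(scheme, null, host, port, null, null, null);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(
          "Invalid proxy URI for " + host + ":" + port, e);
    }
  }
}
```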

ahmarsuhail added 2 commits September 22, 2022 13:47

configures s3 client, updates list operation

77c5cf0

updates getObjectMetadata, putObject & copyObject operations

4085a8d

Upgrade GetObject to use SDK v2.

6a03f91

ahmarsuhail force-pushed the HADOOP-18073-sdk-upgrade-head-get branch from 570c5a7 to 6a03f91 Compare October 7, 2022 20:05

ahmarsuhail closed this Dec 9, 2022
