LzoInputFormat's listStatus() can take prohibitively long on S3 because it invokes FileInputFormat's listStatus() implementation #426
Comments
@buci Does not super.listStatus(job) ultimately translate to whateverFs#listStatus? You can control LIST_STATUS_NUM_THREADS by specifying the conf The real issue is whether S3 returns
@gerashegalov True -- Let me know if I'm wrong about whether
Oh, I guess other FileSystems are affected if the user has set
👍
Did you notice this issue too, pkallos?
Absolutely yes!
How are you solving it? The issue seems dead, but it's real, and I'm happy to code something different if someone proposes a better strategy.
Not getting around it at the moment, just absorbing the time cost, which is painful. Does your solution in #428 work as advertised?
I've tested a few times on a few inputs, and it's listed all files quickly. Add a comment if you find an issue.
Note that solution #428 is not backwards compatible, as it does not support globs and path filters.
@buci How is your input specified? What if the input is '/user/name/dir*'? I'm not sure of the contract for globStatus. One workaround would be to specify the parent directory (making sure the directory has only one file).
LzoInputFormat's listStatus() begins with a call to super.listStatus(job), where super refers to FileInputFormat. FileInputFormat's listStatus() calls singleThreadedListStatus() (when LIST_STATUS_NUM_THREADS == 1), which is not optimized for S3. This becomes a problem at job startup when an input directory contains many files: I've observed a single listStatus() call take 17 minutes when there are 50k files in an input path.
Proposed solution: use the listStatus method of the FileSystem of the appropriate input path (obtained from getInputPaths(job)). This will call the listStatus method of whatever class is specified by fs.s3[n].impl in core-site.xml.
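A minimal sketch of the proposal, assuming Hadoop's mapreduce API is on the classpath. The class name DirectListStatus and the method listInputs are hypothetical and are not taken from #428. The sketch asks each input path's own FileSystem for a one-level listing; as noted in the comments, an approach like this does not expand globs or apply path filters.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical helper, for illustration only.
public class DirectListStatus {

    /**
     * Lists every configured input path through that path's own FileSystem.
     * Assumption: inputs are plain files or directories (no globs), and the
     * listing is not recursive.
     */
    public static List<FileStatus> listInputs(JobContext job) throws IOException {
        List<FileStatus> result = new ArrayList<>();
        for (Path input : FileInputFormat.getInputPaths(job)) {
            // Resolves to whichever FileSystem class the configuration maps
            // to the path's scheme (for example, the fs.s3[n].impl setting).
            FileSystem fs = input.getFileSystem(job.getConfiguration());
            // One listStatus call per input path, handled by that FileSystem.
            result.addAll(Arrays.asList(fs.listStatus(input)));
        }
        return result;
    }
}
```

An overridden listStatus(job) could return DirectListStatus.listInputs(job) instead of starting from super.listStatus(job), at the cost of the glob and path-filter handling that FileInputFormat provides.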