LzoInputFormat's listStatus() can take prohibitively long on S3 because it invokes FileInputFormat's listStatus() implementation #426

Open
nellore opened this issue Dec 13, 2014 · 11 comments

Comments

@nellore

nellore commented Dec 13, 2014

LzoInputFormat's listStatus begins with

List<FileStatus> files = super.listStatus(job);

where super refers to FileInputFormat. FileInputFormat's listStatus() calls singleThreadedListStatus() (when LIST_STATUS_NUM_THREADS == 1), which is not optimized for S3. This becomes an issue at job start when an input directory contains many files; I've observed a single listStatus() call take 17 minutes with 50k files in an input path.

Proposed solution: use the listStatus method of the FileSystem of the appropriate input path (obtained from getInputPaths(job)). This will call the listStatus method of whatever class is specified by fs.s3[n].impl in core-site.xml.

@nellore nellore changed the title LzoInputFormat's listStatus() can take prohibitively long on S3 because it invokes FileInputFormat's listStatus implementation LzoInputFormat's listStatus() can take prohibitively long on S3 because it invokes FileInputFormat's listStatus() implementation Dec 13, 2014
@gerashegalov
Contributor

@buci Doesn't super.listStatus(job) ultimately translate to whateverFs#listStatus?

You can control LIST_STATUS_NUM_THREADS by setting the configuration property mapreduce.input.fileinputformat.list-status.num-threads.

The real issue is whether S3 returns LocatedFileStatus as HDFS does. If not, FileInputFormat (FIF) has to do another round of RPCs, calling getFileBlockLocations for each file.

@nellore
Author

nellore commented Dec 16, 2014

@gerashegalov True -- FileInputFormat does have to do a round of RPCs to getFileBlockLocations if a file is not an instance of LocatedFileStatus. I experimented with increasing mapreduce.input.fileinputformat.list-status.num-threads to 20 before making the pull request. It appeared to have zero effect on speed when I tried FileInputFormat's listStatus, and I think the issue is the globStatus calls. They call glob on Globber objects, which -- correct me if I'm wrong -- leads to lots of fs method invocations (like isDirectory) that each make a separate S3 request. The pull request resolves the issue on EMR by calling only methods overridden in emrfs (decompile emrfs-1.0.0.jar on one of the latest AMIs for details) -- that is, listStatus and getFileBlockLocations, which appear to be designed to minimize S3 requests. For other FileSystems, speed should not be affected.

Let me know if I'm wrong about globStatus being the problem. If it isn't the cause, there must be some other explanation for the significant performance improvement I see with my implementation.
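
To make the request-count argument concrete, here is a toy model in plain Java. It is not Hadoop, Globber, or EMRFS code: `CountingStore`, `listPrefix`, and the key names are hypothetical, and the only assumption is that every call to the store costs one remote request. It compares probing each listed entry individually with relying on a single prefix listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model only: counts simulated requests to a hypothetical object store. */
public class ListingCostSketch {

    /** Hypothetical in-memory store; each method call counts as one remote request. */
    static final class CountingStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int requests = 0;

        CountingStore(int fileCount) {
            for (int i = 0; i < fileCount; i++) {
                keys.add(String.format("input/part-%05d.lzo", i));
            }
        }

        /** One request returns every key under the prefix. */
        List<String> listPrefix(String prefix) {
            requests++;
            List<String> out = new ArrayList<>();
            for (String key : keys.tailSet(prefix)) {
                if (!key.startsWith(prefix)) {
                    break; // keys are sorted, so nothing further can match
                }
                out.add(key);
            }
            return out;
        }

        /** One request per probed key, standing in for a per-path metadata check. */
        boolean isDirectory(String key) {
            requests++;
            String child = keys.ceiling(key + "/");
            return child != null && child.startsWith(key + "/");
        }
    }

    /** Lists once, then probes every entry separately: 1 + N requests. */
    static int probeEachEntry(CountingStore store, String prefix) {
        int files = 0;
        for (String entry : store.listPrefix(prefix)) {
            if (!store.isDirectory(entry)) {
                files++;
            }
        }
        return files;
    }

    /** Trusts the single listing: 1 request regardless of N. */
    static int listOnce(CountingStore store, String prefix) {
        return store.listPrefix(prefix).size();
    }

    public static void main(String[] args) {
        CountingStore probing = new CountingStore(1000);
        probeEachEntry(probing, "input/");
        CountingStore single = new CountingStore(1000);
        listOnce(single, "input/");
        System.out.println("per-entry probing: " + probing.requests + " requests");
        System.out.println("single listing:    " + single.requests + " requests");
    }
}
```

In this model the per-entry strategy grows linearly with the number of files while the single listing stays constant. Whether the real globStatus path behaves this way on S3 is exactly the open question in this thread.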

@nellore
Author

nellore commented Dec 16, 2014

Oh, I guess other FileSystems are affected if the user has set mapreduce.input.fileinputformat.list-status.num-threads....

@pkallos

pkallos commented Feb 28, 2015

👍

@nellore
Author

nellore commented Feb 28, 2015

Did you notice this issue too, pkallos?

@pkallos

pkallos commented Mar 2, 2015

absolutely yes!

@nellore
Author

nellore commented Mar 3, 2015

How are you solving it? The issue seems dead, but it's real, and I'm happy to code something different if someone proposes a better strategy.

@pkallos

pkallos commented Mar 3, 2015

not getting around it at the moment, just absorbing the time-cost which is painful. does your solution in #428 work as advertised?

@nellore
Author

nellore commented Mar 3, 2015

I've tested a few times on a few inputs, and it's listed all files quickly. Add a comment if you find an issue.

@gerashegalov
Contributor

Note that solution #428 is not backwards compatible as it does not support globs and path filters.
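
For illustration, one way a flat listing could still honor a glob and a path filter is to post-filter the listed paths. The sketch below uses only the JDK: `java.nio`'s glob matcher stands in for Hadoop's glob handling (the two syntaxes are not guaranteed to agree), and `select`, the sample paths, and the `.lzo` filter are hypothetical. It is not what #428 does and not a drop-in fix.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Sketch only: applies a glob and a filter to paths that were already listed. */
public class GlobFilterSketch {

    /**
     * Keeps the listed paths that match the glob AND pass the filter.
     * No further filesystem access happens here; everything runs on strings.
     */
    static List<String> select(List<String> listed, String glob, Predicate<String> filter) {
        // java.nio glob syntax; '*' does not cross a directory boundary.
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> kept = new ArrayList<>();
        for (String path : listed) {
            if (matcher.matches(Paths.get(path)) && filter.test(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical result of one flat listing under /user/name/.
        List<String> listed = Arrays.asList(
                "/user/name/dir1/a.lzo",
                "/user/name/dir2/b.lzo",
                "/user/name/dir2/b.lzo.index",
                "/user/name/other/c.lzo");
        // The glob selects the dir* directories; the filter keeps only .lzo files.
        System.out.println(select(listed, "/user/name/dir*/*", p -> p.endsWith(".lzo")));
    }
}
```

The design choice here is to pay for one broad listing and do the pattern matching locally, which only helps when the common prefix of the glob does not contain far more objects than the glob itself selects.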

@rangadi
Contributor

rangadi commented Mar 4, 2015

@buci How is your input specified? What if the input is '/user/name/dir*'? I'm not sure of the contract for globStatus. One workaround would be to specify the parent directory (making sure the directory has only one file).


4 participants