docs: clarify AWS Lambda storage #2477

Open
wants to merge 5 commits into
base: master

Conversation

connorads
Contributor

There is ephemeral storage in `/tmp`
https://docs.aws.amazon.com/lambda/latest/api/API_EphemeralStorage.html

Which could technically be used if desired
`CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage`
@connorads connorads changed the title from "Clarify AWS Lambda storage" to "docs: clarify AWS Lambda storage" May 20, 2024
@connorads connorads marked this pull request as ready for review May 20, 2024 08:40
Member

@B4nan B4nan left a comment

I am honestly not sure this adds any clarity; to me it actually adds confusion (as now you say it's Crawlee that has read-only storage?). If you want to improve this, why not explicitly mention what you said in the PR description?

@B4nan B4nan added the t-tooling Issues with this label are in the ownership of the tooling team. label May 22, 2024
@barjin
Contributor

barjin commented May 23, 2024

> Which could technically be used if desired (`CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage`)

This is only true to an extent: the ephemeral storage can be shared between different Lambda invocations, provided they run in the same execution environment (i.e. if you call the Lambdas one after another, AWS will reuse the running Lambda environment). This might cause some very hard-to-debug issues (stale shared state left over from previous runs). Even though Crawlee should always purge the previous state, you can never be too cautious with these things :) This is especially important if you want to run multiple crawler instances in one Lambda.
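
A minimal sketch of the reuse described above, not Crawlee code: two calls in one Node.js process stand in for two invocations that land in the same execution environment. The `handler` function, the `crawlee-demo` directory, and the `seen.txt` file are hypothetical names chosen for illustration; only Node's built-in `fs`, `os`, and `path` modules are used.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical directory standing in for CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage.
const storageDir = path.join(os.tmpdir(), "crawlee-demo", "storage");

// Hypothetical handler: records one URL on disk and returns everything stored so far.
function handler(url: string, purgeFirst: boolean): string[] {
    if (purgeFirst) {
        // Remove anything a previous invocation left behind.
        fs.rmSync(storageDir, { recursive: true, force: true });
    }
    fs.mkdirSync(storageDir, { recursive: true });
    const file = path.join(storageDir, "seen.txt");
    fs.appendFileSync(file, url + "\n");
    return fs.readFileSync(file, "utf8").trim().split("\n");
}

// Back-to-back calls in one process mimic a reused (warm) execution environment.
const first = handler("https://example.com/a", true);   // starts clean
const leaked = handler("https://example.com/b", false); // still sees the first URL
const clean = handler("https://example.com/c", true);   // purged, sees only its own URL

console.log({ first, leaked, clean });
```

Without the purge, the second call sees the URL written by the first, which is the kind of leftover state that is hard to track down when it only shows up on warm starts.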

I agree with @B4nan that explaining all these whys and wherefores is rather counterproductive. I'd show the one and only way to do this rather than confusing the reader with (more or less) irrelevant details.

@connorads
Contributor Author

Thanks for your feedback, @B4nan and @barjin.

Sounds like you're saying we should use in-memory storage not because of the read-only Lambda filesystem but because the ephemeral storage would introduce "statefulness" and potentially hard-to-debug issues. I've tried to update it to express that in 70a4fdd.

If you still think it's worse than before, then feel free to edit it and/or close this pull request.

Labels
t-tooling Issues with this label are in the ownership of the tooling team.
3 participants