[WIP] Added Cassandra-Backend #128
base: master
…mance enhancements
self.cluster = Cluster(cluster_ips, cluster_port)
self.models = dict([(name, load_object(klass)) for name, klass in models.items()])

self.session_cls = self.cluster.connect()
This looks like a session, not a session class. Maybe worth renaming?
In general this looks awesome! It definitely requires some small changes, but this is already a great contribution!
The tests are broken, so we either have to find a way to test it with Travis CI, or disable this test for now.
def __init__(self, manager):
    self.manager = manager
    settings = manager.settings
    cluster_ips = settings.get('CASSANDRABACKEND_CLUSTER_IPS')  # Format: ['192.168.0.1', '192.168.0.2']
It would be great to include these settings in the documentation and move the format description there. I know docs are boring, but there is nothing we can do.
How can I trigger a re-test in Travis after I've made changes?
…ed because of the complex test-setup
Updated docs README to install the RTD theme if missing. Updated code to run only with the FIFO algorithm because of Cassandra's limited ORDER BY functionality
OK, my fork now runs without errors through Travis:
@@ -30,6 +30,9 @@ from this dir::

Documentation will be generated (in HTML format) inside the ``build/html`` dir.

If you get the error "ImportError: No module named sphinx_rtd_theme" run:
sudo pip install sphinx-rtd-theme
I think this is not needed, because the error message clearly states that the module is missing.
I'm marking this PR as WIP, meaning it's a work in progress and we shouldn't merge it. OK?
class QueueModel(Model):
    __table_name__ = 'queue'

    crawl = Text(primary_key=True)
Hi Michael. I was thinking about the data model you designed for the queue, and it's hard for me to see the logic there. What goals did you set? What access patterns (e.g. queries) do you expect? Is this designed for broad crawling?
Hi Alex,
it is designed for:
- Getting the next results. Here we need to filter by crawl ID and partition_id. That's why they are the first fields of the table/model.
- Sorting in get_next_results (distributed mode: score; non-distributed mode: created_at). As I wrote, if you want to order by a column, the columns that come before it in the clustering order must also appear in the WHERE or ORDER BY part of the query.
This: "SELECT * FROM queue WHERE crawl = 'default' AND partition_id=1 ORDER BY created_at" would not work because score is missing.
If you put fingerprint before created_at, you must also specify it in the query.
That's because of the special way Cassandra stores data...
Also, fingerprint must be in the clustering columns / primary key because deletion is done by key.
It's also possible to use indexes for some of the things mentioned, but they are slower and not recommended for production use cases...
I hope I've understood your question correctly.
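To make the ordering rule described above concrete, here is a minimal pure-Python sketch that checks a query shape against it. It does not talk to Cassandra. The clustering column order (score, created_at, fingerprint) is an assumption inferred from this comment, the helper name is_valid_order_by is hypothetical, and the rule is a simplification of what Cassandra actually validates.

```python
# Assumed clustering column order for the queue table (hypothetical,
# inferred from the discussion, not taken from the PR code).
CLUSTERING_COLUMNS = ("score", "created_at", "fingerprint")


def is_valid_order_by(where_columns, order_by, clustering=CLUSTERING_COLUMNS):
    """Check an ORDER BY clause against a simplified clustering-order rule.

    Simplified rule: every clustering column declared before the last
    ORDER BY column must appear either in WHERE or in ORDER BY, and the
    ORDER BY columns must follow the declared clustering order.
    """
    if not order_by:
        return True  # nothing to validate
    if any(column not in clustering for column in order_by):
        return False  # only clustering columns can be used for ordering
    positions = [clustering.index(column) for column in order_by]
    if positions != sorted(positions):
        return False  # ORDER BY must respect the declared order
    preceding = clustering[:positions[-1]]
    covered = set(where_columns) | set(order_by)
    return all(column in covered for column in preceding)


# The failing query from the comment: score is skipped.
failing = is_valid_order_by(["crawl", "partition_id"], ["created_at"])
# Naming score first makes the same ordering acceptable under this rule.
working = is_valid_order_by(["crawl", "partition_id"], ["score", "created_at"])
```

Under this simplified rule, `failing` is False and `working` is True, which matches the example query above.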
Yes, you've got my question right.
Just a few remarks:
- The sort column shouldn't be tied to the run mode. Sorting by score can be used in single-process mode, and sorting by creation date (FIFO) in distributed mode. It depends on the application's goals.
- fingerprint is needed by Frontera to operate, but it's wrong to use it for deleting items from the queue (e.g. in get_next_results). The same fingerprint can be scheduled several times, and that's not a mistake; the application may require it. An auto-incrementing field is a possible solution.
I have big doubts that anyone will use Frontera with Cassandra in non-distributed mode. Cassandra is designed for large volumes of data and it's distributed, so I would limit support for single-process mode to a minimum, mostly for debugging and testing purposes. Therefore, IMO, sorting only by score should be enough.
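The point about not deleting by fingerprint can be illustrated with a small in-memory sketch. This is not Frontera's or the PR's queue; the class InMemoryQueue and its method signatures are hypothetical, and a plain counter stands in for the auto-incrementing field. Each scheduled entry gets its own id, so the same fingerprint can sit in the queue several times and removing one entry leaves the others alone.

```python
import heapq
import itertools


class InMemoryQueue:
    """Toy score-ordered queue whose entries are identified by an entry id."""

    def __init__(self):
        self._heap = []
        # Stand-in for an auto-incrementing field; it also keeps FIFO
        # order among entries with equal scores.
        self._ids = itertools.count()

    def schedule(self, fingerprint, score):
        """Add one entry and return its unique entry id."""
        entry_id = next(self._ids)
        # heapq is a min-heap, so the score is negated to pop high scores first.
        heapq.heappush(self._heap, (-score, entry_id, fingerprint))
        return entry_id

    def get_next_results(self, max_n):
        """Pop up to max_n entries, highest score first."""
        results = []
        while self._heap and len(results) < max_n:
            _, entry_id, fingerprint = heapq.heappop(self._heap)
            results.append((entry_id, fingerprint))
        return results

    def __len__(self):
        return len(self._heap)


queue = InMemoryQueue()
first = queue.schedule("fp1", 0.5)
second = queue.schedule("fp1", 0.5)  # same fingerprint, scheduled again
third = queue.schedule("fp2", 0.9)
batch = queue.get_next_results(2)
```

After the batch is taken, one "fp1" entry is still queued, which would not be the case if entries were deleted by fingerprint.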
Hey @wpxgit what do you think of all that? Do you plan to contribute more?
Hello @sibiryakov @wpxgit, is there any plan to continue development?
I haven't heard anything, @maisumbruno. At Scrapinghub we're fine with HBase so far.
We are comfortable with how Cassandra works. If there are no plans to implement this, @sibiryakov, would you have any hints on how I can do this myself?
@maisumbruno Definitely. I would recommend taking inspiration from HBaseBackend, where you can find a queue suitable for large-scale crawling. You can also implement it in parts, say first States, then Queue, and Metadata if needed. You can send a PR any time and I'll have a look. But you know, the most important part is battle testing: at large volumes, storage starts to work slower, and this often requires refactoring, schema changes or various optimizations.
Thanks @sibiryakov
Added Cassandra as a backend, based on the SQLAlchemy code.