[WIP] Added Cassandra-Backend #128

Open
wants to merge 18 commits into
base: master

MichaelVIU

Added Cassandra as a backend, based on the SQLAlchemy backend code.

self.cluster = Cluster(cluster_ips, cluster_port)
self.models = dict([(name, load_object(klass)) for name, klass in models.items()])

self.session_cls = self.cluster.connect()
Member

It looks like a session, not a session class. Maybe worth renaming?

@sibiryakov
Member

In general looks awesome! It definitely requires some small changes, but this is already a great contribution!

@sibiryakov
Member

So the tests are broken. We either have to find a way to test it with Travis CI or disable this test for now.

def __init__(self, manager):
    self.manager = manager
    settings = manager.settings
    cluster_ips = settings.get('CASSANDRABACKEND_CLUSTER_IPS') # Format: ['192.168.0.1', '192.168.0.2']
Member

It would be great to include these settings in the documentation and move the format description there. I know docs are boring, but there is nothing we can do about it.
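A documented settings entry could look roughly like the sketch below. Only `CASSANDRABACKEND_CLUSTER_IPS` and its list format come from the PR; the name `CASSANDRABACKEND_CLUSTER_PORT` and the helper `validate_cluster_settings` are assumptions made for illustration, not part of Frontera.

```python
# Hypothetical settings fragment. Only CASSANDRABACKEND_CLUSTER_IPS appears in the PR.
CASSANDRABACKEND_CLUSTER_IPS = ['192.168.0.1', '192.168.0.2']  # list of contact point IPs
CASSANDRABACKEND_CLUSTER_PORT = 9042  # assumed setting name; 9042 is Cassandra's default native port


def validate_cluster_settings(ips, port):
    """Return (ips, port) if they match the documented format, else raise ValueError.

    Illustrative helper, not an existing Frontera function.
    """
    if not isinstance(ips, (list, tuple)) or not ips:
        raise ValueError("CASSANDRABACKEND_CLUSTER_IPS must be a non-empty list of strings")
    if not all(isinstance(ip, str) and ip for ip in ips):
        raise ValueError("every cluster IP must be a non-empty string")
    if isinstance(port, bool) or not isinstance(port, int) or not 0 < port < 65536:
        raise ValueError("the cluster port must be an integer between 1 and 65535")
    return list(ips), port
```

Checking the format once at start-up would turn a misconfigured value, such as a single string instead of a list, into an immediate and readable error.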

@MichaelVIU
Author

How can I trigger a re-test in Travis after I've made changes?

@MichaelVIU
Author

OK, my fork now passes on Travis without errors:
https://travis-ci.org/wpxgit/frontera/builds/120262078

@@ -30,6 +30,9 @@ from this dir::

Documentation will be generated (in HTML format) inside the ``build/html`` dir.

If you get the error "ImportError: No module named sphinx_rtd_theme" run:
sudo pip install sphinx-rtd-theme
Member

I think this is not needed, because the error message states pretty clearly that the module is missing.

@sibiryakov
Member

I'm marking this PR as WIP, meaning it's a work in progress and we shouldn't merge it. OK?

@sibiryakov sibiryakov changed the title Added Cassandra-Backend [WIP] Added Cassandra-Backend Apr 4, 2016
class QueueModel(Model):
    __table_name__ = 'queue'

    crawl = Text(primary_key=True)
Member

Hi Michael. I was thinking about the data model you designed for the Queue, and it's hard for me to see the logic there. What goals did you set? What access patterns (e.g. queries) do you expect? Is this designed for broad crawling?

Author

Hi Alex,

It is designed for:

  1. Getting the next results
    Here we need to filter by crawl ID and partition_id.
    That's why they are the first fields of the table/model.
  2. Sorting get_next_results (for distributed mode: score, for non-distributed mode: created_at)
    As I wrote, if you want to order by a column, all the key columns that come before it must appear in the WHERE or ORDER BY part of the query.
    This: "SELECT * FROM queue WHERE crawl = 'default' AND partition_id=1 ORDER BY created_at" would not work because score is missing.
    If you put fingerprint before created_at, you must also specify it in the query.
    That's because of the special way Cassandra stores data...

Also, fingerprint must be part of the clustering columns / primary key, because rows are deleted by key.

It's also possible to use indexes for some of the things mentioned, but they are slower and not recommended for production use cases...

I hope I've understood your question correctly.
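The ordering rule described above can be sketched as a small pure-Python check. This is a simplified model of the rule as stated in this comment, not Cassandra's actual query validation. The key column order `crawl, partition_id, score, created_at, fingerprint` is an assumption inferred from the comment, and `can_order_by` is a hypothetical helper.

```python
def can_order_by(key_columns, eq_restricted, order_by):
    """Simplified model of the ordering rule described in the comment above.

    key_columns:   primary-key columns in their declared order
    eq_restricted: columns fixed with '=' in the WHERE clause
    order_by:      columns named in ORDER BY, in order
    """
    if not order_by:
        return True
    if order_by[0] not in key_columns:
        return False
    start = key_columns.index(order_by[0])
    # Every key column before the first sort column must be pinned with '='.
    if any(col not in eq_restricted for col in key_columns[:start]):
        return False
    # The sort columns must follow the declared key order without gaps.
    return list(key_columns[start:start + len(order_by)]) == list(order_by)


# Assumed key order, inferred from the discussion rather than taken from the PR code.
QUEUE_KEY = ['crawl', 'partition_id', 'score', 'created_at', 'fingerprint']

# The failing query from the comment: score is neither restricted nor sorted on.
rejected = can_order_by(QUEUE_KEY, {'crawl', 'partition_id'}, ['created_at'])
# Naming score in ORDER BY first makes the sort columns contiguous.
accepted = can_order_by(QUEUE_KEY, {'crawl', 'partition_id'}, ['score', 'created_at'])
```

Under this model `rejected` is `False` and `accepted` is `True`, which matches the behaviour described for the example query.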

Member

Yes, you've got my question right.
Just a few remarks:

  • The sort column shouldn't be tied to the run mode. Sorting by score can be used in single-process mode, and sorting by creation date (FIFO) in distributed mode. It depends on the application's goals.
  • The fingerprint is needed by Frontera to operate, but it's wrong to use it for deleting items from the queue (e.g. in get_next_results). The same fingerprint can be scheduled several times, and that's not a mistake; it can be required by the application. An auto-incrementing field is a possible solution.

I have serious doubts that anyone will use Frontera with Cassandra in non-distributed mode. Cassandra is designed for big volumes of data and it's distributed, so I would limit support for single-process mode to a minimum, mostly for debugging and testing purposes. Therefore, IMO, sorting only by score should be enough.
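The deletion point can be illustrated with a minimal in-memory sketch, assuming each scheduled entry gets its own identifier. `InMemoryQueue` is a hypothetical class, not Frontera's or the PR's implementation. The counter stands in for the auto-incrementing field mentioned above; Cassandra has no auto-increment columns, so a real backend would need some other unique value, for example a time-based UUID.

```python
import itertools


class InMemoryQueue:
    """Toy queue where every scheduled entry has a unique id, independent of the fingerprint."""

    def __init__(self):
        self._next_id = itertools.count(1)  # stand-in for an auto-incrementing field
        self._rows = {}  # entry_id -> (score, fingerprint)

    def schedule(self, fingerprint, score):
        entry_id = next(self._next_id)
        self._rows[entry_id] = (score, fingerprint)
        return entry_id

    def get_next_results(self, max_n):
        # Highest score first; ties are broken by the lower (older) entry id.
        ordered = sorted(self._rows.items(), key=lambda item: (-item[1][0], item[0]))
        batch = ordered[:max_n]
        # Delete by entry id, so other entries with the same fingerprint survive.
        for entry_id, _ in batch:
            del self._rows[entry_id]
        return [fingerprint for _, (_, fingerprint) in batch]

    def __len__(self):
        return len(self._rows)


queue = InMemoryQueue()
queue.schedule('abc', 0.5)
queue.schedule('abc', 0.9)  # same fingerprint scheduled a second time
queue.schedule('def', 0.7)
first_batch = queue.get_next_results(1)
```

Had the rows been deleted by fingerprint, fetching the first batch would also have dropped the second `'abc'` entry; deleting by entry id leaves it in the queue.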

@sibiryakov
Member

Hey @wpxgit what do you think of all that? Do you plan to contribute more?

@bnopacheco

Hello @sibiryakov @wpxgit , is there any plan to continue development?

@sibiryakov
Member

I haven't heard anything, @maisumbruno. At Scrapinghub we're fine with HBase so far.

@bnopacheco

bnopacheco commented Dec 4, 2017

We are comfortable with how Cassandra works. If there are no plans to implement this, @sibiryakov, would you have any hints on how I could do it myself?

@sibiryakov
Member

@maisumbruno Definitely. I would recommend taking inspiration from HBaseBackend, where you can find a queue suitable for large-scale crawling. You can also implement it in parts: say, first States, then Queue, and Metadata if needed. You can send a PR any time and I'll have a look.

But you know, the most important part is battle testing: on large volumes, storage starts to work slower, and this often requires refactoring, schema changes, or various optimizations.

@bnopacheco

Thanks @sibiryakov

3 participants