[WIP] Added Cassandra-Backend #128

MichaelVIU · 2016-03-30T14:47:33Z

Added Cassandra as a backend, based on the SQLAlchemy backend code.

sibiryakov · 2016-03-31T08:57:48Z

frontera/contrib/backends/cassandra/__init__.py

+        self.cluster = Cluster(cluster_ips, cluster_port)
+        self.models = dict([(name, load_object(klass)) for name, klass in models.items()])
+
+        self.session_cls = self.cluster.connect()


This looks like a session, not a session class. Maybe worth renaming?

sibiryakov · 2016-03-31T09:10:12Z

In general looks awesome! It definitely requires some small changes, but this is already a great contribution!

sibiryakov · 2016-03-31T09:10:59Z

So the tests are broken: we either have to find a way to test this with Travis CI, or disable this test for now.
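One way to keep the suite green while CI has no Cassandra available is to skip the backend test conditionally instead of deleting it. This is a minimal sketch using only the standard library. The test class and helper names are hypothetical and not taken from this PR, and the check only detects whether the driver package is installed, not whether a server is reachable.

```python
import importlib.util
import unittest

# True only when the cassandra-driver package (import name "cassandra") is installed.
HAS_CASSANDRA = importlib.util.find_spec("cassandra") is not None


@unittest.skipUnless(HAS_CASSANDRA, "cassandra-driver is not installed")
class CassandraBackendTest(unittest.TestCase):  # hypothetical test case
    def test_driver_available(self):
        # Placeholder body: the real backend tests would go here.
        self.assertTrue(HAS_CASSANDRA)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CassandraBackendTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When the driver is missing, the test is reported as skipped rather than failed, so the build stays green and the test comes back automatically once the dependency is present.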

sibiryakov · 2016-03-31T09:12:30Z

frontera/contrib/backends/cassandra/__init__.py

+    def __init__(self, manager):
+        self.manager = manager
+        settings = manager.settings
+        cluster_ips = settings.get('CASSANDRABACKEND_CLUSTER_IPS')      # Format: ['192.168.0.1', '192.168.0.2']


It would be great to include these settings in the documentation and move the format description there. I know docs are boring, but there is nothing we can do.
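As a possible starting point for that documentation entry, the sketch below shows how the setting could be read and validated. `CASSANDRABACKEND_CLUSTER_IPS` and its list-of-strings format come from the diff above. `CASSANDRABACKEND_CLUSTER_PORT`, the helper name, and the plain dict standing in for Frontera's settings object are assumptions made for illustration. 9042 is Cassandra's default native-protocol port.

```python
def read_cluster_settings(settings):
    """Return (ips, port) from a dict-like settings object."""
    ips = settings.get('CASSANDRABACKEND_CLUSTER_IPS')
    # Hypothetical setting name; 9042 is Cassandra's default CQL port.
    port = settings.get('CASSANDRABACKEND_CLUSTER_PORT', 9042)
    if not isinstance(ips, (list, tuple)) or not ips:
        raise ValueError(
            "CASSANDRABACKEND_CLUSTER_IPS must be a non-empty list, "
            "e.g. ['192.168.0.1', '192.168.0.2']"
        )
    if not all(isinstance(ip, str) for ip in ips):
        raise ValueError("every entry in CASSANDRABACKEND_CLUSTER_IPS must be a string")
    return list(ips), int(port)


# A plain dict stands in for the real settings object in this sketch.
example_settings = {'CASSANDRABACKEND_CLUSTER_IPS': ['192.168.0.1', '192.168.0.2']}
ips, port = read_cluster_settings(example_settings)
```

Failing early on a malformed value gives a clearer error than letting the driver fail later during connection.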

MichaelVIU · 2016-04-01T11:17:35Z

How can I trigger a re-test in Travis after I've made changes?

MichaelVIU · 2016-04-02T16:15:02Z

OK, my fork now runs through Travis without errors:
https://travis-ci.org/wpxgit/frontera/builds/120262078

sibiryakov · 2016-04-04T16:17:12Z

docs/README

@@ -30,6 +30,9 @@ from this dir::

 Documentation will be generated (in HTML format) inside the ``build/html`` dir.

+If you get the error "ImportError: No module named sphinx_rtd_theme" run:
+    sudo pip install   sphinx-rtd-theme


I think this is not needed, because the error message states pretty clearly that the module is missing.

sibiryakov · 2016-04-04T16:36:30Z

I'm marking this PR as WIP, meaning it's a work in progress and we shouldn't merge it. OK?

sibiryakov · 2016-04-05T09:59:39Z

frontera/contrib/backends/cassandra/models.py

+class QueueModel(Model):
+    __table_name__ = 'queue'
+
+    crawl = Text(primary_key=True)


Hi Michael. I was thinking about the data model you designed for the Queue, and it's hard for me to see the logic there. What goals did you set? What access patterns (e.g. queries) do you expect? Is this designed for broad crawling?

MichaelVIU · 2016-04-05

Hi Alex,

it is designed for:

Get next results
Here we need to filter by crawl ID and partition_id.
That's why they are the first fields of the table/model.

Sorting get_next_results (for distributed mode: score; for non-distributed mode: created_at)
As I wrote, if you want to order by a column, the columns that come before it in the clustering order must appear in the WHERE or ORDER BY part of the query.
This: "SELECT * FROM queue WHERE crawl = 'default' AND partition_id=1 ORDER BY created_at" would not work because score is missing.
If you put fingerprint before created_at, you must also specify it in the query.
That's because of the special way Cassandra stores data...

Also, fingerprint must be part of the clustering columns / primary key, because rows are deleted by key.

It's also possible to use indexes for some of the things mentioned, but they are slower and not recommended for production use cases...

Hope I've understood your question right.
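The ordering constraint described above follows from how Cassandra lays out a partition: rows are stored sorted by the clustering columns, compared left to right. The pure-Python sketch below imitates that layout. It is not driver code, and the clustering order `(score, created_at, fingerprint)` is an assumption inferred from this comment rather than taken from the PR's model.

```python
from collections import defaultdict

# Partition key -> rows kept in clustering order.
# Assumed clustering columns: (score, created_at, fingerprint).
partitions = defaultdict(list)


def insert(crawl, partition_id, score, created_at, fingerprint):
    """Add a row and keep the partition sorted by its clustering tuple."""
    rows = partitions[(crawl, partition_id)]
    rows.append((score, created_at, fingerprint))
    rows.sort()  # tuples compare left to right, like clustering columns


def scan(crawl, partition_id):
    """Read a partition in storage order."""
    return list(partitions[(crawl, partition_id)])


insert('default', 1, 0.9, 3, 'c')
insert('default', 1, 0.1, 1, 'a')
insert('default', 1, 0.5, 2, 'b')
insert('default', 1, 0.1, 4, 'd')

rows = scan('default', 1)
created_at_values = [row[1] for row in rows]
```

In this model `created_at` is ascending only among rows that share the same score (the two rows with score 0.1), not across the whole partition, which is why a query cannot simply ask for `created_at` order while skipping `score`.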

sibiryakov · 2016-04-06

Yes, you've got my question right.
Just a few remarks:

The sort column shouldn't be tied to the run mode. Sorting by score can be used in single-process mode, and sorting by creation date (FIFO) in distributed mode. It depends on the application's goals.

fingerprint is needed by Frontera to operate, but it's wrong to use it for deleting items from the queue (e.g. in get_next_results). The same fingerprint can be scheduled several times, and that's not a mistake; the application may require it. An auto-incrementing field is a possible solution.

I have big doubts that anyone will use Frontera with Cassandra in non-distributed mode. Cassandra is designed for large volumes of data and is distributed, so I would keep support for single-process mode to a minimum, mostly for debugging and testing purposes. Therefore, IMO, sorting only by score should be enough.
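The remark about deleting by fingerprint can be illustrated with a small in-memory model. This is only a sketch: the `entry_id` field and the helper names are hypothetical, and the counter stands in for the auto-incrementing field suggested above. Cassandra itself has no auto-increment column type, so a real table would need another source of unique ids, such as a UUID.

```python
import itertools

_next_id = itertools.count(1)  # stand-in for an auto-incrementing field
queue = []  # one dict per scheduled request


def schedule(fingerprint, score):
    """Add a request to the queue and return its entry."""
    entry = {'entry_id': next(_next_id), 'fingerprint': fingerprint, 'score': score}
    queue.append(entry)
    return entry


def delete_by_fingerprint(fingerprint):
    """Problematic: drops every scheduled copy of the same fingerprint."""
    queue[:] = [e for e in queue if e['fingerprint'] != fingerprint]


def delete_by_entry_id(entry_id):
    """Removes exactly the entry that was handed out."""
    queue[:] = [e for e in queue if e['entry_id'] != entry_id]


first = schedule('abc123', 0.9)
second = schedule('abc123', 0.4)  # same fingerprint scheduled again on purpose

delete_by_entry_id(first['entry_id'])
remaining_after_id_delete = [e['entry_id'] for e in queue]

delete_by_fingerprint('abc123')
remaining_after_fp_delete = list(queue)
```

Deleting by `entry_id` leaves the second scheduling of the same URL in place, while deleting by fingerprint silently discards it.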

sibiryakov · 2016-04-22T11:13:31Z

Hey @wpxgit what do you think of all that? Do you plan to contribute more?

bnopacheco · 2017-11-29T19:38:53Z

Hello @sibiryakov @wpxgit, is there any plan to continue development?

sibiryakov · 2017-11-30T12:20:58Z

I haven't heard anything, @maisumbruno. At Scrapinghub we're fine with HBase so far.

bnopacheco · 2017-12-04T15:48:20Z

We are comfortable with how Cassandra works. If there are no plans to implement this, @sibiryakov, do you have any hints on how I could do it myself?

sibiryakov · 2017-12-05T09:29:37Z

@maisumbruno Definitely. I would recommend taking inspiration from HBaseBackend, where you can find a queue suitable for large-scale crawling. You can also implement it in parts: first States, then Queue, and Metadata if needed. You can send a PR any time and I'll have a look.

But you know, the most important part is battle testing: at large volumes, storage starts to slow down, and this often requires refactoring, schema changes, or various optimizations.

bnopacheco · 2017-12-05T17:13:32Z

Thanks @sibiryakov

MichaelVIU added 6 commits February 26, 2016 17:10

Initial Commit with Cassandra Backend

4f9c92a

Created Cassandra Backend based on SELAlchemy files

125622e

Changed Cassandra Backend to run in Dirsibuted Mode, Made much perfor…

7bb2605

…mance enhancements

Added cassandra-driver to test requirements

6238355

Removed not existing classes from test

039fd0e

Changed Loggin in defaults back to false

667e65c

sibiryakov reviewed Mar 31, 2016

MichaelVIU added 8 commits April 1, 2016 13:22

Remove .idea

559400e

Changed nameing of session from session_cls to session

4ffe038

Changed retry logic to cassanrda driver based retry logic. Isn't test…

2e95323

…ed because of the complex test-setup

Added Documentation about Cassandra settings.

c53c3f7

Updated Docs readme to install rdt theme if missing. Updated Code to only run in FIFO algorithm because cassandras limiting ORDER functionality

change the usage of crawl_id

e190256

Undo changes on conf.py, backend/init, db.py

b1bf7bc

Changes to get Travis run

2019a71

Next Try Travis...

ed6b561

MichaelVIU added 4 commits April 2, 2016 17:49

Make counting Table optional, own class for counting, update dokument…

fef1d83

…ation

Removed Caching from Metadata. Some changes to fit travis ci conventions

5a0967f

Code convention changes

29a0bdd

Merge https://github.com/scrapinghub/frontera

2919e79

Conflicts: requirements/tests.txt

sibiryakov reviewed Apr 4, 2016

sibiryakov changed the title ~~Added Cassandra-Backend~~ [WIP] Added Cassandra-Backend Apr 4, 2016

sibiryakov reviewed Apr 5, 2016

sibiryakov mentioned this pull request Sep 26, 2016

Added tests for filters, formatters, handlers #206

Merged

3 tasks

voith mentioned this pull request Nov 11, 2016

[WIP] Added Cassandra backend #225

Closed
