Add partitioned support to Sequence and Table sources #8
base: master
Conversation
@martindurant, let me know if you think this approach works for handling partitions in intake-solr.
@@ -28,18 +30,25 @@ class SOLRSequenceSource(base.DataSource):
    zoocollection: bool or str
        If using Zookeeper to orchestrate SOLR, this is the name of the
        collection to connect to.
    partition_len: int or None
        The desired partition size. [default: 1024]
We add a new parameter that limits the number of rows that are returned per partition.
I looked at intake-es and it seems they use `npartitions` as an input instead of `partition_len`.
❓ Let me know if you prefer that option.
I don't have a strong preference, but consistency might be good
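For illustration, the two parametrizations are interchangeable once the total hit count is known; a minimal sketch (helper names are made up, not part of this diff):

```python
import math

def npartitions_from_len(total_hits: int, partition_len: int) -> int:
    """Number of partitions implied by a fixed partition length."""
    return math.ceil(total_hits / partition_len)

def len_from_npartitions(total_hits: int, npartitions: int) -> int:
    """Partition length implied by a fixed partition count (intake-es style)."""
    return math.ceil(total_hits / npartitions)

# e.g. 10_000 hits with partition_len=1024 -> 10 partitions of at most 1024 rows
```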
self.partition_len = partition_len

if partition_len and partition_len <= 0:
    raise ValueError(f"partition_len must be None or positive, got {partition_len}")
When `partition_len` is None, we get the old behavior of not setting the row count.
We should verify that the old behavior was actually working. On my system, if we don't set the number of rows to return, then we only get back the first ten records of the dataset.
A test would be good. Does setting partition_len -> +inf (very large number) work for one-partition output?
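As a hedged illustration of the behavior described above (placeholder core URL): Solr's default row limit is 10 unless `rows` is set, which is why an unset row count only returned the first ten records.

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore")  # hypothetical core URL

results = solr.search("*:*")            # no `rows` -> Solr returns at most 10 docs
print(results.hits, len(results.docs))  # total hit count vs. docs actually returned

# a very large `rows` value effectively gives one partition holding everything
everything = solr.search("*:*", rows=results.hits)
```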
"""Do a 0 row query and get the number of hits from the response""" | ||
qargs = self.qargs.copy() | ||
qargs["rows"] = 0 | ||
start = qargs.get("start", 0) |
The user may want to start the query at a different position, so we take that into account.
📜 If we add support for cursors, then we can't use the `start` option, according to the SOLR documentation.
I don't know what people would normally use.
Does offsetting with start cause the server to scan the whole table, or is solr smart here?
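A standalone sketch of the hit-count lookup in this hunk, assuming the source holds a `pysolr.Solr` client and a plain dict of query parameters:

```python
import pysolr

def count_hits(solr: pysolr.Solr, query: str, qargs: dict) -> int:
    """Return the number of hits reachable from the user's `start` offset."""
    qargs = dict(qargs)
    qargs["rows"] = 0                 # fetch no documents, only the hit count
    start = qargs.get("start", 0)     # honor a user-supplied offset
    results = solr.search(query, **qargs)
    return results.hits - start

# Per the Solr documentation, cursorMark requires start=0, so a cursor-based
# implementation could not also honor a user-supplied `start`.
```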
datashape=None,
dtype=None,
shape=(results.hits - start,),
❓ What is the difference between datashape and shape?
datashape isn't used; it was meant for forward compatibility with complex types (struct, nested list)
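For context, a rough sketch of the object these fields feed into (assuming intake's base `Schema`; the values are placeholders and the exact fields depend on the installed intake version):

```python
from intake.source.base import Schema

total_hits, start, npartitions = 10_000, 0, 10   # placeholder values

schema = Schema(
    datashape=None,                # unused; kept for forward compatibility
    dtype=None,                    # unknown until a partition is actually fetched
    shape=(total_hits - start,),   # records reachable from the user's offset
    npartitions=npartitions,
    extra_metadata={},
)
```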
if self.partition_len is not None:
    qargs["start"] = qargs.get("start", 0) + index * self.partition_len
    qargs["rows"] = self.partition_len
return self.solr.search(self.query, **qargs)
Let's return the raw results of the query at this point. There are other valuable fields, like `facets`, that can be used in sub-classes.
ok.
Are facets another useful way to partition? Are there shards too?
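A self-contained sketch of the per-partition read this hunk implements, returning the raw `pysolr.Results` so subclasses can still reach fields such as `facets` (the function signature is made up for illustration):

```python
import pysolr

def read_partition(solr: pysolr.Solr, query: str, qargs: dict,
                   index: int, partition_len: int) -> pysolr.Results:
    """Fetch partition `index` as a raw pysolr Results object."""
    qargs = dict(qargs)
    qargs["start"] = qargs.get("start", 0) + index * partition_len
    qargs["rows"] = partition_len
    return solr.search(query, **qargs)
```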
def _get_partition(self, _):
    """Downloads all data
schema = super()._get_schema()
We get the Schema from SOLRSequenceSource. This contains the number of partitions and the total number of records, but not the dtype.
Is it not worth grabbing the result head to figure this out (on request)?
intake_solr/source.py (outdated)
"""Downloads all data | ||
schema = super()._get_schema() | ||
|
||
df = self._get_partition(0) |
This loads the first partition into a dataframe and uses it to discover the returned schema. Note that the schema might be different from the overall SOLR core schema because the user can select a subset of fields using the `fl` qarg.
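A hedged sketch of that first-partition schema discovery (placeholder URL and `fl` fields); the inferred columns track whatever `fl` selected, not the full core schema:

```python
import pandas as pd
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore")    # hypothetical core URL

first = solr.search("*:*", rows=1024, fl="id,name")        # first partition only
df = pd.DataFrame(first.docs)
dtype = {name: str(dt) for name, dt in df.dtypes.items()}  # e.g. {"id": "object", ...}
```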
@@ -9,7 +9,7 @@
from .util import start_solr, stop_docker, TEST_CORE

CONNECT = {'host': 'localhost', 'port': 9200}
TEST_DATA_DIR = 'tests'
TEST_DATA_DIR = os.path.abspath(os.path.dirname(__file__))
I was having issues running the tests from within PyCharm; this fixed the problem for me.
seems good practice - should not require a particular CWD
@@ -1,7 +1,4 @@
from ._version import get_versions
__version__ = get_versions()['version']
del get_versions

import intake  # Import this first to avoid circular imports during discovery.
del intake
I got exceptions when running with dask distributed. Removing these lines fixed the issue.
With the move to entrypoints to declare the drivers, I hope this is no longer needed
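For reference, a sketch of the entrypoint-based registration being referred to (the driver names here are illustrative, not taken from this repo's setup.py):

```python
from setuptools import setup

setup(
    name="intake-solr",
    entry_points={
        "intake.drivers": [
            "solr = intake_solr.source:SOLRSequenceSource",
            "solrtab = intake_solr.source:SOLRTableSource",
        ]
    },
)
```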
Looking pretty good.
I think some tests will clarify usage and correctness.
The current code clearly demonstrates my ignorance of how SOLR really works - thank you for this!
self._load_metadata()
return dask.dataframe.from_delayed(
    [delayed(self.read_partition)(i) for i in range(self.npartitions)]
There is also `bag.to_dataframe`, which may be less code and would reuse the sequence partitions.
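To make the comparison concrete, a small standalone sketch of both routes (the stub readers stand in for the real per-partition Solr fetch):

```python
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
from dask import delayed

npartitions = 4

def read_docs(i):
    """Stand-in for fetching one partition's documents from Solr."""
    return [{"id": i * 2, "name": "a"}, {"id": i * 2 + 1, "name": "b"}]

def read_partition(i):
    """Stand-in for the dataframe-per-partition read used in the hunk above."""
    return pd.DataFrame(read_docs(i))

# Route 1 (as in the hunk): one delayed pandas frame per partition.
ddf = dd.from_delayed([delayed(read_partition)(i) for i in range(npartitions)])

# Route 2 (the suggestion): reuse the sequence partitions as a bag of dicts
# and let dask build the dataframe from them.
ddf2 = db.from_delayed([delayed(read_docs)(i) for i in range(npartitions)]).to_dataframe()
```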
Hi @martindurant, I was pulled into a different task at work, and I won't be able to come back to this for about a month. If it is okay, I would like to keep this open as a draft.
No problem - ping me when you want me to have a look.
This PR tries to add support for partitioned access to a Solr collection. By default, we define `partition_len` to be 1024 records. During metadata lookup we can get the total number of records in the Solr collection from the query's hit count. The number of partitions then becomes `ceil(numRecords / partition_len)`.
📜 Note on `to_dask`: the problem is that cursors can only be obtained by iterating one page at a time (this can't be parallelized). Fortunately, we only need the document IDs and not the entire SOLR response, so the network transfer cost is small.
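As a rough sketch of the cursor-based ID collection described above (placeholder URL; `id` is assumed to be the uniqueKey, and the `nextCursorMark` handling follows pysolr's documented behavior):

```python
import math
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore")   # hypothetical core URL
partition_len = 1024

ids, cursor = [], "*"
while True:
    results = solr.search("*:*", rows=partition_len, fl="id",
                          sort="id asc", cursorMark=cursor)
    ids.extend(doc["id"] for doc in results.docs)
    if results.nextCursorMark == cursor:   # cursor did not advance -> no more pages
        break
    cursor = results.nextCursorMark

npartitions = math.ceil(len(ids) / partition_len)   # ceil(numRecords / partition_len)
```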