
Feature datasource standalone #81


Closed
gfl94 wants to merge 6 commits

Conversation

gfl94 (Member) commented Jan 6, 2017

OSM, GeoJSON, and Shapefile datasource support for Simba.

Resolves #80.

@@ -31,6 +32,9 @@ private[simba] class ShapeType extends UserDefinedType[Shape] {
s match {
case o: Shape =>
new GenericArrayData(ShapeSerializer.serialize(o))
case g: JTSPolygon => // An ugly hack here
gfl94 (Member Author) commented on the diff:

@Skyprophet I hit a serialization problem here. I've wrapped all elements in ShapefileRdd with a Simba Shape class, but it still gives me a JTSPolygon serialization issue, so I added this hack to fix it. If the following three lines are commented out, I get a MatchError; here is the error stack:

scala.MatchError: POLYGON ((123.37969855694418 -10.691464934618121, 123.25330435203686 -10.73496681414191, 123.22308124558195 -10.829817485613273, 122.82189151450935 -10.91975208822901, 122.8071467465544 -10.780668259096789, 123.04320974794253 -10.729391677999741, 123.19637927774315 -10.592360700189607, 123.3108896398211 -10.592360700189607, 123.21251782973363 -10.579743286815225, 123.37764455941812 -10.430828466175731, 123.42540000189905 -10.47154163142446, 123.36986871164088 -10.543871687163643, 123.44227212443457 -10.495456031192182, 123.3439003143471 -10.586051993502416, 123.42613357244409 -10.62397759068006, 123.37969855694418 -10.691466080822098, 123.37969855694418 -10.691464934618121, 123.37969855694418 -10.691464934618121)) (of class com.vividsolutions.jts.geom.Polygon)
	at edu.utah.cs.simba.ShapeType.serialize(ShapeType.scala:32)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
	at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
	at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

Do you have any ideas about the UDT serialization problem here?
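
For context, the three-line hack amounts to an extra match arm that rewraps the leaked JTS object before serializing. A minimal sketch, assuming JTSPolygon aliases com.vividsolutions.jts.geom.Polygon and that Simba's Polygon companion provides an apply(JTSPolygon) constructor (names inferred from this thread, not confirmed by the diff):

override def serialize(obj: Any): Any = obj match {
  case o: Shape =>
    new GenericArrayData(ShapeSerializer.serialize(o))
  case g: JTSPolygon => // the hack: rewrap the raw JTS polygon as a Simba Shape
    new GenericArrayData(ShapeSerializer.serialize(Polygon(g)))
}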

Skyprophet (Member) commented:

Show me where you construct the Shape object in the data source.

val gf = new GeometryFactory()
val shapes = ShapeFile.Parser(path)(gf)
val shapefileRdd = sqlContext.sparkContext
.parallelize(shapes).map(_.g match {
gfl94 (Member Author) commented:

@Skyprophet Here. Every _.g is a JTSGeometry, and it's transformed into a Simba Shape with an apply function in the following lines.
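
A hedged sketch of that wrapping step (the apply constructor and the geometry case shown are illustrative, inferred from the discussion rather than the actual diff):

val shapefileRdd = sqlContext.sparkContext
  .parallelize(shapes).map(_.g match {
    case p: JTSPolygon => Polygon(p) // wrap the raw JTS polygon in a Simba Shape
    case g => throw new UnsupportedOperationException(s"unhandled geometry: $g")
  })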

Skyprophet (Member) commented:

Provide it an explicit schema and try again.
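
For reference, passing an explicit schema with a UDT column might look like the following sketch (the field name and the direct use of ShapeType are assumptions for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("shape", new ShapeType, nullable = false)
))
// Row(...) stores the value as a single field instead of unpacking a Product,
// so the Simba Shape reaches the UDT serializer intact.
val rowRdd = shapefileRdd.map(s => Row(s))
val df = sqlContext.createDataFrame(rowRdd, schema)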

merlintang commented Jan 8, 2017 via email

gfl94 (Member Author) commented Jan 8, 2017

The problem comes from the Row.fromTuple method, which uses Product to extract tuple fields. Simba's Polygon is a light wrapper around JTSPolygon, so it gets unpacked by Row.fromTuple. I switched to a different Row construction method, and it works now.
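
To illustrate the pitfall (jtsPolygon stands for some parsed JTS geometry; Polygon is Simba's case-class wrapper, so it is itself a Product):

import org.apache.spark.sql.Row

// Row.fromTuple walks productIterator, so the wrapper is unpacked and the
// raw JTSPolygon leaks into the row, triggering the MatchError above:
val leaky = Row.fromTuple(Polygon(jtsPolygon))
// Constructing the Row directly keeps the Shape intact for the UDT:
val intact = Row(Polygon(jtsPolygon))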

gfl94 (Member Author) commented Jan 8, 2017

@merlintang Simba has used a UDT for spatial datatypes since it was ported to the standalone branch; it is an easy and feasible way to represent spatial datatypes.

The original approach to processing spatial datatypes in simba-spark-1.6 used a ShapeConverter as the spatial datatype converter, which saves the time spent serializing and deserializing objects.

Further performance evaluation can be conducted if needed :)
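
For readers unfamiliar with the UDT approach, a minimal skeleton in the Spark 1.6 UserDefinedType style (a sketch: the sqlType and deserialize bodies are assumptions consistent with the serialize arm shown in the diff above):

private[simba] class ShapeType extends UserDefinedType[Shape] {
  override def sqlType: DataType = ArrayType(ByteType, containsNull = false)
  override def userClass: Class[Shape] = classOf[Shape]
  override def serialize(obj: Any): Any = obj match {
    case s: Shape => new GenericArrayData(ShapeSerializer.serialize(s))
  }
  override def deserialize(datum: Any): Shape = datum match {
    case values: ArrayData => ShapeSerializer.deserialize(values.toByteArray)
  }
}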

dongx-psu (Member) commented:
@gfl94 Could you write some unit tests using scalatest for this group of features? Make sure everything works and then inform me.
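
A hedged sketch of what such a test might look like (the datasource format name, resource path, and SimbaContext setup are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class ShapefileDataSourceSuite extends FunSuite {
  test("shapefile datasource loads rows with Shape geometry") {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    val simbaContext = new SimbaContext(sc) // Simba's SQLContext subclass (assumed)
    val df = simbaContext.read
      .format("shapefile") // short name of the datasource (assumed)
      .load("src/test/resources/sample.shp")
    assert(df.count() > 0)
    sc.stop()
  }
}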

merlintang commented:
@gfl94 Can you merge this function into 2.1 as well? I think this feature is very important.

gfl94 (Member Author) commented Feb 19, 2017

@Skyprophet Unit tests added, please check :)

dongx-psu (Member) commented:
@gfl94 Try importing these into the standalone-2.1 branch and see if it works. Be careful about the package scope, since we are changing it to org.apache.spark.simba. Thanks!

dongx-psu closed this Apr 14, 2017