We have a data generator for a KMeans benchmark and want to use it with the PEEL framework.
The generator runs as a Flink job and produces two files, points and centers. We want to save these files in `<hdfs-root-directory>/kmeans` using the `GeneratedDataSet` class and then pick them up with the KMeans Flink job.
My question is: how can we configure PEEL to create the directory `kmeans` in HDFS and then copy the files to that directory? Our current configuration, shown below, does not work.
<!--************************************************************************
* Data Generators
*************************************************************************-->
<bean id="datagen.kmeans" class="org.peelframework.flink.beans.job.FlinkJob">
    <constructor-arg name="runner" ref="flink-1.0.3"/>
    <constructor-arg name="command">
        <value><![CDATA[
          -v -c org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
          ${app.path.datagens}/KMeans.jar \
          --points ${datagen.points} \
          --k ${datagen.k} \
          --output ${system.hadoop-2.path.input}/kmeans
        ]]>
        </value>
    </constructor-arg>
</bean>
<!--************************************************************************
* Data Sets
*************************************************************************-->
<bean id="dataset.kmeans.generated" class="org.peelframework.core.beans.data.GeneratedDataSet">
    <constructor-arg name="src" ref="datagen.kmeans"/>
    <constructor-arg name="dst" value="${system.hadoop-2.path.input}/kmeans"/>
    <constructor-arg name="fs" ref="hdfs-2.7.1"/>
</bean>
The usage of our data generator is similar to the `WordGenerator`, except that it produces two files instead of just one.
Do you have an idea how we could solve this problem with PEEL, or do we have to adjust our data generator?
Thanks!
aalexandrov commented on Sep 29, 2016
I think this should work, and at runtime the generator will run only once.
I suggest trying the latest SNAPSHOT from master, as it fixes some issues related to the setup / teardown logic of systems and dataset materialization.
noproblem666 commented on Sep 29, 2016
Thanks for your fast reply!
Unfortunately, this does not work. I got the following exception:
It seems that the bean cannot be used twice.
aalexandrov commentedon Sep 29, 2016
Then duplicate the bean definition as well (using the same
command
andrunner
values).noproblem666 commentedon Sep 29, 2016
Sorry, my fault! I forgot to change the bean id for each `GeneratedDataSet`.
Now the experiment starts, but the job does not finish successfully.
This is the experiment configuration for the data generation:
This is the error message from stdout:
And this is from the log `run.err`:
Do you have an idea what could be wrong?
Thanks a lot!
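For readers following along, the fix discussed in this thread (one datagen bean per `GeneratedDataSet`, each with its own bean id) might look roughly like the sketch below. It is not the configuration that was actually posted: the bean ids and the `points` / `centers` file names under `dst` are assumptions made for illustration, while the `command` and `runner` values are copied from the original `datagen.kmeans` bean.

```xml
<!-- Sketch only: bean ids are hypothetical; command and runner are copied
     unchanged from the original datagen.kmeans bean. -->
<bean id="datagen.kmeans.points" class="org.peelframework.flink.beans.job.FlinkJob">
    <constructor-arg name="runner" ref="flink-1.0.3"/>
    <constructor-arg name="command">
        <value><![CDATA[
          -v -c org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
          ${app.path.datagens}/KMeans.jar \
          --points ${datagen.points} \
          --k ${datagen.k} \
          --output ${system.hadoop-2.path.input}/kmeans
        ]]></value>
    </constructor-arg>
</bean>

<!-- Duplicate of the bean above under a second (hypothetical) id. -->
<bean id="datagen.kmeans.centers" class="org.peelframework.flink.beans.job.FlinkJob">
    <constructor-arg name="runner" ref="flink-1.0.3"/>
    <constructor-arg name="command">
        <value><![CDATA[
          -v -c org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
          ${app.path.datagens}/KMeans.jar \
          --points ${datagen.points} \
          --k ${datagen.k} \
          --output ${system.hadoop-2.path.input}/kmeans
        ]]></value>
    </constructor-arg>
</bean>

<!-- One GeneratedDataSet per output file, each with a unique id.
     The file names under dst are assumed, not confirmed by the thread. -->
<bean id="dataset.kmeans.points" class="org.peelframework.core.beans.data.GeneratedDataSet">
    <constructor-arg name="src" ref="datagen.kmeans.points"/>
    <constructor-arg name="dst" value="${system.hadoop-2.path.input}/kmeans/points"/>
    <constructor-arg name="fs" ref="hdfs-2.7.1"/>
</bean>

<bean id="dataset.kmeans.centers" class="org.peelframework.core.beans.data.GeneratedDataSet">
    <constructor-arg name="src" ref="datagen.kmeans.centers"/>
    <constructor-arg name="dst" value="${system.hadoop-2.path.input}/kmeans/centers"/>
    <constructor-arg name="fs" ref="hdfs-2.7.1"/>
</bean>
```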
aalexandrov commented on Sep 29, 2016
Try
(with an extra `/` at the beginning).
aalexandrov commented on Sep 29, 2016
Actually, can you show me the Java / Scala code that parses the `--output $PATH` value?
noproblem666 commented on Oct 13, 2016
Sorry for the delay!
We use the KMeans benchmark and the KMeans data generator from the "official" Flink examples on GitHub:
https://github.com/apache/flink/blob/d7b59d761601baba6765bb4fc407bcd9fd6a9387/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java