Add support for Data Frames #8
@chrisbetz I've been looking into this and have a local branch wrapping the DataFrame API. How do you plan to version Sparkling across Spark versions? Would you rather try to support both in a release or keep things separate? I put up a minimal example of what it took to get my flambo tests green in 1.3 here: sorenmacbeth/flambo#48. Might be hard to run things in parallel, but once there's essential 1.3 compat I don't imagine it'd be too hard to build out some functions to work with DataFrames.
Hi, thanks for offering to contribute! That's great. I've not looked into the DataFrame API any further (I just checked out the announcement document), but it looks promising and I would really like to support it. Concerning the versioning, I'd like to think about that over the Easter holiday. Currently I see two options: a) having different namespaces in the same project. If you see any other good options, just tell me. I'll come back to you regarding this. Sincerely, Chris
@chrisbetz Great. Both of those sound like viable options; once you pick a route, I'll see where we could take support for this. Have a great Easter.
Hi guys! Any update on providing support for Data Frames?
Hi, sorry, no support for that yet, as we need to support at least Spark 1.1 from CDH, Spark 1.2.x, and Spark 1.3, and I need to find a way to support all of them. Currently I'm busy with serialization tasks. DataFrame support will definitely be the next thing to add, so stay tuned. Sorry, but coming up with a way forward requires some research and testing.
@erasmas I'm working on getting DataFrame support into Flambo at the moment, since that's what I'm using in prod (looking at switching to Sparkling once I get some time to compare). The code's getting there, but I've been having some issues getting Spark 1.3 to run on the cluster for final testing.
Hi @chrisbetz @chetmancini Any updates on Data Frames support?
I may have some time to wrap some of the code that I've written, but I've only ever used Spark 1.5.x. @chrisbetz let me know how you'd like to proceed.
Out of interest, what form would a DataFrames wrapper take? For the reading & queries side of things, would it be some declarative DSL, similar to Datomic Datalog for example?
I doubt it would look like Datalog, considering Sparkling's existing RDD API. Going too far beyond that would probably impose quite an impedance mismatch.
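If the thin-wrapper route were taken, a minimal sketch in Sparkling's existing threading style might look like the following. Everything here (`select`, `where`, the namespace name) is hypothetical, written against the Spark 1.x Java `DataFrame` API, and not part of any released Sparkling version.

```clojure
;; Hypothetical sketch of a thin, thread-friendly DataFrame wrapper in
;; the style of Sparkling's RDD API. Nothing here is released code.
(ns sparkling.sql-sketch
  (:import [org.apache.spark.sql DataFrame]))

(defn select
  "Project the named columns, returning a new DataFrame."
  [^DataFrame df col & cols]
  ;; Spark's Scala varargs surface in Java as (String, String...),
  ;; hence the explicit typed array.
  (.select df (name col) (into-array String (map name cols))))

(defn where
  "Filter rows with a SQL condition string."
  [^DataFrame df ^String condition]
  (.filter df condition))

;; Usage: plain threading, no DSL needed:
;; (-> df (select :name :age) (where "age > 21"))
```

The appeal of this style is that it composes with `->` exactly like Sparkling's RDD functions, rather than introducing a separate query DSL.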
I'd really like to help; I started putting something together the other day: nabacg@ae935a5
I would like to help, but I've not had much time to work on this lately. What I have is mostly just code that uses DataFrames; I hadn't really packaged it up as anything reusable.
Hi,
I have the following functionality that I could add:
One of the bigger outstanding problems that I see is how DataFrame joins work. The Java syntax needs a good macro wrapper, but I haven't had time to finish my attempt. I don't want to step on @MarchLiu's efforts, so I'll wait until his changes are sorted out before I throw any of this into the mix. It looks solid; I like how you made thread-ability a key part of your implementation. There were a couple of spots where I should have done that but didn't.
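For illustration, the raw Java join syntax being referred to looks roughly like this from Clojure. The helper and all column names below are hypothetical, sketched against the Spark 1.x Java `DataFrame` API:

```clojure
;; Verbose enough that a macro or helper becomes tempting. Hypothetical
;; example against the Spark 1.x Java DataFrame API.
(defn join-on
  "Equi-join two DataFrames on a shared column name.
   join-type is e.g. \"inner\" or \"left_outer\"."
  [df1 df2 col-name join-type]
  (.join df1 df2
         (.equalTo (.col df1 col-name) (.col df2 col-name))
         join-type))

;; (join-on users orders "user_id" "left_outer")
```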
@NeilMenne I would very much be interested in Parquet support, if possible.
To be clear, you can work with DataFrames and use Parquet files in the existing version, it's just annoying. You have to use the Java API more or less directly, and it suffers from some warts in Java <-> Scala interop, particularly in the area of varargs. I used it successfully, but there was plenty of ugly code creating and filling type-specific arrays, and weird calls where you pass one string and then an array of strings, etc. It is currently doable, just ugly.
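As a concrete (hypothetical) illustration of those warts, here is the "one string, then an array of strings" shape from Clojure; the column names are made up:

```clojure
;; Spark's Scala varargs appear in Java as (String, String...), so a
;; typed array has to be built by hand from Clojure. Made-up columns.
(defn select-cols [df]
  (.select df "first_col" (into-array String ["second_col" "third_col"])))

(defn grouped-counts [df]
  ;; Same shape again for groupBy; an empty typed array is still needed
  ;; when grouping by a single column.
  (.count (.groupBy df "first_col" (into-array String []))))
```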
I talk a bit about it in the talk I gave at ClojureConj in 2015: https://youtu.be/ARBiyYyW4Ow?t=689
Slides (starting around slide 20-21): https://www.slideshare.net/ZalandoTech/spark-clojure-for-topic-discovery-zalando-tech-clojureconj-talk
EDIT: I posted this before I saw the 2.0 sparkling release, obviously!
I no longer have access to the code I wrote at OpenTable. If there's a need for it, I could probably do a clean-room implementation. I still use Spark at my current position, so it's fresh in my mind.
@NeilMenne That would be great, especially in the area of more idiomatic support for Parquet and RDD <-> DataFrame conversion. If it's a ton of work, don't worry about it, but it would definitely be useful if you had the time.
I'll have to get back up to speed on Sparkling, but I'll see what I can do.
Awesome! Thanks so much.
My team has a hack project coming up, and we were planning on using Sparkling as part of the implementation. I'm going to take a crack at building an API to data frames. If successful, I'll submit it as a PR. Just for background, it seems like there is some support already using a combination of the Java API + the new SQL API that was added in 2.x. Are there any examples of using the new SQL API and/or native (to Sparkling) data frame support? I just want to get a good picture of where I'm starting from, in hopes I can avoid duplicating effort.
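For context, the "Java API + new SQL API" combination mentioned above can be driven straight through interop, roughly like this. This is a sketch only; the path, view name, and query are made up:

```clojure
;; Sketch of the Spark 2.x SparkSession SQL API via plain Java interop.
;; All data-specific names below are illustrative.
(ns example.sql-sketch
  (:import [org.apache.spark.sql SparkSession]))

(defn query-events [^SparkSession spark]
  ;; Register a JSON file as a temporary view, then query it with SQL.
  (-> (.read spark)
      (.json "events.json")
      (.createOrReplaceTempView "events"))
  (.sql spark "SELECT user_id, count(*) AS n FROM events GROUP BY user_id"))
```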
Hi,
Unfortunately, I do not have working examples for this. Maybe anybody out there?
Please share your question on the Sparkling Google group. If you ask on Twitter, I can retweet from gorillalabs to help reach out.
Happy hacking!
Chris
I submitted a PR adding DataFrame support.
I realise that I may be flogging a dead horse, and that another PR was merged in instead of @MafcoCinco's; however, there were some really nice utility functions which @MafcoCinco had written which I would have loved. Specifically, the dataframe->rdd-of- functions. @chrisbetz, would you be open to negotiation on bringing in some of these functions, or has that ship sailed? Should I rather be building these functions as a utility library for my projects? Again, forgive me if this is out of line; I just think they're incredibly useful utilities and something I've found myself reaching for recently.
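For readers reaching for the same thing, the general shape of a `dataframe->rdd-of-maps`-style helper can be sketched as below. This is not the code from the unmerged PR, just an illustration of the idea; it assumes `sparkling.core/map` and Sparkling's usual function-serialization caveats.

```clojure
;; Hypothetical sketch, NOT the code from the unmerged PR: turn each
;; Row of a DataFrame into a Clojure map keyed by column name.
(require '[sparkling.core :as spark])

(defn dataframe->rdd-of-maps [df]
  (let [cols (mapv keyword (.columns df))]
    (spark/map (fn [row]
                 ;; Row/get takes a positional index.
                 (zipmap cols (map #(.get row %) (range (count cols)))))
               (.javaRDD df))))
```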
Hi,
thanks for your input, and yes, I'm open to these additions. If you like, just create a PR with the things you'd like to see and I will look into it after my vacation.
Cheers,
Chris
See https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html