Changes for 0.4.0
This PR prepares the 0.4.0 release.

It includes the changes below (illustrated with a rough sketch after the list):
  - Supports PERMISSIVE/DROPMALFORMED parse modes and a corrupt-record option. #107
  - Changes the default values for valueTag and attributePrefix to avoid always requiring field escaping for some APIs. #142
  - Deprecates saveAsXmlFile and promotes the use of write(). #150
  - Deprecates xmlFile and promotes the use of read(). #150
  - Drops Spark 1.x compatibility from 0.4.0. #150
  - Drops support for UserDefinedType, as it became private. #150
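
The following Scala sketch (not taken from this PR) roughly illustrates the new-style usage these changes point to; the option names `mode` and `columnNameOfCorruptRecord`, the file paths, and the column names are assumptions for the example:

```scala
// Sketch only: run in spark-shell, where `sc` is predefined.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// read() replaces the deprecated xmlFile(); the parse mode and
// corrupt-record option names used here are assumptions.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_malformed")
  .load("books.xml")

// write() replaces the deprecated saveAsXmlFile().
df.select("author", "_id").write
  .format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .save("newbooks.xml")
```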

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #176 from HyukjinKwon/version-0.4.0.
HyukjinKwon committed Sep 10, 2016
1 parent a97bc33 commit 8e15a09
Showing 2 changed files with 13 additions and 134 deletions.
README.md: 145 changes (12 additions, 133 deletions)
@@ -10,36 +10,41 @@ The structure and test tools are mostly copied from [CSV Data Source for Spark](

## Requirements

This library requires Spark 1.3+
This library requires Spark 2.0+ for 0.4.x.

For a version that works with Spark 1.x, please check [branch-0.3](https://github.com/databricks/spark-xml/tree/branch-0.3).


## Linking
You can link against this library in your program at the following coordinates:

### Scala 2.10

```
groupId: com.databricks
artifactId: spark-xml_2.10
version: 0.3.4
version: 0.4.0
```

### Scala 2.11

```
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.3.4
version: 0.4.0
```
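
(An aside, not part of the original README.) In an sbt build, these coordinates would typically be declared roughly as follows; `%%` picks the `_2.10` or `_2.11` suffix from the project's `scalaVersion`:

```scala
// build.sbt sketch: resolves to spark-xml_2.10 or spark-xml_2.11
// depending on the project's scalaVersion.
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.0"
```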

## Using with Spark shell
This package can be added to Spark using the `--packages` command line option. For example, to include it when starting the spark shell:

### Spark compiled with Scala 2.10
```
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.3.4
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.0
```

### Spark compiled with Scala 2.11
```
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.4
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.0
```

## Features
@@ -173,7 +178,6 @@ OPTIONS (path "books.xml", rowTag "book")
```
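
The `OPTIONS` clause above is the tail of a Spark SQL `CREATE TABLE ... USING` statement whose opening lines are collapsed in this diff. A minimal sketch of issuing such a statement from Scala follows; the table name and exact statement form are assumptions, not taken from the README:

```scala
// Sketch: register an XML-backed temporary table via Spark SQL.
// Assumes `sqlContext` is in scope (e.g. in spark-shell); the table
// name "books" is chosen for the example.
sqlContext.sql("""
  CREATE TEMPORARY TABLE books
  USING com.databricks.spark.xml
  OPTIONS (path "books.xml", rowTag "book")
""")
```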

### Scala API
__Spark 1.4+:__

```scala
import org.apache.spark.sql.SQLContext
@@ -222,50 +226,7 @@ selectedData.write
.save("newbooks.xml")
```

__Spark 1.3:__

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(sc)
val df = sqlContext.load(
  "com.databricks.spark.xml",
  Map("path" -> "books.xml", "rowTag" -> "book"))

val selectedData = df.select("author", "_id")
selectedData.save("com.databricks.spark.xml",
  SaveMode.ErrorIfExists,
  Map("path" -> "newbooks.xml", "rootTag" -> "books", "rowTag" -> "book"))
```

You can manually specify the schema when reading data:
```scala
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
  StructField("_id", StringType, nullable = true),
  StructField("author", StringType, nullable = true),
  StructField("description", StringType, nullable = true),
  StructField("genre", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("publish_date", StringType, nullable = true),
  StructField("title", StringType, nullable = true)))

val df = sqlContext.load(
  "com.databricks.spark.xml",
  customSchema,
  Map("path" -> "books.xml", "rowTag" -> "book"))

val selectedData = df.select("author", "_id")
selectedData.save("com.databricks.spark.xml",
  SaveMode.ErrorIfExists,
  Map("path" -> "newbooks.xml", "rootTag" -> "books", "rowTag" -> "book"))
```
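
For contrast with the removed Spark 1.3 example above (this sketch is not part of the commit, and the field list is abbreviated), specifying a schema with the Spark 2.0-era reader looks roughly like this:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)  // `sc` as provided by spark-shell

val customSchema = StructType(Array(
  StructField("_id", StringType, nullable = true),
  StructField("author", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("title", StringType, nullable = true)))

// schema() on the reader replaces the schema argument of the old load() call.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .schema(customSchema)
  .load("books.xml")
```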

### Java API
__Spark 1.4+:__

```java
import org.apache.spark.sql.SQLContext;
@@ -312,58 +273,8 @@ df.select("author", "_id").write()
.save("newbooks.xml");
```



__Spark 1.3:__

```java
import java.util.HashMap;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

SQLContext sqlContext = new SQLContext(sc);

HashMap<String, String> readOptions = new HashMap<String, String>();
readOptions.put("rowTag", "book");
readOptions.put("path", "books.xml");
DataFrame df = sqlContext.load("com.databricks.spark.xml", readOptions);

HashMap<String, String> writeOptions = new HashMap<String, String>();
writeOptions.put("rowTag", "book");
writeOptions.put("rootTag", "books");
writeOptions.put("path", "newbooks.xml");
df.select("author", "_id").save("com.databricks.spark.xml", SaveMode.ErrorIfExists, writeOptions);
```

You can manually specify schema:
```java
import java.util.HashMap;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.*;

SQLContext sqlContext = new SQLContext(sc);
StructType customSchema = new StructType(new StructField[] {
    new StructField("_id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("author", DataTypes.StringType, true, Metadata.empty()),
    new StructField("description", DataTypes.StringType, true, Metadata.empty()),
    new StructField("genre", DataTypes.StringType, true, Metadata.empty()),
    new StructField("price", DataTypes.DoubleType, true, Metadata.empty()),
    new StructField("publish_date", DataTypes.StringType, true, Metadata.empty()),
    new StructField("title", DataTypes.StringType, true, Metadata.empty())
});

HashMap<String, String> readOptions = new HashMap<String, String>();
readOptions.put("rowTag", "book");
readOptions.put("path", "books.xml");
DataFrame df = sqlContext.load("com.databricks.spark.xml", customSchema, readOptions);

HashMap<String, String> writeOptions = new HashMap<String, String>();
writeOptions.put("rowTag", "book");
writeOptions.put("rootTag", "books");
writeOptions.put("path", "newbooks.xml");
df.select("author", "_id").save("com.databricks.spark.xml", SaveMode.ErrorIfExists, writeOptions);
```

### Python API

__Spark 1.4+:__

```python
from pyspark.sql import SQLContext
@@ -402,45 +313,13 @@ df.select("author", "_id").write \
.save('newbooks.xml')
```


__Spark 1.3:__

```python
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.load(source="com.databricks.spark.xml", rowTag='book', path='books.xml')
df.select("author", "_id").save(source="com.databricks.spark.xml", path='newbooks.xml',
                                rootTag='books', rowTag='book')
```

You can manually specify schema:
```python
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("_id", StringType(), True),
    StructField("author", StringType(), True),
    StructField("description", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("publish_date", StringType(), True),
    StructField("title", StringType(), True)])

df = sqlContext.load(source="com.databricks.spark.xml", rowTag='book', schema=customSchema, path='books.xml')
df.select("author", "_id").save(source="com.databricks.spark.xml", path='newbooks.xml',
                                rootTag='books', rowTag='book')
```


### R API
__Spark 1.4+:__

Automatically infer schema (data types)
```R
library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:0.3.4" "sparkr-shell"')
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:0.4.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "books.xml", source = "com.databricks.spark.xml", rowTag = "book")
@@ -453,7 +332,7 @@ You can manually specify schema:
```R
library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:0.3.4" "sparkr-shell"')
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:0.4.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)
customSchema <- structType(
structField("_id", "string"),
build.sbt: 2 changes (1 addition, 1 deletion)
@@ -1,6 +1,6 @@
name := "spark-xml"

version := "0.4.0-SNAPSHOT"
version := "0.4.0"

organization := "com.databricks"
