
Staging #10

Merged
merged 8 commits on Feb 8, 2024
3 changes: 3 additions & 0 deletions .gitignore
@@ -7,3 +7,6 @@ project
target

docs/build
.bsp
.idea
.DS_Store
6 changes: 6 additions & 0 deletions README.md
@@ -5,6 +5,7 @@ This library contains several APIs to read data from various sources of differen
This library supports the below source systems:

* Text
* Excel

## text

@@ -17,3 +18,8 @@ Supported text formats are:
* HTML Table

Please see the detailed documentation [here](text/README.md).

## excel

Users can use this library to read data from an Excel file and parse it into a Spark dataframe.
Please see the detailed documentation [here](excel/README.md).
25 changes: 24 additions & 1 deletion build.sbt
@@ -50,6 +50,8 @@ val scalaParserCombinatorsVersion = "2.3.0"
val sparkVersion = "3.4.1"
val sparkXMLVersion = "0.16.0"
val zioConfigVersion = "4.0.0-RC16"
val crealyticsVersion = "3.4.1_0.19.0"
val poiVersion = "5.2.5"

// ----- TOOL DEPENDENCIES ----- //

@@ -80,6 +82,14 @@ val zioConfigDependencies = Seq(
  "dev.zio" %% "zio-config-magnolia" % zioConfigVersion
).map(_ excludeAll ("org.scala-lang.modules", "scala-collection-compat"))

val crealyticsDependencies = Seq(
  "com.crealytics" %% "spark-excel" % crealyticsVersion
).map(_.cross(CrossVersion.for3Use2_13))

val poiDependencies = Seq(
  "org.apache.poi" % "poi" % poiVersion
)

// ----- MODULE DEPENDENCIES ----- //

val textDependencies =
@@ -89,17 +99,30 @@ val textDependencies =
    sparkXMLDependencies ++
    zioConfigDependencies

val excelDependencies =
  dataScalaxyTestUtilDependencies ++
    crealyticsDependencies ++
    poiDependencies ++
    sparkDependencies ++
    zioConfigDependencies

// ----- PROJECTS ----- //

lazy val `data-scalaxy-reader` = (project in file("."))
  .settings(
    publish / skip := true,
    publishLocal / skip := true
  )
  .aggregate(`reader-text`, `reader-excel`)

lazy val `reader-text` = (project in file("text"))
  .settings(
    version := "2.0.0",
    libraryDependencies ++= textDependencies
  )

lazy val `reader-excel` = (project in file("excel"))
  .settings(
    version := "1.0.0",
    libraryDependencies ++= excelDependencies
  )
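One detail worth noting from the dependency list above: at the time of this PR, spark-excel does not appear to publish Scala 3 artifacts, so the Scala 3 build pulls in the Scala 2.13 artifact via `CrossVersion.for3Use2_13`. A minimal sbt sketch of the same pattern (version string as used above):

```scala
// build.sbt sketch: use the _2.13 artifact of spark-excel from a Scala 3 project.
// Scala 3 and 2.13 are binary-compatible, so this works as long as the
// dependency does not leak Scala 2 macros into the API surface you use.
libraryDependencies += ("com.crealytics" %% "spark-excel" % "3.4.1_0.19.0")
  .cross(CrossVersion.for3Use2_13)
```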
59 changes: 59 additions & 0 deletions excel/README.md
@@ -0,0 +1,59 @@
# Excel

Users need to add the below dependency to the `build.sbt` file:

```scala
ThisBuild / resolvers += "Github Repo" at "https://maven.pkg.github.com/teamclairvoyant/data-scalaxy-reader/"

ThisBuild / credentials += Credentials(
  "GitHub Package Registry",
  "maven.pkg.github.com",
  System.getenv("GITHUB_USERNAME"),
  System.getenv("GITHUB_TOKEN")
)

ThisBuild / libraryDependencies += "com.clairvoyant.data.scalaxy" %% "reader-excel" % "1.0.0"
```

Make sure you add `GITHUB_USERNAME` and `GITHUB_TOKEN` as environment variables.

`GITHUB_TOKEN` is a GitHub Personal Access Token with permission to read packages.

## API

The library provides the below `read` API in the `ExcelToDataFrameReader` object to parse an Excel file into a Spark dataframe:

```scala
def read(
  bytes: Array[Byte],
  excelFormat: ExcelFormat,
  originalSchema: Option[StructType] = None,
  adaptSchemaColumns: StructType => StructType = identity
)(using sparkSession: SparkSession): DataFrame
```

The `read` method takes the below arguments:

| Argument Name      | Default Value | Description                                                                  |
|:-------------------|:-------------:|:-----------------------------------------------------------------------------|
| bytes              |       -       | The content of the Excel file, as bytes, to be parsed into the dataframe.     |
| excelFormat        |       -       | The `ExcelFormat` representation of the format of the Excel file.             |
| originalSchema     |     None      | The schema for the dataframe. If not provided, it is inferred from the data.  |
| adaptSchemaColumns |   identity    | The function to modify the inferred schema of the dataframe.                  |
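For illustration, a minimal usage sketch follows. The file path, sheet address, and local-master Spark session are assumptions for the example, not part of the library:

```scala
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession

import com.clairvoyant.data.scalaxy.reader.excel.{ExcelFormat, ExcelToDataFrameReader}

object ReadExcelSketch:

  def main(args: Array[String]): Unit =
    // Hypothetical local session; any SparkSession in given scope works.
    given SparkSession = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical input file; the API itself only needs the raw bytes.
    val bytes: Array[Byte] = Files.readAllBytes(Paths.get("data/report.xlsx"))

    val df = ExcelToDataFrameReader.read(
      bytes,
      excelFormat = ExcelFormat(header = true, dataAddress = "'Sheet1'!A1:D100")
    )

    df.printSchema()
    df.show()
```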

Users can provide the below options to the `ExcelFormat` instance:

| Parameter Name                |     Default Value     | Description                                                                                                                                                                                                                                                                                                                                                                                                       |
|:------------------------------|:---------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| header                        |         true          | Boolean flag indicating whether the given Excel sheet contains a header row.                                                                                                                                                                                                                                                                                                                                       |
| dataAddress                   |          A1           | The location of the data to read from. The following address styles are supported: <br/> `B3`: Start cell of the data. Returns all rows below and all columns to the right. <br/> `B3:F35`: Cell range of the data. Reading returns only the rows and columns in the specified range. <br/> `'My Sheet'!B3:F35`: Same as above, but with a specific sheet. <br/> `MyTable[#All]`: Table of data. Returns all rows and columns in this table. |
| treatEmptyValuesAsNulls       |         true          | Treats empty cells as null.                                                                                                                                                                                                                                                                                                                                                                                        |
| setErrorCellsToFallbackValues |         false         | If false, ERROR cells are converted to null. If true, any ERROR cell values (e.g. `#N/A`) are converted to the zero value of the column's data type.                                                                                                                                                                                                                                                                |
| usePlainNumberFormat          |         false         | If true, formats the cells without rounding and scientific notation.                                                                                                                                                                                                                                                                                                                                               |
| inferSchema                   |         false         | Infers the input schema automatically from the data.                                                                                                                                                                                                                                                                                                                                                               |
| addColorColumns               |         false         | If true, adds a colour-format column for each field.                                                                                                                                                                                                                                                                                                                                                               |
| timestampFormat               | "yyyy-mm-dd hh:mm:ss" | String timestamp format.                                                                                                                                                                                                                                                                                                                                                                                           |
| excerptSize                   |          10           | The number of rows to infer the schema from, when schema inference is enabled.                                                                                                                                                                                                                                                                                                                                     |
| maxRowsInMemory               |         None          | If set, uses a streaming reader, which can help with big files (fails if used with `.xls` format files).                                                                                                                                                                                                                                                                                                           |
| maxByteArraySize              |         None          | See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-                                                                                                                                                                                                                                                                                                               |
| tempFileThreshold             |         None          | The number of bytes at which a zip entry is regarded as too large to hold in memory, and its data is put in a temp file instead.                                                                                                                                                                                                                                                                                    |
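The three `Option`-valued parameters at the bottom of the table are only forwarded to the underlying reader when set. A standalone sketch of that filtering logic (plain Scala, with made-up values):

```scala
// Only settings that are Some(...) become reader options; None entries are dropped.
val optionalSettings: Map[String, Option[Long]] = Map(
  "maxRowsInMemory"   -> Some(1000L),
  "maxByteArraySize"  -> None,
  "tempFileThreshold" -> None
)

val readerOptions: Map[String, String] =
  optionalSettings.collect { case (name, Some(value)) => (name, value.toString) }

// readerOptions contains only "maxRowsInMemory" -> "1000"
```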
@@ -0,0 +1,19 @@
package com.clairvoyant.data.scalaxy.reader.excel

import zio.config.derivation.nameWithLabel

@nameWithLabel
case class ExcelFormat(
  header: Boolean = true,
  dataAddress: String = "A1",
  treatEmptyValuesAsNulls: Boolean = true,
  setErrorCellsToFallbackValues: Boolean = false,
  usePlainNumberFormat: Boolean = false,
  inferSchema: Boolean = false,
  addColorColumns: Boolean = false,
  timestampFormat: String = "yyyy-mm-dd hh:mm:ss",
  excerptSize: Int = 10,
  maxRowsInMemory: Option[Long] = None,
  maxByteArraySize: Option[Long] = None,
  tempFileThreshold: Option[Long] = None
)
@@ -0,0 +1,75 @@
package com.clairvoyant.data.scalaxy.reader.excel

import org.apache.poi.xssf.usermodel.XSSFWorkbook
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, SparkSession}

import java.io.{ByteArrayInputStream, File, FileOutputStream, PrintWriter}

object ExcelToDataFrameReader {

  def read(
      bytes: Array[Byte],
      excelFormat: ExcelFormat,
      originalSchema: Option[StructType] = None,
      adaptSchemaColumns: StructType => StructType = identity
  )(using sparkSession: SparkSession): DataFrame =

    import sparkSession.implicits.*

    def saveBytesToTempExcelFiles(bytes: Array[Byte]) = {
      val workbook = new XSSFWorkbook(new ByteArrayInputStream(bytes))

      val file = File.createTempFile("excel-data-", ".xlsx")
      file.deleteOnExit()
      val fileOut = new FileOutputStream(file)
      new PrintWriter(file) {
        try {
          workbook.write(fileOut)
        } finally {
          close()
        }
      }
      file
    }

    val tempExcelFile = saveBytesToTempExcelFiles(bytes)

    val excelDataFrameReader = sparkSession.read
      .format("com.crealytics.spark.excel")
      .options(
        Map(
          "header" -> excelFormat.header,
          "dataAddress" -> excelFormat.dataAddress,
          "treatEmptyValuesAsNulls" -> excelFormat.treatEmptyValuesAsNulls,
          "setErrorCellsToFallbackValues" -> excelFormat.setErrorCellsToFallbackValues,
          "usePlainNumberFormat" -> excelFormat.usePlainNumberFormat,
          "inferSchema" -> excelFormat.inferSchema,
          "addColorColumns" -> excelFormat.addColorColumns,
          "timestampFormat" -> excelFormat.timestampFormat,
          "excerptSize" -> excelFormat.excerptSize
        ).map((optionName, optionValue) => (optionName, optionValue.toString))
      )
      .options(
        Map(
          "maxRowsInMemory" -> excelFormat.maxRowsInMemory,
          "maxByteArraySize" -> excelFormat.maxByteArraySize,
          "tempFileThreshold" -> excelFormat.tempFileThreshold
        ).collect { case (optionName, Some(optionValue)) =>
          (optionName, optionValue.toString)
        }
      )

    excelDataFrameReader
      .schema {
        originalSchema.getOrElse {
          adaptSchemaColumns {
            excelDataFrameReader
              .load(tempExcelFile.getAbsolutePath)
              .schema
          }
        }
      }
      .load(tempExcelFile.getAbsolutePath)

}
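A side note on the `read` implementation above: when `originalSchema` is empty, the file is loaded once to infer the schema (optionally adapted by `adaptSchemaColumns`) and then again for the actual read. The resolution order reduces to a plain `getOrElse`, sketched here with stand-in string values in place of real `StructType` schemas:

```scala
// Stand-in values; in the real method these are StructType schemas.
val originalSchema: Option[String] = None
val inferredSchema = "inferred-schema"
val adaptSchemaColumns: String => String = _.toUpperCase

// Same resolution order as read(): an explicit schema wins,
// otherwise the inferred schema is passed through the adapter.
val resolved = originalSchema.getOrElse(adaptSchemaColumns(inferredSchema))
// resolved == "INFERRED-SCHEMA"
```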
Binary file added excel/src/test/resources/sample_data.xlsx
@@ -0,0 +1,52 @@
package com.clairvoyant.data.scalaxy.reader.excel

import com.clairvoyant.data.scalaxy.test.util.matchers.DataFrameMatcher
import com.clairvoyant.data.scalaxy.test.util.readers.DataFrameReader

import java.io.FileInputStream
import scala.util.Using

class ExcelToDataFrameReaderSpec extends DataFrameReader with DataFrameMatcher {

  "read() - with excel filepath" should "return a dataframe with correct count and schema" in {

    val expectedDF = readJSONFromText(
      """
        |[
        |  {
        |    "Created": "2021-07-29 10:35:12",
        |    "Advertiser": "Zola",
        |    "Transaction ID": "1210730000580100000",
        |    "Earnings": "$0.68",
        |    "SID": "wlus9",
        |    "Status": "CONFIRMED",
        |    "ClickPage": "https://www.zola.com/"
        |  },
        |  {
        |    "Created": "2022-04-18 07:23:54",
        |    "Advertiser": "TradeInn",
        |    "Transaction ID": "1220419021230020000",
        |    "Earnings": "$12.48",
        |    "SID": "wles7",
        |    "Status": "CONFIRMED",
        |    "ClickPage": "https://www.tradeinn.com/"
        |  }
        |]
        |""".stripMargin
    )

    val file = new java.io.File("excel/src/test/resources/sample_data.xlsx")
    val byteArray: Array[Byte] =
      Using(new FileInputStream(file)) { fis =>
        val byteArray = new Array[Byte](file.length.toInt)
        fis.read(byteArray)
        byteArray
      }.get

    ExcelToDataFrameReader.read(
      byteArray,
      ExcelFormat(dataAddress = "'Transactions Report'!A2:G4")
    ) should matchExpectedDataFrame(expectedDF)
  }

}