- Set up an RDS MySQL instance
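If you prefer the CLI to the console, a minimal sketch with aws-cli (the instance identifier, instance class, storage size, and security group below are placeholder choices, not requirements):
aws rds create-db-instance \
  --db-instance-identifier employees-db \
  --engine mysql \
  --db-instance-class db.t3.medium \
  --master-username admin \
  --master-user-password <rds_password> \
  --allocated-storage 20 \
  --vpc-security-group-ids <sg_id>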
- Set up the employees db:
git clone https://github.com/datacharmer/test_db.git employees_db
cd employees_db
mysql -h <rds_endpoint> -P 3306 -u admin -p < employees.sql
mysql -h <rds_endpoint> -P 3306 -u admin -p -t < test_employees_sha.sql
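As an optional sanity check (the employees table is part of the test_db schema loaded above), verify the load completed:
mysql -h <rds_endpoint> -P 3306 -u admin -p -e "select count(*) from employees.employees"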
- Create the target Hive database:
hive -e "create database employees"
- These Sqoop commands set up the lineage in Atlas from MySQL to Hive:
sqoop import --hive-database employees --table employees --connect jdbc:mysql://<rds_endpoint>:3306/employees --username <rds_username> --password <rds_password> --hive-import -m 1
sqoop import --hive-database employees --table departments --connect jdbc:mysql://<rds_endpoint>:3306/employees --username <rds_username> --password <rds_password> --hive-import -m 1
sqoop import --hive-database employees --table dept_emp --connect jdbc:mysql://<rds_endpoint>:3306/employees --username <rds_username> --password <rds_password> --hive-import -m 1
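Once the imports finish, the tables should be queryable in Hive; a quick check:
hive -e "use employees; show tables; select count(*) from employees"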
Note: If the Sqoop jobs fail with 'Argument list too long', it is because of the long HADOOP_CLASSPATH set by Sqoop. Add the following to the configure-sqoop script in <sqoop_home>/bin, just before the final export commands:
# drop duplicate entries from HADOOP_CLASSPATH, preserving order, to shorten the argument list
HADOOP_CLASSPATH=$(echo "$HADOOP_CLASSPATH" | perl -ne 'chomp; for (split ":") { push @p, $_ unless $seen{$_}++ } print join(":", @p)')
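The one-liner can be sanity-checked in isolation with a hypothetical classpath:
echo 'a:b:a:c:b' | perl -ne 'chomp; for (split ":") { push @p, $_ unless $seen{$_}++ } print join(":", @p)'
# prints a:b:c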
- Run the Spark ETL job, which creates the new Hive table emp_dept_flat: https://github.com/bernhard-42/Spark-ETL-Atlas/blob/master/spark-etl.scala. Note down the Spark job id (one way to run it and capture the id is sketched below).
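A minimal sketch, assuming a YARN cluster and spark-shell on the PATH (the raw URL is derived from the repo link above; the application id shown in the comment is hypothetical):
wget https://raw.githubusercontent.com/bernhard-42/Spark-ETL-Atlas/master/spark-etl.scala
spark-shell --master yarn -i spark-etl.scala
# the application id (e.g. application_1500000000000_0001) appears in the startup log;
# it can also be listed after the fact:
yarn application -list -appStates FINISHED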
- Import the Hive table emp_dept_flat into Atlas:
<atlas_home>/hook-bin/import-hive.sh
Note: If import-hive.sh fails with ClassNotFound errors, set:
a. export ATLASCPPATH=<atlas_home>/server/webapp/atlas/WEB-INF/lib/*
b. export HADOOP_CLASSPATH=`hadoop classpath`
- Create the Spark lineage:
a. Download atlas-lineage-0.1.jar from https://github.com/shwethags/atlas-lineage/blob/master/atlas-lineage-0.1.jar
b. export HADOOP_CLASSPATH=<atlas_home>/server/webapp/atlas/WEB-INF/lib/*:atlas-lineage-0.1.jar
c. export HADOOP_USER_CLASSPATH_FIRST=true
d. hadoop jar atlas-lineage-0.1.jar org.apache.atlas.lineage.SparkLineage -a <atlas_endpoint> -c <atlas_cluster_name> -j <spark_job_id> -p <spark_process_name>
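For example, with illustrative values (the endpoint host, cluster name, and process name below are hypothetical; 21000 is the default Atlas port, and the job id is the one noted from the Spark run):
hadoop jar atlas-lineage-0.1.jar org.apache.atlas.lineage.SparkLineage -a http://atlas-host:21000 -c primary -j application_1500000000000_0001 -p emp_dept_flat_etl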
- This creates lineage like the one shown below: