hw0 #10

12 changes: 12 additions & 0 deletions hw0/gusarov/build.sbt
@@ -0,0 +1,12 @@
name := "gusarov"

version := "0.1"

scalaVersion := "2.12.10"


libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
// https://mvnrepository.com/artifact/ch.hsr/geohash
libraryDependencies += "ch.hsr" % "geohash" % "1.4.0"

49,081 changes: 49,081 additions & 0 deletions hw0/gusarov/data/AB_NYC_2019.csv

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions hw0/gusarov/results.md
@@ -0,0 +1,65 @@
## Results

The CSV path in the code is spelled out in full because, for some reason, the default working directory is `/home` rather than the project directory.
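One way to avoid depending on the working directory is to resolve the data file against an explicit base directory. The snippet below is a minimal sketch, not part of the submitted code: `resolveData` is a hypothetical helper, and the `data/AB_NYC_2019.csv` layout under the base directory is an assumption taken from the file list of this change.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper: resolve a data file against an explicit base directory
// instead of relying on whatever the JVM's working directory happens to be.
def resolveData(baseDir: String, relative: String): Path =
  Paths.get(baseDir).resolve(relative).toAbsolutePath.normalize()

// The directory that relative paths are resolved against in this JVM.
val workingDir: String = System.getProperty("user.dir")
println(s"working directory: $workingDir")

// Assumed layout: data/AB_NYC_2019.csv directly under the base directory.
val csvPath: Path = resolveData(workingDir, "data/AB_NYC_2019.csv")
println(s"csv path: $csvPath")
```

If the worksheet really does start in `/home`, the project directory would have to be passed as `baseDir` instead of `workingDir`; the resulting `csvPath.toString` can then be handed to `spark.read ... .csv(...)`.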

+---------------+------------------+
|      room_type|        avg(price)|
+---------------+------------------+
|    Shared room| 70.12758620689655|
|Entire home/apt|211.79424613325986|
|   Private room| 89.78097285675894|
+---------------+------------------+

+---------------+----+
|      room_type|mode|
+---------------+----+
|    Shared room|  35|
|Entire home/apt| 150|
|   Private room|  50|
+---------------+----+

+---------------+---+
|      room_type|med|
+---------------+---+
|    Shared room| 45|
|Entire home/apt|160|
|   Private room| 70|
+---------------+---+

+---------------+------------------+
|      room_type|        dispersion|
+---------------+------------------+
|    Shared room|10348.026848353218|
|Entire home/apt| 80679.63674113646|
|   Private room| 25665.72608991132|
+---------------+------------------+

MOST EXPENSIVE OFFER
+--------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+
|      id|                name|host_id|host_name|neighbourhood_group|neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|
+--------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+
|13894339|Luxury 1 bedroom ...|5143901|     Erin|           Brooklyn|   Greenpoint| 40.7326|-73.95739|Entire home/apt|10000|             5|                5| 2017-07-27|             0.16|                             1|               0|
+--------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+
only showing top 1 row

CHEAPEST OFFER
+--------+--------------------+-------+---------+-------------------+------------------+--------+---------+------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+
|      id|                name|host_id|host_name|neighbourhood_group|     neighbourhood|latitude|longitude|   room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|
+--------+--------------------+-------+---------+-------------------+------------------+--------+---------+------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+
|18750597|Huge Brooklyn Bro...|8993084| Kimberly|           Brooklyn|Bedford-Stuyvesant|40.69023|-73.95428|Private room|    0|             4|                1| 2018-01-06|             0.05|                             4|              28|
+--------+--------------------+-------+---------+-------------------+------------------+--------+---------+------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+
only showing top 1 row

res8: Double = 0.04279933414330361


res9: Double = -0.04795422658266245


geohash_udf: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction(<function>,StringType,List(Some(class[value[0]: double]), Some(class[value[0]: double]), Some(class[value[0]: int])),None,true,true)
+-------+----------+
|geohash|avg(price)|
+-------+----------+
|  dr5wf|     350.0|
+-------+----------+
only showing top 1 row
48 changes: 48 additions & 0 deletions hw0/gusarov/src/main/scala/HomeWork.sc
@@ -0,0 +1,48 @@
import ch.hsr.geohash.GeoHash
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, IntegerType}

val spark = SparkSession
  .builder()
  .appName("hw0")
  .config("spark.master", "local")
  .getOrCreate()

// Read the CSV (quoted fields may span several lines) and cast price to an integer.
val df = spark.read
  .options(Map("header" -> "true", "escape" -> "\"", "multiLine" -> "true"))
  .csv("Документы/opensource_coding/bigData2020/hw0/gusarov/data/AB_NYC_2019.csv")
  .withColumn("price", col("price").cast(IntegerType))

df.groupBy("room_type").mean("price").show()

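// Mode per room type. A sort before groupBy does not guarantee which row
// first() sees, so take the max of a (count, price) struct instead:
// the highest count wins, and ties go to the higher price.
df.groupBy("room_type", "price")
  .count()
  .groupBy("room_type")
  .agg(max(struct(col("count"), col("price"))).alias("top"))
  .select(col("room_type"), col("top.price").alias("mode"))
  .show()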

// Approximate median per room type.
df.groupBy("room_type")
  .agg(expr("percentile_approx(price, 0.5)").alias("med"))
  .show()

// Sample variance per room type (equivalent to stddev squared).
df.groupBy("room_type")
  .agg(variance("price").alias("dispersion"))
  .show()

println("MOST EXPENSIVE OFFER")
df.sort(desc("price")).show(1)
println("CHEAPEST OFFER")
df.sort("price").show(1)

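// Pearson correlation between price and minimum_nights.
df.withColumn("minimum_nights", col("minimum_nights").cast(IntegerType))
  .stat.corr("price", "minimum_nights")

// Pearson correlation between price and number_of_reviews.
df.withColumn("number_of_reviews", col("number_of_reviews").cast(IntegerType))
  .stat.corr("price", "number_of_reviews")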

// Find the geohash cell (5-character precision) with the highest average price.
val geohash_udf = udf(GeoHash.geoHashStringWithCharacterPrecision _)
df.withColumn("latitude", col("latitude").cast(DoubleType))
  .withColumn("longitude", col("longitude").cast(DoubleType))
  .withColumn("geohash", geohash_udf(col("latitude"), col("longitude"), lit(5)))
  .groupBy("geohash")
  .mean("price")
  .sort(desc("avg(price)"))
  .show(1)