

Spark Dataframe, Dataset

Came across an article that explains DataFrames and Datasets well, so I'm jotting it down here.

I'll leave a proper summary of the article for later; for now, just a memo.

Bonus 1. https://phoenixnap.com/kb/rdd-vs-dataframe-vs-dataset

       
|  | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Release version | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Data Representation | Distributed collection of elements. | Distributed collection of data organized into columns. | Combination of RDD and DataFrame. |
| Data Formats | Structured and unstructured. | Structured and semi-structured. | Structured and unstructured. |
| Data Sources | Various data sources. | Various data sources. | Various data sources. |
| Immutability and Interoperability | Immutable partitions that easily transform into DataFrames. | Transforming into a DataFrame loses the original RDD. | The original RDD regenerates after transformation. |
| Compile-time Type Safety | Yes. | No; errors are detected at runtime. | Yes. |
| Optimization | No built-in optimization engine; each RDD is optimized individually. | Query optimization through the Catalyst optimizer. | Query optimization through the Catalyst optimizer, like DataFrames. |
| Serialization | Uses Java serialization, which is expensive; both the data and its structure are sent between nodes. | No Java serialization or encoding needed; serialization happens in memory in binary format. | An Encoder converts between JVM objects and Spark's tabular format, which is faster than Java serialization. |
| Garbage Collection | Creating and destroying individual objects incurs GC overhead. | Avoids GC when creating or destroying objects. | No GC overhead. |
| Efficiency | Reduced by serializing individual objects. | In-memory serialization reduces overhead; operations run on serialized data without deserialization. | Individual attributes can be accessed without deserializing the whole object. |
| Lazy Evaluation | Yes. | Yes. | Yes. |
| Programming Language Support | Java, Scala, Python, R | Java, Scala, Python, R | Java, Scala |
| Schema Projection | Schemas must be defined manually. | Auto-discovery of file schemas. | Auto-discovery of file schemas. |
| Aggregation | Hard; simple aggregations and grouping are slow. | Fast for exploratory analysis; aggregate statistics on large datasets run quickly. | Fast aggregation on numerous datasets. |
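To see the compile-time type safety row in action: the same mistake is a compile error on a Dataset but only a runtime failure on a DataFrame. A minimal sketch (the `Person` case class and `TypeSafetyDemo` object are my own illustration, not from the linked article):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

object TypeSafetyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("TypeSafetyDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("one", 23)).toDS()
    val df: DataFrame = ds.toDF()

    // Dataset: field names are checked by the Scala compiler.
    // ds.map(_.salary)           // does not compile: value salary is not a member of Person
    ds.map(_.age + 1).show()

    // DataFrame: a bad column name compiles fine and only fails at runtime
    // with org.apache.spark.sql.AnalysisException.
    // df.select("salary").show()
    df.select("age").show()

    spark.stop()
  }
}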

Bonus 2.

Typed out the sample code introduced in the video above, as practice.

  1. Create a Dataset from a sequence of elements
  2. Create a Dataset from a sequence of case classes
  3. Create a Dataset from an RDD
  4. Create a Dataset from a DataFrame
import org.apache.spark.sql.SparkSession

case class Employee(name:String, age:Int)

object Ds4Ways {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Dataset 4 Ways")
      .master("local[*]")
      .getOrCreate

    import spark.implicits._
    // 1. Dataset from a sequence of elements (single column named "value").
    val numSeq = Seq(1, 2, 3, 4, 5)
    val numDs = numSeq.toDS()
    numDs.show

    // 2. Dataset from a sequence of case classes (columns from the fields).
    val empSeq = Seq(Employee("one", 23), Employee("two", 34), Employee("three", 54))
    val empSeqDs = empSeq.toDS()
    empSeqDs.show

    // 3. Dataset from an RDD of tuples (columns named _1, _2).
    val rdd = spark.sparkContext.parallelize(Seq((1, "spark"), (2, "Hive")))
    val rddDs = rdd.toDS()
    rddDs.show

    // 4. Dataset from a DataFrame, recovering the type via as[Employee].
    val empRdd = spark.sparkContext.parallelize(empSeq)
    val empDf = empRdd.toDF()
    val empDfDs = empDf.as[Employee]
    empDfDs.show

    spark.stop()
  }
}
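As a side note, the conversions also run the other way. A minimal sketch, assuming the same `spark` session and the `empSeqDs` from the code above:

// Dataset -> DataFrame: drops to untyped Rows but keeps the schema.
val empDfBack = empSeqDs.toDF()
empDfBack.printSchema()

// Dataset -> RDD: back to an RDD of Employee objects.
val empRddBack = empSeqDs.rdd
empRddBack.collect().foreach(println)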

Bonus 3. Something I saw on Stack Overflow.

// Spark < 2.x: toDS is available via sqlContext.implicits._
import sqlContext.implicits._
val myrdd = testRDD.toDS()

// Spark >= 2.x
val spark: SparkSession = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val myrdd = testRDD.toDS()
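
For completeness, `SparkSession.createDataset` does the same thing without the implicit conversion, taking an `Encoder` explicitly. A minimal sketch with the setup filled in (the `local[*]` session and the sample `testRDD` are my own assumptions):

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val testRDD = spark.sparkContext.parallelize(Seq(1, 2, 3))

// Equivalent to testRDD.toDS(), with the encoder passed explicitly.
val myds: Dataset[Int] = spark.createDataset(testRDD)(Encoders.scalaInt)
myds.show()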