

Spark Dataframe, Dataset

Came across an article that explains DataFrames and Datasets well, so I'm jotting it down here.

I'll leave a proper summary of the article for later; for now, just a memo.

Bonus 1. https://phoenixnap.com/kb/rdd-vs-dataframe-vs-dataset

       
|  | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Release version | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Data Representation | Distributed collection of elements. | Distributed collection of data organized into columns. | Combination of RDD and DataFrame. |
| Data Formats | Structured and unstructured. | Structured and semi-structured. | Structured and unstructured. |
| Data Sources | Various data sources. | Various data sources. | Various data sources. |
| Immutability and Interoperability | Immutable partitions that easily transform into DataFrames. | Transforming into a DataFrame loses the original RDD. | The original RDD regenerates after transformation. |
| Compile-time Type Safety | Yes. | No; errors are detected at runtime. | Yes. |
| Optimization | No built-in optimization engine; each RDD is optimized individually. | Query optimization through the Catalyst optimizer. | Query optimization through the Catalyst optimizer, like DataFrames. |
| Serialization | Uses Java serialization, which is expensive; both the data and its structure are sent between nodes. | No Java serialization or encoding needed; serialization happens in memory in binary format. | An Encoder converts between JVM objects and Spark's tabular format, which is faster than Java serialization. |
| Garbage Collection | Creating and destroying individual objects incurs GC overhead. | Avoids GC when creating or destroying objects. | No GC overhead. |
| Efficiency | Reduced by serializing individual objects. | In-memory serialization reduces overhead; operations run on serialized data without deserialization. | Individual attributes can be accessed without deserializing the whole object. |
| Lazy Evaluation | Yes. | Yes. | Yes. |
| Programming Language Support | Java, Scala, Python, R | Java, Scala, Python, R | Java, Scala |
| Schema Projection | Schemas must be defined manually. | Auto-discovery of file schemas. | Auto-discovery of file schemas. |
| Aggregation | Hard; simple aggregations and grouping are slow. | Fast for exploratory analysis; aggregate statistics on large datasets run quickly. | Fast aggregation on numerous datasets. |
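To see the compile-time type safety row in action: the same mistake is a compile error on a Dataset but only a runtime failure on a DataFrame. A minimal sketch (the `Person` case class and `TypeSafetyDemo` object are my own illustration, not from the linked article):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

object TypeSafetyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("TypeSafetyDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("one", 23)).toDS()
    val df: DataFrame = ds.toDF()

    // Dataset: field names are checked by the Scala compiler.
    // ds.map(_.salary)           // does not compile: value salary is not a member of Person
    ds.map(_.age + 1).show()

    // DataFrame: a bad column name compiles fine and only fails at runtime
    // with org.apache.spark.sql.AnalysisException.
    // df.select("salary").show()
    df.select("age").show()

    spark.stop()
  }
}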

Bonus 2.

Typed out the sample code introduced in the video above, as practice.

  1. Create a Dataset from a sequence of elements
  2. Create a Dataset from a sequence of case classes
  3. Create a Dataset from an RDD
  4. Create a Dataset from a DataFrame
import org.apache.spark.sql.SparkSession

case class Employee(name:String, age:Int)

object Ds4Ways {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Dataset 4 Ways")
      .master("local[*]")
      .getOrCreate

    import spark.implicits._
    // 1. Dataset from a sequence of elements (single column named "value").
    val numSeq = Seq(1, 2, 3, 4, 5)
    val numDs = numSeq.toDS()
    numDs.show

    // 2. Dataset from a sequence of case classes (columns from the fields).
    val empSeq = Seq(Employee("one", 23), Employee("two", 34), Employee("three", 54))
    val empSeqDs = empSeq.toDS()
    empSeqDs.show

    // 3. Dataset from an RDD of tuples (columns named _1, _2).
    val rdd = spark.sparkContext.parallelize(Seq((1, "spark"), (2, "Hive")))
    val rddDs = rdd.toDS()
    rddDs.show

    // 4. Dataset from a DataFrame, recovering the type via as[Employee].
    val empRdd = spark.sparkContext.parallelize(empSeq)
    val empDf = empRdd.toDF()
    val empDfDs = empDf.as[Employee]
    empDfDs.show

    spark.stop()
  }
}
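As a side note, the conversions also run the other way. A minimal sketch, assuming the same `spark` session and the `empSeqDs` from the code above:

// Dataset -> DataFrame: drops to untyped Rows but keeps the schema.
val empDfBack = empSeqDs.toDF()
empDfBack.printSchema()

// Dataset -> RDD: back to an RDD of Employee objects.
val empRddBack = empSeqDs.rdd
empRddBack.collect().foreach(println)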

Bonus 3. Something I saw on Stack Overflow.

// Spark < 2.x: toDS is available via sqlContext.implicits._
import sqlContext.implicits._
val myrdd = testRDD.toDS()

// Spark >= 2.x
val spark: SparkSession = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val myrdd = testRDD.toDS()
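
For completeness, `SparkSession.createDataset` does the same thing without the implicit conversion, taking an `Encoder` explicitly. A minimal sketch with the setup filled in (the `local[*]` session and the sample `testRDD` are my own assumptions):

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val testRDD = spark.sparkContext.parallelize(Seq(1, 2, 3))

// Equivalent to testRDD.toDS(), with the encoder passed explicitly.
val myds: Dataset[Int] = spark.createDataset(testRDD)(Encoders.scalaInt)
myds.show()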