Found a well-written article explaining DataFrames and Datasets, so I'm leaving a note here.
I'll save a proper summary of the article for later; for now, just the memo.
Bonus 1. https://phoenixnap.com/kb/rdd-vs-dataframe-vs-dataset
| | RDD | DataFrame | Dataset |
|---|---|---|---|
| Release version | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Data representation | Distributed collection of elements. | Distributed collection of data organized into columns. | Combination of RDD and DataFrame. |
| Data formats | Structured and unstructured. | Structured and semi-structured. | Structured and unstructured. |
| Data sources | Various data sources. | Various data sources. | Various data sources. |
| Immutability and interoperability | Immutable partitions that easily transform into DataFrames. | Transforming into a DataFrame loses the original RDD. | The original RDD can be regenerated after transformation. |
| Compile-time type safety | Compile-time type safety available. | No compile-time type safety; errors are detected at runtime. | Compile-time type safety available. |
| Optimization | No built-in optimization engine; each RDD is optimized individually. | Query optimization through the Catalyst optimizer. | Query optimization through the Catalyst optimizer, like DataFrames. |
| Serialization | Uses expensive Java serialization; both the data and its structure must be sent between nodes. | No Java serialization or encoding needed; serialization happens in memory, in binary format. | An Encoder handles conversion between JVM objects and the tabular form, which is faster than Java serialization. |
| Garbage collection | Creating and destroying individual objects causes GC overhead. | Avoids garbage collection when creating or destroying objects. | No garbage collection needed. |
| Efficiency | Efficiency drops because individual objects must be serialized. | In-memory serialization reduces overhead; operations run on serialized data without deserialization. | Individual attributes can be accessed without deserializing the whole object. |
| Lazy evaluation | Yes. | Yes. | Yes. |
| Programming language support | Java, Scala, Python, R | Java, Scala, Python, R | Java, Scala |
| Schema projection | Schemas must be defined manually. | Auto-discovery of file schemas. | Auto-discovery of file schemas. |
| Aggregation | Hard and slow for even simple aggregation and grouping. | Fast for exploratory analysis; aggregate statistics on large datasets run quickly. | Fast aggregation across numerous datasets. |
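The compile-time type safety row is the one that bites in practice, so here is a minimal sketch of it (Person, the column name, and the app name are mine, not from the article): a typo in a Dataset lambda fails at compile time, while the same typo against a DataFrame column name only surfaces at runtime.

import org.apache.spark.sql.{AnalysisException, SparkSession}

case class Person(name: String, age: Int)

object TypeSafetyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Type Safety Demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("one", 23)).toDS()
    // Typed API: writing _.agee here would not even compile.
    ds.map(_.age + 1).show()

    // Untyped API: the same typo compiles fine and only fails at runtime.
    try {
      ds.toDF().select("agee").show()
    } catch {
      case e: AnalysisException => println(s"Runtime failure: ${e.getMessage}")
    }
  }
}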
Bonus 2.
Typed up the sample code introduced in the video above, as practice.
- Create a Dataset from a sequence of elements
- Create a Dataset from a sequence of case classes
- Create a Dataset from an RDD
- Create a Dataset from a DataFrame
import org.apache.spark.sql.SparkSession

case class Employee(name: String, age: Int)

object Ds4Ways {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Dataset 4 Ways")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1. From a sequence of elements
    val numSeq = Seq(1, 2, 3, 4, 5)
    val numDs = numSeq.toDS()
    numDs.show()

    // 2. From a sequence of case classes
    val empSeq = Seq(Employee("one", 23), Employee("two", 34), Employee("three", 54))
    val empSeqDs = empSeq.toDS()
    empSeqDs.show()

    // 3. From an RDD (tuple columns come out as _1, _2)
    val rdd = spark.sparkContext.parallelize(Seq((1, "spark"), (2, "Hive")))
    val rddDs = rdd.toDS()
    rddDs.show()

    // 4. From a DataFrame, via as[Employee]
    val empRdd = spark.sparkContext.parallelize(empSeq)
    val empDf = empRdd.toDF()
    val empDfDs = empDf.as[Employee]
    empDfDs.show()
  }
}
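A quick note tying this back to the interoperability row in the table: the conversions also run in reverse. A minimal sketch, reusing empSeqDs from the code above:

// Dataset[Employee] -> DataFrame (drops the static type)
val backToDf = empSeqDs.toDF()
// Dataset[Employee] -> RDD[Employee]
val backToRdd = empSeqDs.rdd
backToDf.printSchema()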
Bonus 3. Something I saw on Stack Overflow.
// Spark version < 2.x: toDS is available via sqlContext.implicits._
import sqlContext.implicits._
val myrdd = testRDD.toDS()

// Spark version >= 2.x: toDS comes from spark.implicits._
val spark: SparkSession = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val myrdd = testRDD.toDS()
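The < 2.x line assumes a sqlContext already exists. A hedged sketch of the full pre-2.x setup on, say, Spark 1.6 (the app name and testRDD contents are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("toDS-pre-2.x").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val testRDD = sc.parallelize(Seq(1, 2, 3))
val myrdd = testRDD.toDS()  // Dataset[Int]; the Dataset API arrived in Spark 1.6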