[Python, Rust] ConnectorX - load data from DBs in the fastest and most memory efficient way.

https://github.com/sfu-db/connector-x

'ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.'

  • A data-loading library implemented in Rust. Going by the docs alone, the performance is ...
  • The documentation is still quite thin.

So far I have only tested it against MariaDB:

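  • Options cannot be set directly in the DB URI. With the docs this thin, I don't yet know whether there is another way to configure them.
  • read_sql raises an error when a column value is NULL (as a stopgap, wrapping the column in IFNULL in the query works, though).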
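The IFNULL stopgap above can be sketched roughly as follows. This is only an illustration under assumptions: `ifnull_select` and `load` are hypothetical helper names, the table, columns, and default values are made up, and the connection URI is a placeholder (the `mysql://` scheme for MariaDB is my assumption). The only ConnectorX call used is `cx.read_sql(conn, query)`.

```python
def ifnull_select(table, columns, defaults):
    """Build a SELECT that wraps the given columns in IFNULL(col, default).

    `defaults` maps a column name to a SQL literal written as a string.
    Identifiers are not escaped, so only pass trusted names.
    """
    parts = []
    for col in columns:
        if col in defaults:
            parts.append(f"IFNULL({col}, {defaults[col]}) AS {col}")
        else:
            parts.append(col)
    return f"SELECT {', '.join(parts)} FROM {table}"


def load(conn_uri, query):
    """Run the query through ConnectorX and return a DataFrame."""
    # Imported lazily so the query helper works without connectorx installed.
    import connectorx as cx
    return cx.read_sql(conn_uri, query)


# Hypothetical table and columns, for illustration only.
query = ifnull_select("users", ["id", "name", "age"], {"name": "''", "age": "0"})
print(query)

# Placeholder URI; not executed here:
# df = load("mysql://username:password@server:3306/database", query)
```

Replacing NULL with a default changes the data, so the defaults should be values you can tell apart from real ones afterwards.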

 

 

 

GitHub - sfu-db/connector-x: Fastest library to load data from DB to DataFrames in Rust and Python


https://towardsdatascience.com/connectorx-the-fastest-way-to-load-data-from-databases-a65d4d4062d5

 

ConnectorX: The fastest library for loading your Python data frame

Accelerate Pandas read_sql by 10x with one line of code


Three main reasons make ConnectorX achieve this performance:

  1. Written in native language: Unlike other libraries, ConnectorX is written in Rust, which avoids the additional cost of implementing data-intensive applications in Python.
  2. Copy exactly once: While existing solutions more or less do data copy multiple times when downloading the data from databases, the implementation of ConnectorX follows the “zero-copy” principle. We manage to copy the data exactly once, directly from the Source to Destination even under parallelism.
  3. CPU cache efficient: We apply several optimizations to make ConnectorX CPU cache-friendly. Other than “zero-copy” implementation, data processing in ConnectorX is conducted in a streaming fashion to reduce cache miss. Another example is that when we construct strings in Python, we write a batch of strings into one pre-allocated buffer instead of allocating separated locations for each one.
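To get a feel for the "one pre-allocated buffer" idea in point 3, here is a toy sketch in plain Python. It is not ConnectorX's actual code (that is in Rust); `pack_strings` and `get_string` are hypothetical names, and the layout (one contiguous byte buffer plus an offsets list) is simply one common way to store a batch of strings together.

```python
def pack_strings(values):
    """Pack a batch of strings into one contiguous UTF-8 buffer.

    Returns (buf, offsets), where string i occupies
    buf[offsets[i]:offsets[i + 1]].
    """
    encoded = [v.encode("utf-8") for v in values]
    total = sum(len(b) for b in encoded)
    buf = bytearray(total)  # a single allocation sized for the whole batch
    offsets = [0]
    pos = 0
    for b in encoded:
        buf[pos:pos + len(b)] = b  # write into the shared buffer
        pos += len(b)
        offsets.append(pos)
    return buf, offsets


def get_string(buf, offsets, i):
    """Decode the i-th string back out of the shared buffer."""
    return bytes(buf[offsets[i]:offsets[i + 1]]).decode("utf-8")


buf, offsets = pack_strings(["rust", "python", "db"])
print(offsets)
print(get_string(buf, offsets, 1))
```

Since the strings sit next to each other in memory, scanning the batch touches one contiguous region instead of many scattered objects, which is the cache-friendliness the quote is getting at.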