For a 21st-century data analyst or data scientist, the most essential tool of all is Pandas! Pandas has made life easier with its in-memory processing power and its DataFrames. Instead of dumping an entire dataset into a SQL database and querying it with SQL to view the output, we now just read the dataset files into a pandas DataFrame, and that whole tedious round trip is gone.
Along with Pandas, there are alternatives that can further boost your computational power by optimizing your pandas workflows. Here are some powerful options that are easy to install and let you speed up your work!
“Fasten your seatbelts, it’s going to be a bumpy ride”
Modin
Tired of longer waiting times while dealing with large dataframes? — Modin has come to the rescue!
Modin uses Ray or Dask to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.
“Modin accelerates your pandas workflows with one line of code change”
Installation
Modin is a completely open-source project from UC Berkeley’s RISELab. Here’s the GitHub link for the project:
https://github.com/modin-project/modin
Modin can be quickly installed via PyPI:
pip install modin
Usage
Using Modin is as simple as using a pandas DataFrame. Just replace import pandas as pd with import modin.pandas as pd:
import modin.pandas as pd
You’re all set! Enjoy your usual pandas workflows with enhanced computational power and faster speeds :)
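Here’s a minimal sketch of how it looks in practice (the CSV file name is hypothetical):
import modin.pandas as pd
# Hypothetical file; Modin parallelizes the read across all available cores
df = pd.read_csv("large_dataset.csv")
# The rest of the pandas API works unchanged
print(df.head())
print(df.describe())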
Dask
Need faster processing power than Modin? — Say hello to Dask!
Dask.distributed is a lightweight library for distributed computing in Python. Dask.distributed is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.
Installation
Dask is also an open-source framework and is actively maintained by its developers. Here’s the source code on GitHub:
https://github.com/dask/dask
Dask can be easily installed using pip:
python -m pip install dask distributed --upgrade
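Usage
Dask’s dataframe API mirrors a large subset of pandas, while dask.distributed handles the scheduling. Here’s a minimal sketch on a single machine (the file pattern and column names are hypothetical):
from dask.distributed import Client
import dask.dataframe as dd
# Client() with no arguments starts a local scheduler and workers
client = Client()
# Lazily read a set of CSVs as one partitioned dataframe (hypothetical files)
df = dd.read_csv("data/*.csv")
# Operations build a task graph; .compute() triggers the actual work
result = df.groupby("user_id")["amount"].sum().compute()
print(result)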
Ray
Want to unleash the power of distributed computing? — Welcome Ray!
Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It is fully compatible with deep learning frameworks like TensorFlow, PyTorch, and MXNet, and it is natural to use one or more of them alongside Ray in many applications (Ray’s own reinforcement learning libraries, for example, use TensorFlow and PyTorch heavily). Ray is also one of the engines Modin uses to provide an effortless way to speed up pandas notebooks, scripts, and libraries.
Installation
Ray is also an open-source project. Here’s the GitHub link for Ray:
https://github.com/ray-project/ray/
Ray can be quickly installed using pip by the following command:
pip install ray
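Usage
At its core, Ray lets you turn ordinary Python functions into parallel tasks with a decorator. A minimal sketch:
import ray
ray.init()  # starts Ray on the local machine
# A remote function runs as a task on any available worker
@ray.remote
def square(x):
    return x * x
# Launch ten tasks in parallel and collect the results
futures = [square.remote(i) for i in range(10)]
print(ray.get(futures))
And if you want Modin to run on top of Ray specifically, set the MODIN_ENGINE environment variable to ray before importing modin.pandas.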
PySpark
Want to leverage the in-memory processing power? — PySpark is just the right choice!
PySpark is the Python API for Apache Spark and is great for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. PySpark leverages Spark’s in-memory compute, making its dataframes faster than ever. Spark dataframes are a favorite of data scientists and data engineers because they speed up typical pandas workloads with in-memory, distributed computation and offer great flexibility for connecting to outside frameworks.
Installation
Spark is maintained by Apache and is an open-source framework with an active community of developers. Here’s the source code for Spark on GitHub:
https://github.com/apache/spark
Spark can be easily installed using the following commands:
Make sure you have Java installed on your machine:
brew cask install java
As Spark is written in Scala, we need to include the Scala dependency:
brew install scala
Finally, after installing all the dependencies, install PySpark using the following command:
brew install apache-spark
Usage
Creating a Spark DataFrame is as easy as the following. Unlike Modin, though, there are some syntax differences between Spark and pandas:
df = spark.read.json("examples/src/main/resources/people.json")
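Note that spark here refers to a SparkSession, which has to be created first. Here’s a minimal sketch, assuming a local install and the people.json example file that ships with the Spark source tree:
from pyspark.sql import SparkSession
# Build (or reuse) a local session; "local[*]" uses all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pandas-alternatives") \
    .getOrCreate()
df = spark.read.json("examples/src/main/resources/people.json")
df.show()                 # display the dataframe, Spark-style
df.select("name").show()  # column selection differs slightly from pandas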
Confused?
Confused by all the options? Don’t worry. Just analyze the dataset you’re dealing with and choose the appropriate framework. Say you’re dealing with a dataset that is just a few gigabytes: Modin would be a good choice here, as it lets you maximize the potential of your laptop. But say you’re dealing with an ETL pipeline where the volume of ingested data is huge. In such cases PySpark would do the job, but you might also consider a framework such as Ray, where you can distribute the load over a cluster of machines and take full advantage of parallel computing!
Feel free to comment with better alternatives and your suggestions for making the pandas dataframes faster!