pandas-gpt: Power up your data science workflow with AI dataframe queries

Ryan Vandersmith · Level Up Coding · Jun 12, 2023 · 5 min read

With the recent explosion in popularity of large language models such as ChatGPT, Bing Chat, and Character.ai, hardly a day goes by without hearing about a new, fascinating use case for these truly remarkable AI chatbots.

…except maybe for this one

A few weeks ago, I was in the midst of writing a Python notebook to analyze the engagement and growth of a product that we recently launched at DFINITY. While typing df.plot() for the hundredth time, I couldn’t help but wonder if there was a way to speed up my workflow using AI.

If you’ve tried using ChatGPT for data analysis, you’ll know that it’s often difficult to generate correct Python source code involving a pandas DataFrame. Since the content of your dataset is only available at runtime, LLMs tend to hallucinate column names and make all sorts of assumptions about the structure of your data. Here’s a quick example:
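Suppose you ask ChatGPT, “Which day of the week had the most orders?” for a sales dataset. Its answer tends to look something like this (a reconstruction rather than an actual transcript; the filename and the guessed “OrderDate” column are illustrative):

import pandas as pd

# Load the sales data (hypothetical filename) and parse the date column
df = pd.read_csv('sales.csv')
df['OrderDate'] = pd.to_datetime(df['OrderDate'])

# Find the weekday with the most orders
busiest_day = df['OrderDate'].dt.day_name().value_counts().idxmax()
print(busiest_day)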

This is an impressive solution, but it doesn’t quite work for our dataset. ChatGPT almost guessed the column name correctly, but the real “Order Date” contains a space between the two words, and its values use a localized month/day/year format. Let’s ask ChatGPT to fix these issues.
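It came back with roughly the following (again a reconstruction, not ChatGPT’s verbatim output; the exact date-format string is an assumption):

import pandas as pd

# The real column name contains a space, and the dates are month/day/year
order_dates = pd.to_datetime(df['Order Date'], format='%m/%d/%Y')  # format string is an assumption

# Find the weekday with the most orders
busiest_day = order_dates.dt.day_name().value_counts().idxmax()
print(f'Day with the most orders: {busiest_day}')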

Works like a charm! Now, if only it were possible to automate this process as a Python library, you could type this directly into your notebook:

df.ask('which day of the week had the most orders')

The good news is that this library now exists!

pandas-gpt is a new Python package which makes it possible to do (almost) anything with a pandas DataFrame using natural language queries.

This works by using prompt engineering: the library generates a Python function from your query, based on column information gathered from your dataset at runtime.
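To give a flavor of how such an approach can work, here is a minimal sketch of the general idea rather than pandas-gpt’s actual implementation. It assumes the legacy openai.ChatCompletion API (matching the openai.api_key style used below) and a hypothetical process(df) convention for the generated function:

import openai
import pandas as pd

def ask(df: pd.DataFrame, query: str):
    # Include real column names and a few sample rows so the model
    # doesn't have to guess the structure of the data
    prompt = (
        f"Write a Python function `process(df)` that answers: {query}\n"
        f"Columns: {list(df.columns)}\n"
        f"Sample rows:\n{df.head(3).to_csv(index=False)}\n"
        "Respond with only the Python code."
    )
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
    )
    code = response['choices'][0]['message']['content']

    # Run the generated code and call the resulting function on a copy
    # (assumes the model returned bare Python defining `process`)
    scope = {}
    exec(code, scope)
    return scope['process'](df.copy())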

You can install the package by running the following command:

pip install pandas-gpt

Below are a few examples of what’s possible with this library, which you can try online in this Google Colab notebook.

Let’s start by importing pandas_gpt and loading our dataset:

import pandas as pd
import pandas_gpt

# Load the demo sales dataset
df = pd.read_csv('https://gist.githubusercontent.com/bluecoconut/9ce2135aafb5c6ab2dc1d60ac595646e/raw/c93c3500a1f7fae469cba716f09358cfddea6343/sales_demo_with_pii_and_all_states.csv')

Next, configure your OpenAI API key. I generally use the following code snippet for notebook environments such as Jupyter and Google Colab:

import openai
openai.api_key = input('Enter API key from (https://platform.openai.com/account/api-keys): ')

# Clear the API key from the notebook's output panel
from IPython.display import clear_output
clear_output()

From here, the only limit is your imagination!

# Data transformation
df = df.ask('drop purchases from Laurenchester, NY')
df = df.ask('add a new Category column with values "cheap", "regular", or "expensive"')

# Queries
weekday = df.ask('which day of the week had the largest number of orders?')
top_10 = df.ask('what are the top 10 most popular products, as a table')

# Plotting
df.ask('plot monthly and hourly sales')
top_10.ask('horizontal bar plot with pastel colors')

# Allow changes to original dataset
df.ask('do something interesting', mutable=True)

# Show source code before running
df.ask('convert prices from USD to GBP', verbose=True)

The result for that last example blew my mind.
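It generated something along these lines (reconstructed from memory; the exchange-rate endpoint is a stand-in of my own choosing, not necessarily the one GPT picked):

import requests

# Look up the current USD -> GBP exchange rate (endpoint chosen for illustration)
rate = requests.get('https://open.er-api.com/v6/latest/USD').json()['rates']['GBP']

# Convert the price column from USD to GBP
df['Price'] = df['Price'] * rate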

As you can see, pandas-gpt is capable of sending requests to APIs and importing whatever Python packages are best suited for the task at hand.

By the way, if you just want to view the generated source code without running it, simply replace df.ask() with df.ask.code(). Likewise, you can view the original GPT prompt with df.ask.prompt().
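For example:

# View the generated pandas code without executing it
df.ask.code('which day of the week had the most orders?')

# View the underlying GPT prompt
df.ask.prompt('which day of the week had the most orders?')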

You can also configure the default settings for the ask() function:

import pandas_gpt

pandas_gpt.verbose = True # Always display code before running
pandas_gpt.mutable = True # Enable modifying the original DataFrame

Since ChatGPT is known to occasionally produce surprising or incorrect results, I would recommend using this library in a sandboxed environment instead of your local computer. Google Colab is a great place to try pushing the capabilities of pandas-gpt. I’ve also found this library to be useful for exploratory data analysis and setting up AutoML pipelines in Kaggle notebooks.

Here are a few tips for getting the best results out of pandas-gpt:

  • If your dataset has a large number of columns, try running prompts on a subset of your data, e.g. df[['Time', 'Price']].ask('xkcd line plot').
  • Double-check that your prompt is unambiguous. It’s usually possible to fix incorrect results by slightly adjusting the wording of your request.
  • As a rule of thumb, pandas-gpt is best at using the most popular Python libraries. Here is a ranking based on total downloads in case you would like to use this to your advantage.

Definitely let me know if you come across any interesting, unexpected, or humorous use cases for this project. Feel free to report a bug or contribute a feature at the pandas-gpt GitHub repository.

Thanks for reading, and safe travels into these uncharted waters.

~ Ryan
