Array Processing with Polars
Polars
Let's start by importing Polars in the conventional way.
```python
import polars as pl
```
Why Polars?
We can make a Polars data frame with random elements like this.
```python
from random import random

def randlist(n):
    return [random() for _ in range(n)]

def rand_df2(rows):
    return pl.DataFrame({"A": randlist(rows), "B": randlist(rows)})

df = rand_df2(8)
df
```

```
shape: (8, 2)
┌──────────┬──────────┐
│ A        ┆ B        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.721364 ┆ 0.214304 │
│ 0.955427 ┆ 0.547667 │
│ 0.073088 ┆ 0.338197 │
│ 0.177586 ┆ 0.358761 │
│ 0.508176 ┆ 0.14593  │
│ 0.199816 ┆ 0.988286 │
│ 0.31845  ┆ 0.738086 │
│ 0.525122 ┆ 0.180186 │
└──────────┴──────────┘
```
Let's get the average of each column using conventional Python techniques,
namely for loops.
```python
rows, columns = df.shape
averages = []
for column in range(columns):
    row_sum = 0
    for row in range(rows):
        row_sum += df.item(row, column)
    averages.append(row_sum / rows)
print(averages)
```

```
[0.4348785539007972, 0.4389271590257381]
```
Now let's tidy it up and make a function out of it.
```python
def averages(df):
    rows, columns = df.shape
    return [
        sum(df.item(row, column) for row in range(rows)) / rows
        for column in range(columns)
    ]

averages(df)
```

```
[0.4348785539007972, 0.4389271590257381]
```
We would like to measure how long it takes to run some functions. We make a Python decorator.
```python
from time import time

def timer_func(func):
    def wrap_func(*args, **kwargs):
        t1 = time()
        result = func(*args, **kwargs)
        t2 = time()
        elapsed = t2 - t1
        print(f"Function {func.__name__!r} executed in {elapsed:.4f}s")
        return result
    return wrap_func

@timer_func
def long_time(n):
    for i in range(n):
        for j in range(n):
            i * j

long_time(10000)
```

```
Function 'long_time' executed in 4.1624s
```
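As an aside, `time.time` has limited resolution for very short calls. A sketch of a variant using `time.perf_counter` (a higher-resolution clock intended for interval timing) and `functools.wraps` (which preserves the wrapped function's name) might look like this; the `add` function is just an illustration.

```python
from functools import wraps
from time import perf_counter

def timer_func(func):
    @wraps(func)  # keep func.__name__ and docstring on the wrapper
    def wrap_func(*args, **kwargs):
        t1 = perf_counter()
        result = func(*args, **kwargs)
        t2 = perf_counter()
        print(f"Function {func.__name__!r} executed in {t2 - t1:.4f}s")
        return result
    return wrap_func

@timer_func
def add(a, b):
    return a + b

add(2, 3)
```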
Let's rewrite our averages function.
```python
@timer_func
def averages(df):
    rows, columns = df.shape
    return [
        sum(df.item(row, column) for row in range(rows)) / rows
        for column in range(columns)
    ]

for n in range(24):
    rows = 2 ** n
    df = rand_df2(rows)
    print(f"{rows:8}", end=' ')
    averages(df)
```
```
       1 Function 'averages' executed in 0.0000s
       2 Function 'averages' executed in 0.0000s
       4 Function 'averages' executed in 0.0000s
       8 Function 'averages' executed in 0.0000s
      16 Function 'averages' executed in 0.0000s
      32 Function 'averages' executed in 0.0000s
      64 Function 'averages' executed in 0.0001s
     128 Function 'averages' executed in 0.0001s
     256 Function 'averages' executed in 0.0002s
     512 Function 'averages' executed in 0.0004s
    1024 Function 'averages' executed in 0.0008s
    2048 Function 'averages' executed in 0.0016s
    4096 Function 'averages' executed in 0.0032s
    8192 Function 'averages' executed in 0.0067s
   16384 Function 'averages' executed in 0.0126s
   32768 Function 'averages' executed in 0.0271s
   65536 Function 'averages' executed in 0.0554s
  131072 Function 'averages' executed in 0.1109s
  262144 Function 'averages' executed in 0.2194s
  524288 Function 'averages' executed in 0.4477s
 1048576 Function 'averages' executed in 0.9050s
 2097152 Function 'averages' executed in 1.6663s
 4194304 Function 'averages' executed in 3.5078s
 8388608 Function 'averages' executed in 6.9581s
```
Now the Polars library has a built-in method to do just this. It's called `mean`.
```python
df.mean()
```

```
shape: (1, 2)
┌──────────┬──────────┐
│ A        ┆ B        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.500165 ┆ 0.500011 │
└──────────┴──────────┘
```
Let's run our timing test using it.
```python
@timer_func
def averages_p(df):
    df.mean()

for n in range(24):
    rows = 2 ** n
    df = rand_df2(rows)
    print(f"{rows:8}", end=' ')
    averages_p(df)
```
```
       1 Function 'averages_p' executed in 0.0003s
       2 Function 'averages_p' executed in 0.0001s
       4 Function 'averages_p' executed in 0.0001s
       8 Function 'averages_p' executed in 0.0001s
      16 Function 'averages_p' executed in 0.0001s
      32 Function 'averages_p' executed in 0.0001s
      64 Function 'averages_p' executed in 0.0001s
     128 Function 'averages_p' executed in 0.0001s
     256 Function 'averages_p' executed in 0.0001s
     512 Function 'averages_p' executed in 0.0001s
    1024 Function 'averages_p' executed in 0.0001s
    2048 Function 'averages_p' executed in 0.0001s
    4096 Function 'averages_p' executed in 0.0001s
    8192 Function 'averages_p' executed in 0.0001s
   16384 Function 'averages_p' executed in 0.0001s
   32768 Function 'averages_p' executed in 0.0002s
   65536 Function 'averages_p' executed in 0.0002s
  131072 Function 'averages_p' executed in 0.0003s
  262144 Function 'averages_p' executed in 0.0003s
  524288 Function 'averages_p' executed in 0.0006s
 1048576 Function 'averages_p' executed in 0.0008s
 2097152 Function 'averages_p' executed in 0.0014s
 4194304 Function 'averages_p' executed in 0.0025s
 8388608 Function 'averages_p' executed in 0.0048s
```
On the largest data frame, the Polars version is \(6.9581 / 0.0048 \approx 1450\) times faster!
Getting Started with Polars
Make a data frame
```python
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "name": [
            "Alice Archer",
            "Ben Brown",
            "Chloe Cooper",
            "Daniel Donovan",
        ],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        # in kilos
        "weight": [57.9, 72.5, 53.6, 83.1],
        # in metres
        "height": [1.56, 1.77, 1.65, 1.75],
    }
)
print(df)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
Write to a CSV file and read it again
```python
csv_file = "/tmp/sample.csv"
df.write_csv(csv_file)
df_csv = pl.read_csv(csv_file, try_parse_dates=True)
print(df_csv)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
Note that nothing about the data types is written to the CSV file, so information may be lost in this round trip. We can see this if we tell the reader not to parse the dates.
```python
pl.read_csv(csv_file)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ str        ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
Expressions and contexts
We can refer to columns of a data frame
```python
pl.col("weight")
```

```
<Expr ['col("weight")'] at 0x7FF09ACE4150>
```
and we can build expressions from them
```python
pl.col("weight") / (pl.col("height") ** 2)
```

```
<Expr ['[(col("weight")) / (col("heigh…'] at 0x7FF09ACE4250>
```
You might recognise this expression as the body mass index (BMI), a "convenient rule of thumb used to broadly categorize a person as underweight, normal or overweight".
We can use such expressions in Polars contexts, for example to select columns by expression.
```python
result = df.select(
    pl.col("name"),
    pl.col("birthdate").dt.year().alias("birth_year"),
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)
print(result)
```

```
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘
```
- Note how we have selected two columns and one expression.
- We have transformed `birthdate` to extract the year from the date and renamed the column to `birth_year`.
- We have given the expression calculated from the columns `weight` and `height` the name `bmi`.
- Note that the original data frame remains unchanged.
Expression expansion
Polars allows one expression to be applied to multiple columns at once. This is called expression expansion.
```python
result = df.select(
    pl.col("name"),
    (pl.col("weight", "height") * 0.95).round(2).name.suffix("-5%"),
)
print(result)
```

```
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 50.92     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘
```
- One expression applies to both columns, `weight` and `height`.
- Each column is
  - first multiplied by `0.95`
  - then rounded to 2 decimal places
  - then renamed by suffixing its name with `-5%`
Sometimes we want to add columns to our data frame rather than select columns
from it. For this we use `with_columns`.
```python
result = df.with_columns(
    birth_year=pl.col("birthdate").dt.year(),
    bmi=pl.col("weight") / (pl.col("height") ** 2),
)
print(result)
```

```
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘
```
- Here the original columns remain unchanged and we add two columns.
- Note, again, that the original data frame remains unchanged.
- Note how the new column `birth_year` is of type `i32`. This takes half the space of an `i64` and is more than sufficient to store calendar years.
- We use a more convenient "assignment form" to make the new columns, rather than `alias`. If the new column name contains, for example, a space then we would need to use `alias`.
Sometimes we need to filter the rows of a data frame. For example
```python
result = df.filter(pl.col("birthdate").dt.year() < 1990)
print(result)
```

```
shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
- Note how `filter` takes a predicate on the column `birthdate`.
- Such filters can be more complex Boolean expressions over multiple columns.
Rather than writing an explicit Boolean "and" (∧), we can pass multiple filters.

```python
result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)
```

```
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
```
- Note that you can indicate to `is_between` whether to include either endpoint or both by using the `closed` optional argument.
We have seen selecting columns, adding columns and filtering rows. Another
commonly needed operation on data frames is grouping. For this we use
`group_by`.
Suppose we would like to see how many rows in the data frame have a birth date in each decade. We can do it like this.
```python
result = df.group_by(
    decade=(pl.col("birthdate").dt.year() // 10 * 10),
    maintain_order=True,
).len()
print(result)
```

```
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 1   │
│ 1980   ┆ 3   │
└────────┴─────┘
```
- We take the `birthdate` column.
- We extract its `year`.
- We divide its `year` by 10 using integer division (`//`). This drops the fractional part.
- Now we multiply by 10 again. We are left with the decade.
- We name it `decade`.
- We use `maintain_order` to keep the order of the decades consistent with the dates in the data frame. This is slower but more convenient to read.
- We take the length, i.e. the number of rows in each `decade`.
Now that we have grouped the rows into decades, we can take aggregations over each group.
```python
result = df.group_by(
    decade=(pl.col("birthdate").dt.year() // 10 * 10),
    maintain_order=True,
).agg(
    pl.len().alias("sample_size"),
    pl.col("weight").mean().round(2).alias("avg_weight"),
    pl.col("height").max().alias("tallest"),
)
print(result)
```

```
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 1           ┆ 57.9       ┆ 1.56    │
│ 1980   ┆ 3           ┆ 69.73      ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘
```
- Here we take
  - the "length" of each group and name it `sample_size`
  - the `weight`, which we average (`mean`), round to 2 decimal places and name `avg_weight`
  - the `height`, of which we take the largest (`max`) and name `tallest`
- Note that we cannot use the assignment form for `agg`; we must use `alias`.
We can build more complex queries by chaining expressions and contexts.
```python
result = (
    df.with_columns(
        pl.col("name").str.split(by=" ").list.first(),
        decade=(pl.col("birthdate").dt.year() // 10 * 10),
    )
    .select(pl.all().exclude("birthdate"))
    .group_by(pl.col("decade"), maintain_order=True)
    .agg(
        pl.col("name"),
        pl.col("weight", "height").mean().round(2).name.prefix("avg_"),
    )
)
print(result)
```

```
shape: (2, 4)
┌────────┬────────────────────────────┬────────────┬────────────┐
│ decade ┆ name                       ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---                        ┆ ---        ┆ ---        │
│ i32    ┆ list[str]                  ┆ f64        ┆ f64        │
╞════════╪════════════════════════════╪════════════╪════════════╡
│ 1990   ┆ ["Alice"]                  ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ ["Ben", "Chloe", "Daniel"] ┆ 69.73      ┆ 1.72       │
└────────┴────────────────────────────┴────────────┴────────────┘
```
- We extract the first name of each person by string manipulation.
- The `name` column is aggregated into a list of names.
To conclude, we will look at two ways of combining data frames.
First of all we make a new data frame and then join it with the first one.
```python
df2 = pl.DataFrame(
    {
        "name": [
            "Ben Brown",
            "Daniel Donovan",
            "Alice Archer",
            "Chloe Cooper",
        ],
        "parent": [True, False, False, False],
        "siblings": [1, 2, 3, 4],
    }
)
print(df2)
```

```
shape: (4, 3)
┌────────────────┬────────┬──────────┐
│ name           ┆ parent ┆ siblings │
│ ---            ┆ ---    ┆ ---      │
│ str            ┆ bool   ┆ i64      │
╞════════════════╪════════╪══════════╡
│ Ben Brown      ┆ true   ┆ 1        │
│ Daniel Donovan ┆ false  ┆ 2        │
│ Alice Archer   ┆ false  ┆ 3        │
│ Chloe Cooper   ┆ false  ┆ 4        │
└────────────────┴────────┴──────────┘
```
We see the same four names with additional columns. Now let's join this data frame to the earlier one.
```python
print(df.join(df2, on="name", how="left"))
```

```
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────┬──────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ parent ┆ siblings │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ bool   ┆ i64      │
╞════════════════╪════════════╪════════╪════════╪════════╪══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ false  ┆ 3        │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ true   ┆ 1        │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ false  ┆ 4        │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ false  ┆ 2        │
└────────────────┴────────────┴────────┴────────┴────────┴──────────┘
```
We get a combination of the two data frames.
There are many join strategies for dealing with, e.g., cases where a name is present in one data frame but not in the other, or where a name occurs multiple times in one or other data frame. It is worth taking the time to experiment with them.
More straightforwardly than joining, we can just concatenate data frames, horizontally or vertically.
```python
df3 = pl.DataFrame(
    {
        "name": [
            "Ethan Edwards",
            "Fiona Foster",
            "Grace Gibson",
            "Henry Harris",
        ],
        "birthdate": [
            date(1977, 5, 10),
            date(1975, 6, 23),
            date(1973, 7, 22),
            date(1971, 8, 3),
        ],
        "weight": [67.9, 72.5, 57.6, 93.1],
        "height": [1.76, 1.6, 1.66, 1.8],
    }
)
print(pl.concat([df, df3], how="vertical"))
```

```
shape: (8, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
│ Ethan Edwards  ┆ 1977-05-10 ┆ 67.9   ┆ 1.76   │
│ Fiona Foster   ┆ 1975-06-23 ┆ 72.5   ┆ 1.6    │
│ Grace Gibson   ┆ 1973-07-22 ┆ 57.6   ┆ 1.66   │
│ Henry Harris   ┆ 1971-08-03 ┆ 93.1   ┆ 1.8    │
└────────────────┴────────────┴────────┴────────┘
```
Other sources
Here are some other introductory tutorials for Polars that you might find useful.
This week's lab
- Chrestomathy
- Rosetta Code
- the book: ISLP
- practice
- pairs for this week's lab
```python
[('Jagoda', 'Giorgia'),
 ('Carlijn', 'Adrian'),
 ('Roman', 'Emilio'),
 ('Fridolin', 'Efraim'),
 ('Emil', 'Yaohong'),
 ('Matúš', 'Nora'),
 ('Natalia', 'Madalina'),
 ('Max', 'Asmahene'),
 ('Jess', 'Emilie'),
 ('Mariana', 'Romane'),
 ('Lynn', 'Louanne', 'Janina'),]
```

```
[('Jagoda', 'Giorgia'),
 ('Carlijn', 'Adrian'),
 ('Roman', 'Emilio'),
 ('Fridolin', 'Efraim'),
 ('Emil', 'Yaohong'),
 ('Matúš', 'Nora'),
 ('Natalia', 'Madalina'),
 ('Max', 'Asmahene'),
 ('Jess', 'Emilie'),
 ('Mariana', 'Romane'),
 ('Lynn', 'Louanne', 'Janina')]
```