Array Processing with Polars
Polars
Let's start by importing Polars in the conventional way.
```python
import polars as pl
```
Why Polars?
We can make a Polars data frame with random elements like this.
```python
from random import random

def randlist(n):
    return [random() for _ in range(n)]

def rand_df2(rows):
    return pl.DataFrame({"A": randlist(rows), "B": randlist(rows)})

df = rand_df2(8)
df
```

```
shape: (8, 2)
┌──────────┬──────────┐
│ A        ┆ B        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.721364 ┆ 0.214304 │
│ 0.955427 ┆ 0.547667 │
│ 0.073088 ┆ 0.338197 │
│ 0.177586 ┆ 0.358761 │
│ 0.508176 ┆ 0.14593  │
│ 0.199816 ┆ 0.988286 │
│ 0.31845  ┆ 0.738086 │
│ 0.525122 ┆ 0.180186 │
└──────────┴──────────┘
```
Let's get the average of each column using conventional Python techniques,
namely for loops.
```python
rows, columns = df.shape
averages = []
for column in range(columns):
    row_sum = 0
    for row in range(rows):
        row_sum += df.item(row, column)
    averages.append(row_sum / rows)
print(averages)
```

```
[0.4348785539007972, 0.4389271590257381]
```
Now let's tidy it up and make a function out of it.
```python
def averages(df):
    rows, columns = df.shape
    return [
        sum(df.item(row, column) for row in range(rows)) / rows
        for column in range(columns)
    ]

averages(df)
```

```
[0.4348785539007972, 0.4389271590257381]
```
We would like to measure how long it takes to run some functions. We make a Python decorator.
```python
from time import time

def timer_func(func):
    def wrap_func(*args, **kwargs):
        t1 = time()
        result = func(*args, **kwargs)
        t2 = time()
        elapsed = t2 - t1
        print(f"Function {func.__name__!r} executed in {elapsed:.4f}s")
        return result
    return wrap_func

@timer_func
def long_time(n):
    for i in range(n):
        for j in range(n):
            i * j

long_time(10000)
```

```
Function 'long_time' executed in 4.1624s
```
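As an aside, `time.time` has limited resolution for very short calls. A sketch of a variant using `time.perf_counter` (a higher-resolution clock intended for interval timing) and `functools.wraps` (which preserves the wrapped function's name) might look like this; the `add` function is just an illustration.

```python
from functools import wraps
from time import perf_counter

def timer_func(func):
    @wraps(func)  # keep func.__name__ and docstring on the wrapper
    def wrap_func(*args, **kwargs):
        t1 = perf_counter()
        result = func(*args, **kwargs)
        t2 = perf_counter()
        print(f"Function {func.__name__!r} executed in {t2 - t1:.4f}s")
        return result
    return wrap_func

@timer_func
def add(a, b):
    return a + b

add(2, 3)
```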
Let's rewrite our averages function.
```python
@timer_func
def averages(df):
    rows, columns = df.shape
    return [
        sum(df.item(row, column) for row in range(rows)) / rows
        for column in range(columns)
    ]

for n in range(24):
    rows = 2 ** n
    df = rand_df2(rows)
    print(f"{rows:8}", end=' ')
    averages(df)
```
```
       1 Function 'averages' executed in 0.0000s
       2 Function 'averages' executed in 0.0000s
       4 Function 'averages' executed in 0.0000s
       8 Function 'averages' executed in 0.0000s
      16 Function 'averages' executed in 0.0000s
      32 Function 'averages' executed in 0.0000s
      64 Function 'averages' executed in 0.0001s
     128 Function 'averages' executed in 0.0001s
     256 Function 'averages' executed in 0.0002s
     512 Function 'averages' executed in 0.0004s
    1024 Function 'averages' executed in 0.0008s
    2048 Function 'averages' executed in 0.0016s
    4096 Function 'averages' executed in 0.0032s
    8192 Function 'averages' executed in 0.0067s
   16384 Function 'averages' executed in 0.0126s
   32768 Function 'averages' executed in 0.0271s
   65536 Function 'averages' executed in 0.0554s
  131072 Function 'averages' executed in 0.1109s
  262144 Function 'averages' executed in 0.2194s
  524288 Function 'averages' executed in 0.4477s
 1048576 Function 'averages' executed in 0.9050s
 2097152 Function 'averages' executed in 1.6663s
 4194304 Function 'averages' executed in 3.5078s
 8388608 Function 'averages' executed in 6.9581s
```
Now the Polars library has a built-in method to do just this. It's called `mean`.
```python
df.mean()
```

```
shape: (1, 2)
┌──────────┬──────────┐
│ A        ┆ B        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.500165 ┆ 0.500011 │
└──────────┴──────────┘
```
Let's run our timing test using it.
```python
@timer_func
def averages_p(df):
    df.mean()

for n in range(24):
    rows = 2 ** n
    df = rand_df2(rows)
    print(f"{rows:8}", end=' ')
    averages_p(df)
```
```
       1 Function 'averages_p' executed in 0.0003s
       2 Function 'averages_p' executed in 0.0001s
       4 Function 'averages_p' executed in 0.0001s
       8 Function 'averages_p' executed in 0.0001s
      16 Function 'averages_p' executed in 0.0001s
      32 Function 'averages_p' executed in 0.0001s
      64 Function 'averages_p' executed in 0.0001s
     128 Function 'averages_p' executed in 0.0001s
     256 Function 'averages_p' executed in 0.0001s
     512 Function 'averages_p' executed in 0.0001s
    1024 Function 'averages_p' executed in 0.0001s
    2048 Function 'averages_p' executed in 0.0001s
    4096 Function 'averages_p' executed in 0.0001s
    8192 Function 'averages_p' executed in 0.0001s
   16384 Function 'averages_p' executed in 0.0001s
   32768 Function 'averages_p' executed in 0.0002s
   65536 Function 'averages_p' executed in 0.0002s
  131072 Function 'averages_p' executed in 0.0003s
  262144 Function 'averages_p' executed in 0.0003s
  524288 Function 'averages_p' executed in 0.0006s
 1048576 Function 'averages_p' executed in 0.0008s
 2097152 Function 'averages_p' executed in 0.0014s
 4194304 Function 'averages_p' executed in 0.0025s
 8388608 Function 'averages_p' executed in 0.0048s
```
On the largest data frame, the Polars version is \(6.9581 / 0.0048 \approx 1450\) times faster!
Getting Started with Polars
Make a data frame
```python
import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "name": [
            "Alice Archer",
            "Ben Brown",
            "Chloe Cooper",
            "Daniel Donovan",
        ],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        # in kilos
        "weight": [57.9, 72.5, 53.6, 83.1],
        # in metres
        "height": [1.56, 1.77, 1.65, 1.75],
    }
)
print(df)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
Write to a CSV file and read it again
```python
csv_file = "/tmp/sample.csv"
df.write_csv(csv_file)
df_csv = pl.read_csv(csv_file, try_parse_dates=True)
print(df_csv)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
Note that nothing about the data types is written to the CSV file, so information may be lost in this round trip. We can see this if we tell the reader not to parse the dates.
```python
pl.read_csv(csv_file)
```

```
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ str        ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
Expressions and contexts
We can refer to columns of a data frame
```python
pl.col("weight")
```

```
<Expr ['col("weight")'] at 0x7FF09ACE4150>
```
and we can build expressions from them
```python
pl.col("weight") / (pl.col("height") ** 2)
```

```
<Expr ['[(col("weight")) / (col("heigh…'] at 0x7FF09ACE4250>
```
You might recognise this expression as the body mass index (BMI), a "convenient rule of thumb used to broadly categorize a person as underweight, normal or overweight".
We can use such expressions in Polars contexts, for example to select columns by expression.
```python
result = df.select(
    pl.col("name"),
    pl.col("birthdate").dt.year().alias("birth_year"),
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)
print(result)
```

```
shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘
```
- Note how we have selected two columns and one expression.
- We have transformed `birthdate` to extract the year from the date and renamed the column to `birth_year`.
- We have given the expression calculated from the columns `weight` and `height` the name `bmi`.
- Note that the original data frame remains unchanged.
Expression expansion
Polars allows one expression to be applied to multiple columns at once. This is called expression expansion.
```python
result = df.select(
    pl.col("name"),
    (pl.col("weight", "height") * 0.95).round(2).name.suffix("-5%"),
)
print(result)
```

```
shape: (4, 3)
┌────────────────┬───────────┬───────────┐
│ name           ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---       ┆ ---       │
│ str            ┆ f64       ┆ f64       │
╞════════════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 50.92     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94     ┆ 1.66      │
└────────────────┴───────────┴───────────┘
```
- One expression applies to both columns, `weight` and `height`.
- Each column is
  - first multiplied by `0.95`
  - then rounded to 2 decimal places
  - then renamed by suffixing its name with `-5%`
Sometimes we want to add columns to our data frame rather than select columns
from it. For this we use `with_columns`.
```python
result = df.with_columns(
    birth_year=pl.col("birthdate").dt.year(),
    bmi=pl.col("weight") / (pl.col("height") ** 2),
)
print(result)
```

```
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘
```
- Here the original columns remain unchanged and we add two columns.
- Note, again, that the original data frame remains unchanged.
- Note how the new column `birth_year` is of type `i32`. This takes half the space of an `i64` and is more than sufficient to store calendar years.
- We use a more convenient "assignment form" to make the new columns, rather than `alias`. If the new column name contains, for example, a space then we would need to use `alias`.
Sometimes we need to filter the rows of a data frame. For example
```python
result = df.filter(pl.col("birthdate").dt.year() < 1990)
print(result)
```

```
shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
```
- Note how `filter` takes a predicate on the column `birthdate`.
- Such filters can be more complex Boolean expressions over multiple columns.
Rather than writing an explicit Boolean "and" (∧), we can pass multiple filters.

```python
result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)
```

```
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘
```
- Note that you can indicate to `is_between` whether to include either endpoint or both by using the `closed` optional argument.
We have seen selecting columns, adding columns and filtering rows. Another
commonly needed operation on data frames is grouping. For this we use
`group_by`.
Suppose we would like to see how many rows in the data frame have a birth date in each decade. We can do it like this.
```python
result = df.group_by(
    decade=(pl.col("birthdate").dt.year() // 10 * 10),
    maintain_order=True,
).len()
print(result)
```

```
shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 1   │
│ 1980   ┆ 3   │
└────────┴─────┘
```
- We take the `birthdate` column.
- We extract its `year`.
- We divide its `year` by 10 using integer division (`//`). This drops the fractional part.
- Now we multiply by 10 again. We are left with the decade.
- We name it `decade`.
- We use `maintain_order` to keep the order of the decades consistent with the dates in the data frame. This is slower but more convenient to read.
- We take the length, i.e. the number of rows in each `decade`.
Now that we have grouped the rows into decades, we can take aggregations over each group.
```python
result = df.group_by(
    decade=(pl.col("birthdate").dt.year() // 10 * 10),
    maintain_order=True,
).agg(
    pl.len().alias("sample_size"),
    pl.col("weight").mean().round(2).alias("avg_weight"),
    pl.col("height").max().alias("tallest"),
)
print(result)
```

```
shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 1           ┆ 57.9       ┆ 1.56    │
│ 1980   ┆ 3           ┆ 69.73      ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘
```
- Here we take
  - the "length" of each group and name it `sample_size`
  - the `weight`, which we average (`mean`), round to 2 decimal places and name `avg_weight`
  - the `height`, of which we take the largest (`max`) and name `tallest`
- Note that we cannot use the assignment form for `agg`; we must use `alias`.
We can build more complex queries by chaining expressions and contexts.
```python
result = (
    df.with_columns(
        pl.col("name").str.split(by=" ").list.first(),
        decade=(pl.col("birthdate").dt.year() // 10 * 10),
    )
    .select(pl.all().exclude("birthdate"))
    .group_by(pl.col("decade"), maintain_order=True)
    .agg(
        pl.col("name"),
        pl.col("weight", "height").mean().round(2).name.prefix("avg_"),
    )
)
print(result)
```

```
shape: (2, 4)
┌────────┬────────────────────────────┬────────────┬────────────┐
│ decade ┆ name                       ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---                        ┆ ---        ┆ ---        │
│ i32    ┆ list[str]                  ┆ f64        ┆ f64        │
╞════════╪════════════════════════════╪════════════╪════════════╡
│ 1990   ┆ ["Alice"]                  ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ ["Ben", "Chloe", "Daniel"] ┆ 69.73      ┆ 1.72       │
└────────┴────────────────────────────┴────────────┴────────────┘
```
- We extract the first name of each person by string manipulation.
- The `name` column is aggregated into a list of names.
To conclude, we will look at two ways of combining data frames.
First of all we make a new data frame and then join it with the first one.
```python
df2 = pl.DataFrame(
    {
        "name": [
            "Ben Brown",
            "Daniel Donovan",
            "Alice Archer",
            "Chloe Cooper",
        ],
        "parent": [True, False, False, False],
        "siblings": [1, 2, 3, 4],
    }
)
print(df2)
```

```
shape: (4, 3)
┌────────────────┬────────┬──────────┐
│ name           ┆ parent ┆ siblings │
│ ---            ┆ ---    ┆ ---      │
│ str            ┆ bool   ┆ i64      │
╞════════════════╪════════╪══════════╡
│ Ben Brown      ┆ true   ┆ 1        │
│ Daniel Donovan ┆ false  ┆ 2        │
│ Alice Archer   ┆ false  ┆ 3        │
│ Chloe Cooper   ┆ false  ┆ 4        │
└────────────────┴────────┴──────────┘
```
We see the same four names with additional columns. Now let's join this data frame to the earlier one.
```python
print(df.join(df2, on="name", how="left"))
```

```
shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────┬──────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ parent ┆ siblings │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ bool   ┆ i64      │
╞════════════════╪════════════╪════════╪════════╪════════╪══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ false  ┆ 3        │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ true   ┆ 1        │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ false  ┆ 4        │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ false  ┆ 2        │
└────────────────┴────────────┴────────┴────────┴────────┴──────────┘
```
We get a combination of the two data frames.
There are many join strategies for dealing with, e.g., cases where a name is present in one data frame but not in the other, or where a name occurs multiple times in one or other data frame. It is worth taking the time to experiment with them.
More straightforwardly than joining, we can just concatenate data frames, horizontally or vertically.
```python
df3 = pl.DataFrame(
    {
        "name": [
            "Ethan Edwards",
            "Fiona Foster",
            "Grace Gibson",
            "Henry Harris",
        ],
        "birthdate": [
            date(1977, 5, 10),
            date(1975, 6, 23),
            date(1973, 7, 22),
            date(1971, 8, 3),
        ],
        "weight": [67.9, 72.5, 57.6, 93.1],
        "height": [1.76, 1.6, 1.66, 1.8],
    }
)
print(pl.concat([df, df3], how="vertical"))
```

```
shape: (8, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
│ Ethan Edwards  ┆ 1977-05-10 ┆ 67.9   ┆ 1.76   │
│ Fiona Foster   ┆ 1975-06-23 ┆ 72.5   ┆ 1.6    │
│ Grace Gibson   ┆ 1973-07-22 ┆ 57.6   ┆ 1.66   │
│ Henry Harris   ┆ 1971-08-03 ┆ 93.1   ┆ 1.8    │
└────────────────┴────────────┴────────┴────────┘
```
Other sources
Here are some other introductory tutorials for Polars that you might find useful.
This week's lab
- Chrestomathy
- Rosetta Code
- the book: ISLP
- practice
- pairs for this week's lab
```python
[('Jagoda', 'Giorgia'),
 ('Carlijn', 'Adrian'),
 ('Roman', 'Emilio'),
 ('Fridolin', 'Efraim'),
 ('Emil', 'Yaohong'),
 ('Matúš', 'Nora'),
 ('Natalia', 'Madalina'),
 ('Max', 'Asmahene'),
 ('Jess', 'Emilie'),
 ('Mariana', 'Romane'),
 ('Lynn', 'Louanne', 'Janina'),]
```

```
[('Jagoda', 'Giorgia'),
 ('Carlijn', 'Adrian'),
 ('Roman', 'Emilio'),
 ('Fridolin', 'Efraim'),
 ('Emil', 'Yaohong'),
 ('Matúš', 'Nora'),
 ('Natalia', 'Madalina'),
 ('Max', 'Asmahene'),
 ('Jess', 'Emilie'),
 ('Mariana', 'Romane'),
 ('Lynn', 'Louanne', 'Janina')]
```