A SQL to Julia DataFrames dictionary using DataFrames and DataFramesMeta. In general, I use DataFramesMeta since it abstracts away some lower level nuance and it makes for a tidier workflow when I’m constructing queries.
Overview
Querying in DataFrames
and DataFramesMeta
SQL Clause | DataFrames Equivalent | DataFramesMeta Equivalent |
---|---|---|
SELECT |
select |
@select |
WHERE |
subset(df, :colnm => ByRow(x -> x>1)) |
@rsubset(df, :colnm > 1) |
LIKE |
subset(df, :colnm=>ByRow(x -> occursin(r"string", x))) |
@rsubset(df, occursin(r"string", :colnm)) |
GROUP BY |
gd = groupby(df, :colnm) combine(gd, :colnm=>sum) |
gd = groupby(df, :colnm) @combine(gd, :newcol=sum(:colnm)) or @by(df, :colnm, :newcol=sum(:colnm)) |
SUM(colnm) OVER (PARTITION BY other_colnm) AS newcol |
gd = groupby(df, :colnm) transform(gd, :colnm=>sum) |
gd = groupby(df, :colnm) @transform(gd, :newcol=sum(:colnm)) |
ORDER BY colnm ASC |
sort(df, :colnm) |
@orderby(df, :colnm) |
ORDER BY colnm DESC |
sort(df, :colnm, rev=true) |
@orderby(df, sort(:colnm, rev=true) |
Joins
SQL Clause | DataFrames Equivalent |
---|---|
INNER JOIN |
innerjoin(df, df2, on=:colnm) innerjoin(df, df2, on=[:left=>:right]) |
LEFT JOIN |
leftjoin(df, df2, on=:colnm) leftjoin(df, df2, on=[:left=>:right]) |
RIGHT JOIN |
rightjoin(df, df2, on=:colnm) rightjoin(df, df2, on=[:left=>:right]) |
OUTER JOIN |
outerjoin(df, df2, on=:colnm) outerjoin(df, df2, on=[:left=>:right]) |
SELECT * FROM table1, table2 aka cartesian product or crossjoin |
crossjoin |
A Worked Example
The Dataset
We’ll manually create a dataset of employees df
:
using CSV, DataFramesMeta, Statistics, Dates
# DataFrame(column=data)
= DataFrame(id=1:8,
df =["Michael", "Dwight", "Angela", "Jim", "Pam", "Oscar", "Meredith", "Creed"],
first_name=["Scott", "Schrute", "Martin", "Halpert", "Beesly", "Nunez", "Palmer", "Bratton"],
last_name=["Management & Admin", "Sales", "Accounting", "Sales", "Management & Admin", "Accounting",
department"Purchasing", "Purchasing"],
=[5100, 4200, 3750, 4300, 2200, 3400, 3300, 3200])
salary
8×5 DataFrame
Row │ id first_name last_name department salaryInt64 String String String Int64
│
─────┼──────────────────────────────────────────────────────────1 │ 1 Michael Scott Management & Admin 5100
2 │ 2 Dwight Schrute Sales 4200
3 │ 3 Angela Martin Accounting 3750
4 │ 4 Jim Halpert Sales 4300
5 │ 5 Pam Beesly Management & Admin 2200
6 │ 6 Oscar Nunez Accounting 3400
7 │ 7 Meredith Palmer Purchasing 3300
8 │ 8 Creed Bratton Purchasing 3200
Let’s create a sales database called db_sales
with client information (thanks to this site)
# Parse dates as Date objects
= ["1-2-2006", "1-29-2006", "2-1-2006", "2-14-2006", "3-1-2006", "3-20-2006"]
dates = parse.(Date, dates, dateformat"m-d-y")
dates
= DataFrame(id=1:6,
db_sales =dates,
transaction_date=[4, 2, 4, 2, 4, 2],
employee_id=[100, 500, 600, 200, 400, 250],
quantity=["Dunmore High School", "Harper Collins", "Blue Cross of Pennsylvania",
customer"Apex Technology", "Blue Cross of Pennsylvania",
"Stone, Cooper, and Grandy: Attorneys at Law"])
6×5 DataFrame
Row │ id transaction_date employee_id quantity customerInt64 Date Int64 Int64 String
│
─────┼───────────────────────────────────────────────────────────────────────────────────1 │ 1 2006-01-02 4 100 Dunmore High School
2 │ 2 2006-01-29 2 500 Harper Collins
3 │ 3 2006-02-01 4 600 Blue Cross of Pennsylvania
4 │ 4 2006-02-14 2 200 Apex Technology
5 │ 5 2006-03-01 4 400 Blue Cross of Pennsylvania
6 │ 6 2006-03-20 2 250 Stone, Cooper, and Grandy: Attor…
Subsetting Rows
Subsetting rows is possible in base DataFrames
, but the syntax in DataFramesMeta
is easier for beginners to follow. The special @rsubset
macro saves on having to write anonymous functions so it’s one less syntactical thing to keep typing every time.
#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))
#DataFramesMeta
@rsubset(df, occursin("Admin", :department))
2×5 DataFrame
Row │ id first_name last_name department salaryInt64 String String String Int64
│
─────┼──────────────────────────────────────────────────────────1 │ 1 Michael Scott Management & Admin 5100
2 │ 5 Pam Beesly Management & Admin 2200
Matching Text
What if you want to do some string matching with wildcards, i.e. SQL WHERE
clause with the LIKE
or %
operator?
We can use the occursin()
function and pass it as an argument to @rsubset
, like:
#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))
#DataFramesMeta
@rsubset(df, occursin("Admin", :department))
2×5 DataFrame
Row │ id first_name last_name department salaryInt64 String String String Int64
│
─────┼──────────────────────────────────────────────────────────1 │ 1 Michael Scott Management & Admin 5100
2 │ 5 Pam Beesly Management & Admin 2200
Adding the r
in front of the string lets you use regex to use wildcards and more complex string matching criteria.
Aggregation
i.e. GROUP BY Column
and SUM(column)
For regular GROUP BY
you first use groupby()
and then either combine
or @combine
, or you can use the @by
function as shorthand.
Grouping with @by
@by(df, :department,
:"Average Salary" = mean(:salary),
:count=length(:salary))
4×2 DataFrame
Row │ department Average SalaryString Float64
│
─────┼────────────────────────────────────1 │ Management & Admin 3650.0
2 │ Sales 4250.0
3 │ Accounting 3575.0
4 │ Purchasing 3250.0
Grouping with combine
and @combine
= groupby(df, :department)
gd #DataFrames
combine(gd, :salary => mean => :"Average Salary",
:department => length => :count)
#DataFramesMeta
@combine(gd, :"Average Salary" = mean(:salary),
:count = length(:department))
4×2 DataFrame
Row │ department Average SalaryString Float64
│
─────┼────────────────────────────────────1 │ Management & Admin 3650.0
2 │ Sales 4250.0
3 │ Accounting 3575.0
4 │ Purchasing 3250.0
Window Functions with @transform
In SQL:
AVG(colnm) OVER (PARTITION BY other_colnm) AS newcol
i.e. the PARTITION BY
clause, similar to groupby but it returns a value for each row in your table after doing the aggregations in each partition.
In Julia, you can use the @transform
macro to do this after grouping.
# Example: add a column that
= groupby(df, :department)
gd #DataFrames
transform(gd, :salary => mean => :"avg_dept_salary")
#DataFramesMeta
@transform(gd, :"avg_dept_salary"=mean(:salary))
8×6 DataFrame
Row │ id first_name last_name department salary avg_dept_salaryInt64 String String String Int64 Float64
│
─────┼──────────────────────────────────────────────────────────────────────────1 │ 1 Michael Scott Management & Admin 5100 3650.0
2 │ 2 Dwight Schrute Sales 4200 4250.0
3 │ 3 Angela Martin Accounting 3750 3575.0
4 │ 4 Jim Halpert Sales 4300 4250.0
5 │ 5 Pam Beesly Management & Admin 2200 3650.0
6 │ 6 Oscar Nunez Accounting 3400 3575.0
7 │ 7 Meredith Palmer Purchasing 3300 3250.0
8 │ 8 Creed Bratton Purchasing 3200 3250.0
And we see indeed that we have the same number of rows as the initial dataset.
Putting together a query with @chain
= @chain db_sales begin
sales groupby(:employee_id)
@combine(:total_quantity=sum(:quantity),
:number_of_customers=length(:customer))
end
= @chain df begin
result @select(:id, :first_name, :last_name, :department)
@rsubset(:department=="Sales")
leftjoin(sales, on=[:id=>:employee_id])
@orderby(sort(:total_quantity, rev=true)) # descending
end
2×6 DataFrame
Row │ id first_name last_name department total_quantity number_of_customersInt64 String String String Int64? Int64?
│
─────┼───────────────────────────────────────────────────────────────────────────────1 │ 4 Jim Halpert Sales 1100 3
2 │ 2 Dwight Schrute Sales 950 3
Alternative: Use DuckDB
Another amazing alternative is to use DuckDB
to query dataframes, CSV files, parquet files, etc directly with SQL.
Assuming we had the dataframes created above, we’d query it with SQL with a few lines:
using DuckDB
# create a new in-memory dabase
= DBInterface.connect(DuckDB.DB)
con
# register it as a view in the database
register_data_frame(con, df, "my_df")
DuckDB.
= """
queryStr SELECT
first_name
, last_name
, department
, salary
, AVG(salary) OVER (PARTITION BY department) AS avg_dept_salary
FROM my_df
"""
# run a SQL query over the DataFrame
= DBInterface.execute(con, queryStr)
results print(results)
# 8×5 DataFrame
# Row │ first_name last_name department salary avg_dept_salary
# │ String? String? String? Int64? Float64?
# ─────┼────────────────────────────────────────────────────────────────────
# 1 │ Angela Martin Accounting 3750 3575.0
# 2 │ Oscar Nunez Accounting 3400 3575.0
# 3 │ Michael Scott Management & Admin 5100 3650.0
# 4 │ Pam Beesly Management & Admin 2200 3650.0
# 5 │ Meredith Palmer Purchasing 3300 3250.0
# 6 │ Creed Bratton Purchasing 3200 3250.0
# 7 │ Dwight Schrute Sales 4200 4250.0
# 8 │ Jim Halpert Sales 4300 4250.0