Starting SQL: Bucketing our data with GROUP BY
In this article let’s look at how theGROUP BY
keyword works. This key word is a bit confusing in terms of what we would expect it to do. Often, beginners confuse it with ORDER BY
. Check out my other article on that topic here.
GROUP BY
is made more useful when combined with aggregate functions. Look out for that article!
In this multi-part article you will be doing a bit of SQL query writing so please have access to a relational database management system (i.e. PostgreSQL)!
Resources
Create a database called population_db
. Copy the following SQL transactions into a .sql
file or your terminal logged into your relational database manager and run.
the following data is completely fictional. Any relation to real people or genuine proprietary data is purely coincidence.
CREATE TABLE wards(
id SERIAL PRIMARY KEY,
ward VARCHAR(50)
);
INSERT INTO wards (ward)
VALUES
('Shinjuku'),
('Chiyoda'),
('Setagaya');
CREATE TABLE persons(
id SERIAL PRIMARY KEY,
family_name VARCHAR(255),
given_name VARCHAR(255),
ward_id int REFERENCES wards(id),
dob DATE
);
INSERT INTO persons (family_name, given_name, ward_id, dob)
VALUES
('Tanabe', 'Ryousuke', 1, '1983-05-12'),
('Mori', 'Miyako', 2, '1987-11-17'),
('Numata', 'Taisei', 2, '1963-02-23'),
('Tanaka', 'Zen', 3, '1991-09-01'),
('Horie', 'Mao', 1, '1980-10-13'),
('Miyamoto', 'Yuzuki', 1, '1974-09-25'),
('Toda', 'Tomohiro', 3, '1977-02-11'),
('Iida', 'Akiko', 1, '1992-01-30'),
('Saitou', 'Kenta', 3, '1985-12-29');
What is GROUP BY doing for us?
Take a look at the ward_id
column. We can interpret that each person belongs to a ward. If we group people by their wards, we essentially create buckets of data. This is important and you’ll see in a bit. Visualized, it would look like the data you’ll see in the lower chart.
If we were to write this as a SQL query it would look like this:
SELECT ward_id
FROM persons
GROUP BY ward_id;
And the result would look something like:
Wait! how is this helpful? Maybe if I change the query and add a few more columns, I can see something closer to the spreadsheet tables above right? Let’s give that a try.
SELECT ward_id, family_name, given_name
FROM persons
GROUP BY ward_id;
We have added the family_name
& given_name
fields.However, if we were to run this in the hopes of getting something like:
Remember, GROUP BY
is only creating buckets of data based on the ward_id
so that we can use things like aggregate functions to analyze that grouped data. Unfortunately, if you were to try, don’t be surprised if you are greeted with an error message that looks like this!
column “persons.family_name” must appear in the GROUP BY clause or be used in an aggregate function
What gives? If you need to run a transaction that produces the above, you need to use the ORDER BY
keyword instead. See my other article about that here!
We are getting this error because the GROUP BY
clause is expecting us to use an aggregate function to draw some kind of result on each bucket with an aggregate function. This is a common problem that people often experience when working with GROUP BY
.
Conclusion
GROUP BY
and ORDER BY
are often confused with each other which lead to the error message you encountered above. When using GROUP BY
you can imagine that you are taking many records and grouping them so that you can process information from those grouped records (i.e. population in a particular town, average sales in one day or per sales person, how many books each author wrote). We can process that information using aggregate functions. Be on the lookout for that article!