Starting SQL: Bucketing our data with GROUP BY

Jay R. Montojo
3 min readApr 24, 2022

--

In this article let’s look at how theGROUP BY keyword works. This key word is a bit confusing in terms of what we would expect it to do. Often, beginners confuse it with ORDER BY . Check out my other article on that topic here.

GROUP BY is made more useful when combined with aggregate functions. Look out for that article!

In this multi-part article you will be doing a bit of SQL query writing so please have access to a relational database management system (i.e. PostgreSQL)!

Resources

Create a database called population_db . Copy the following SQL transactions into a .sql file or your terminal logged into your relational database manager and run.

the following data is completely fictional. Any relation to real people or genuine proprietary data is purely coincidence.

CREATE TABLE wards(
id SERIAL PRIMARY KEY,
ward VARCHAR(50)
);
INSERT INTO wards (ward)
VALUES
('Shinjuku'),
('Chiyoda'),
('Setagaya');

CREATE TABLE persons(
id SERIAL PRIMARY KEY,
family_name VARCHAR(255),
given_name VARCHAR(255),
ward_id int REFERENCES wards(id),
dob DATE
);
INSERT INTO persons (family_name, given_name, ward_id, dob)
VALUES
('Tanabe', 'Ryousuke', 1, '1983-05-12'),
('Mori', 'Miyako', 2, '1987-11-17'),
('Numata', 'Taisei', 2, '1963-02-23'),
('Tanaka', 'Zen', 3, '1991-09-01'),
('Horie', 'Mao', 1, '1980-10-13'),
('Miyamoto', 'Yuzuki', 1, '1974-09-25'),
('Toda', 'Tomohiro', 3, '1977-02-11'),
('Iida', 'Akiko', 1, '1992-01-30'),
('Saitou', 'Kenta', 3, '1985-12-29');

What is GROUP BY doing for us?

Take a look at the ward_id column. We can interpret that each person belongs to a ward. If we group people by their wards, we essentially create buckets of data. This is important and you’ll see in a bit. Visualized, it would look like the data you’ll see in the lower chart.

The bottom table looks something like we see when we use ORDER BY ward_id.

If we were to write this as a SQL query it would look like this:

SELECT ward_id
FROM persons
GROUP BY ward_id;

And the result would look something like:

Wait! how is this helpful? Maybe if I change the query and add a few more columns, I can see something closer to the spreadsheet tables above right? Let’s give that a try.

SELECT ward_id, family_name, given_name
FROM persons
GROUP BY ward_id;

We have added the family_name & given_name fields.However, if we were to run this in the hopes of getting something like:

Remember, GROUP BY is only creating buckets of data based on the ward_id so that we can use things like aggregate functions to analyze that grouped data. Unfortunately, if you were to try, don’t be surprised if you are greeted with an error message that looks like this!

column “persons.family_name” must appear in the GROUP BY clause or be used in an aggregate function

What gives? If you need to run a transaction that produces the above, you need to use the ORDER BY keyword instead. See my other article about that here!

We are getting this error because the GROUP BY clause is expecting us to use an aggregate function to draw some kind of result on each bucket with an aggregate function. This is a common problem that people often experience when working with GROUP BY .

Conclusion

GROUP BY and ORDER BY are often confused with each other which lead to the error message you encountered above. When using GROUP BY you can imagine that you are taking many records and grouping them so that you can process information from those grouped records (i.e. population in a particular town, average sales in one day or per sales person, how many books each author wrote). We can process that information using aggregate functions. Be on the lookout for that article!

--

--

Jay R. Montojo
Jay R. Montojo

Written by Jay R. Montojo

Tokyo-based full-stack developer with a background in engineering finance, audit and teaching. Plotting to disrupt the Tokyo tech scene and do good with code.

No responses yet