SQL GROUP BY Clause in a SELECT Query

So far the query results are record-oriented, meaning each row in the result corresponds to one record in the table. The SQL GROUP BY clause opens up the possibility that queries have a row in the result representing a group of records. Plus, it enables you to use grouping functions: functions that return some result () calculated in the group of records such as the sum, average or quantity.

Here is the example table. Suppose you work on a farm growing apples. The database contains a single table named apple with all the production history. The table has the following schema:

Year Apple Variety Number of Trees in Production Tons Produced Harvest day Price per Ton First summer storm
2020 Red Delicious 2000 102 06/23/2020 54.50 07/03/2020
2020 Magic Green 700 33 06/12/2020 62.60 07/03/2020
2020 Red Globus 500 26 05/30/2020 71.50 07/03/2020
2019 Red Delicious 1800 87 07/15/2019 52.25 07/12/2019
2019 Magic Green 500 26 06/28/2019 59.40 07/12/2019
2019 Red Globus 500 27 05/28/2019 68.00 07/12/2019
2018 Red Delicious 1800 92 07/02/2018 56.75 06/03/2018
2018 Red Globus 500 24 05/30/2018 66. 06/03/2018
2017 Red Delicious 1500 76.5 07/18/2017 51.45 07/30/2017
2016 Red Delicious 1500 72 06/26/2016 47.60 06/23/2016

Suppose you want to obtain the total of apples produced every year. This value is not a column table apple, but there is a tons_produced column per apple variety. If you group all the records for the same year and sum the value in tons_produced for all the records in the group, you obtain the total. Here is the GROUP BY clause you would use:


SELECT year,
      SUM(tons_produced) AS Total_apple_produced
FROM  apple
GROUP BY year

This query creates groups of records with the same value in the column year. The first group is for year 2020 with 3 records, the next group is 2019 with 3 records, and so on. For each group of records you apply the Aggregate Functions SUM to all the records in the group to obtain the total tons produced each year. Here are the results of the query:

Year Total_apple_produced
2020 161
2019 140
2018 116
2017 76.5
2016 72

Using SQL WHERE and SQL GROUP BY in a SELECT

In this lesson, you combine all the concepts or clauses you have learned into a single query. You use a WHERE clause to filter records and a GROUP BY to group records in the same SELECT statement. If you want to review the WHERE clause, jump back to the lesson Using The SQL WHERE Clause With Comparison Operators .

The WHERE clause is applied first to filter the records. The GROUP BY clause is applied to the filtered records. Here is the query:


SELECT year, 
       AVG(price_per_ton) AS avg_price_ton
FROM   apples
WHERE  EXTRACT(MONTH FROM first_summer_storm) = 7
GROUP  BY year 

The query uses the EXTRACT function to obtain the month from the first_summer_storm to filter the records from July. Once the records are filtered, you create the groups of records with the GROUP BY year, meaning all records from the same year are in the same group. After that, the average of the price_per_ton is calculated. Here are the results of the query:

Year avg_price_ton
2020 62.86
2019 59.88
2017 51.45

The title of the column with the average price was renamed to avg_price_ton by the AS clause in the SELECT. This is a common practice when you use aggregate functions like AVG, SUM or COUNT. You are creating a calculated value that is not coming from any column. Using the AS clause assigns the name/description in the result, as the database does not know what column name to use.

SQL GROUP BY Collapse of Records Effect

This is an important point a SQL developer must understand to avoid a common error when using the GROUP BY clause. After the database creates the groups of records, all the records are collapsed into groups. You can no longer refer to any individual record column in the query. In the SELECT list, you can only refer to columns that appear in the GROUP BY clause. The columns appearing in the group are valid because they have the same value for all the records in the group.

Elements allowed to be in the SELECT list
COLUMNS used in the GROUP BY
Aggrupation Functions (SUM, COUNT, AVG, MAX, MIN)
Constants (text, numeric or date value)

Here is an example of an invalid query:


SELECT year, 
       first_summer_storm, -- INVALID COLUMN
       AVG(price_per_ton) AS avg_price_ton
FROM   apples
WHERE  EXTRACT(MONTH FROM first_summer_storm) = 7
GROUP  BY year

Below is the error returned by the database:

ERROR: The column «apple.first_summer_storm» must be in GROUP BY
LINE 2: first_summer_storm, – INVALID COLUMN

As a reader exercise, find out why the column is invalid.

SQL Aggregate Functions With Expressions

So far you used aggregate functions with table columns as a single parameter. Now it is time to use the functions with expressions based on several table columns.
For a more complex query, suppose you want to obtain a metric called dollar per tree. here is the math calculation:

dollar_per_tree = ( tons_produced * price_per_ton ) / number_of_trees

The dollar per tree metric is a value related to each apple variety. For example, one apple variety can produce many tons, but usually has a low market price, while other varieties produce less in terms of tons, but have a higher price. The dollar per tree metric can help you figure out the relative value of the trees.
For this example, you want to obtain the average of dollars per tree for each apple variety, plus the maximum and minimum value for the metric. You need to group by apple variety not by year. Here is the query:


SELECT apple_variety, 
       AVG((tons_produced * price_per_ton)/number_of_trees) AS avg_dollar_per_tree,
       MAX((tons_produced * price_per_ton)/number_of_trees) AS max_dollar_per_tree,
       MIN((tons_produced * price_per_ton)/number_of_trees) AS min_dollar_per_tree
FROM   apple
GROUP  BY apple_variety

The result:

apple_variety avg_dollar_per_tree max_dollar_per_tree min_dollar_per_tree
Red Delicious 2.62 2.90 2.28
Magic Green 3.01 3.08 2.95
Red Globus 3.51 3.71 3.16

Of course the dollar per tree metric is related to the year, as tons_produced depends on weather conditions. To see the metric for each year and each apple variety, use a GROUP BY multi-columns:


SELECT year,
       apple_variety, 
       AVG((tons_produced * price_per_ton)/number_of_trees) AS avg_dollar_per_tree,
       MAX((tons_produced * price_per_ton)/number_of_trees) AS max_dollar_per_tree,
       MIN((tons_produced * price_per_ton)/number_of_trees) AS min_dollar_per_tree
FROM   apple
GROUP  BY year, apple_variety

The result:

year apple_variety avg_dollar_per_tree max_dollar_per_tree min_dollar_per_tree
2020 Red Delicious 2.77 2.77 2.77
2020 Magic Green 2.95 2.95 2.95
2020 Red Globus 3.71 3.71 3.71
2019 Red Delicious 2.52 2.52 2.52
2019 Magic Green 3.08 3.08 3.08
2019 Red Globus 3.67 3.67 3.67
2018 Red Delicious 2.90 2.90 2.90
2018 Red Globus 3.16 3.16 3.16
2017 Red Delicious 2.62 2.62 2.62
2016 Red Delicious 2.28 2.28 2.28

Looking at this result, notice several columns have the same values for AVG(), MAX() and MIN() results. This is due to the groups created by the GROUP BY having one record, resulting in the same average, maximum and minimum values.

Closing Words

In this lesson you learned to use the SQL GROUP BY and aggregate functions to increase the power expressivity of the SQL SELECT statement. You know about the collapse issue, and understand you cannot reference individual records once the GROUP BY clause is used.

If you need to access the individual record, SQL provides WINDOW FUNCTIONS as you will learn in a later lesson. For example, if you want a report to show how much is the difference between current year price_per_ton and average of price_ton, you clearly need to access the individual record to obtain the current year price and the average price, you can use the SQL WINDOW FUNCTIONS.

The next lesson goes into more depth on the aggregate functions. You have already used some in this lesson such as AVG(), SUM(), MIN() and MAX(). Keep going, learn SQL and increase your skills!