SQL GROUP BY Clause in a SELECT Query
So far the query results are record-oriented, meaning each row in the result corresponds to one record in the table. The SQL
GROUP BY clause opens up the possibility that queries have a row in the result representing a group of records. Plus, it enables you to use grouping functions: functions that return some result () calculated in the group of records such as the sum, average or quantity.
Here is the example table. Suppose you work on a farm growing apples. The database contains a single table named apple with all the production history. The table has the following schema:
|Year||Apple Variety||Number of Trees in Production||Tons Produced||Harvest day||Price per Ton||First summer storm|
Suppose you want to obtain the total of apples produced every year. This value is not a column table
apple, but there is a
tons_produced column per apple variety. If you group all the records for the same year and sum the value in
tons_produced for all the records in the group, you obtain the total. Here is the
GROUP BY clause you would use:
SELECT year, SUM(tons_produced) AS Total_apple_produced FROM apple GROUP BY year
This query creates groups of records with the same value in the column year. The first group is for year 2020 with 3 records, the next group is 2019 with 3 records, and so on. For each group of records you apply the Aggregate Functions
SUM to all the records in the group to obtain the total tons produced each year. Here are the results of the query:
In this lesson, you combine all the concepts or clauses you have learned into a single query. You use a WHERE clause to filter records and a GROUP BY to group records in the same SELECT statement. If you want to review the WHERE clause, jump back to the lesson Using The SQL WHERE Clause With Comparison Operators .
WHERE clause is applied first to filter the records. The
GROUP BY clause is applied to the filtered records. Here is the query:
SELECT year, AVG(price_per_ton) AS avg_price_ton FROM apples WHERE EXTRACT(MONTH FROM first_summer_storm) = 7 GROUP BY year
The query uses the
EXTRACT function to obtain the month from the
first_summer_storm to filter the records from July. Once the records are filtered, you create the groups of records with the
GROUP BY year, meaning all records from the same year are in the same group. After that, the average of the
price_per_ton is calculated. Here are the results of the query:
The title of the column with the average price was renamed to
avg_price_ton by the AS clause in the
SELECT. This is a common practice when you use aggregate functions like
COUNT. You are creating a calculated value that is not coming from any column. Using the AS clause assigns the name/description in the result, as the database does not know what column name to use.
This is an important point a
SQL developer must understand to avoid a common error when using the
GROUP BY clause. After the database creates the groups of records, all the records are collapsed into groups. You can no longer refer to any individual record column in the query. In the
SELECT list, you can only refer to columns that appear in the
GROUP BY clause. The columns appearing in the group are valid because they have the same value for all the records in the group.
|Elements allowed to be in the SELECT list|
|COLUMNS used in the GROUP BY|
|Aggrupation Functions (SUM, COUNT, AVG, MAX, MIN)|
|Constants (text, numeric or date value)|
Here is an example of an invalid query:
SELECT year, first_summer_storm, -- INVALID COLUMN AVG(price_per_ton) AS avg_price_ton FROM apples WHERE EXTRACT(MONTH FROM first_summer_storm) = 7 GROUP BY year
Below is the error returned by the database:
ERROR: The column «apple.first_summer_storm» must be in GROUP BY
LINE 2: first_summer_storm, – INVALID COLUMN
As a reader exercise, find out why the column is invalid.
So far you used aggregate functions with table columns as a single parameter. Now it is time to use the functions with expressions based on several table columns.
For a more complex query, suppose you want to obtain a metric called dollar per tree. here is the math calculation:
dollar_per_tree = ( tons_produced * price_per_ton ) / number_of_trees
The dollar per tree metric is a value related to each apple variety. For example, one apple variety can produce many tons, but usually has a low market price, while other varieties produce less in terms of tons, but have a higher price. The dollar per tree metric can help you figure out the relative value of the trees.
For this example, you want to obtain the average of dollars per tree for each apple variety, plus the maximum and minimum value for the metric. You need to group by apple variety not by year. Here is the query:
SELECT apple_variety, AVG((tons_produced * price_per_ton)/number_of_trees) AS avg_dollar_per_tree, MAX((tons_produced * price_per_ton)/number_of_trees) AS max_dollar_per_tree, MIN((tons_produced * price_per_ton)/number_of_trees) AS min_dollar_per_tree FROM apple GROUP BY apple_variety
Of course the dollar per tree metric is related to the year, as
tons_produced depends on weather conditions. To see the metric for each year and each apple variety, use a
GROUP BY multi-columns:
SELECT year, apple_variety, AVG((tons_produced * price_per_ton)/number_of_trees) AS avg_dollar_per_tree, MAX((tons_produced * price_per_ton)/number_of_trees) AS max_dollar_per_tree, MIN((tons_produced * price_per_ton)/number_of_trees) AS min_dollar_per_tree FROM apple GROUP BY year, apple_variety
Looking at this result, notice several columns have the same values for
MIN() results. This is due to the groups created by the
GROUP BY having one record, resulting in the same average, maximum and minimum values.
In this lesson you learned to use the SQL
GROUP BY and aggregate functions to increase the power expressivity of the SQL SELECT statement. You know about the collapse issue, and understand you cannot reference individual records once the
GROUP BY clause is used.
If you need to access the individual record, SQL provides
WINDOW FUNCTIONS as you will learn in a later lesson. For example, if you want a report to show how much is the difference between current year
price_per_ton and average of
price_ton, you clearly need to access the individual record to obtain the current year price and the average price, you can use the SQL WINDOW FUNCTIONS.
The next lesson goes into more depth on the aggregate functions. You have already used some in this lesson such as
MAX(). Keep going, learn
SQL and increase your skills!
IN THIS PAGE