SQL Group By

There's a SQL query idiom I see a lot, which I'd like to discuss. Using the classic Northwind database example, let's look at the two tables Shippers and Orders:

CREATE TABLE Shippers(
  ShipperID int IDENTITY(1,1) NOT NULL,
  CompanyName nvarchar(40) NOT NULL,
  Phone nvarchar(24) NULL,
  PRIMARY KEY (ShipperID)
)

CREATE TABLE Orders(
  OrderID int IDENTITY(1,1) NOT NULL,
  CustomerID nchar(5) NULL,
  EmployeeID int NULL,
  OrderDate datetime NULL,
  RequiredDate datetime NULL,
  ShippedDate datetime NULL,
  ShipVia int NULL,
  Freight money NULL,
  ShipName nvarchar(40) NULL,
  ShipAddress nvarchar(60) NULL,
  ShipCity nvarchar(15) NULL,
  ShipRegion nvarchar(15) NULL,
  ShipPostalCode nvarchar(10) NULL,
  ShipCountry nvarchar(15) NULL,
  PRIMARY KEY (OrderID ),
  FOREIGN KEY (ShipVia) REFERENCES Shippers(ShipperID)
)

If we want a listing of the number of orders for each shipper, w3schools.com tells us we can use:

SELECT Shippers.CompanyName, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY CompanyName;

But the GROUP BY is done by name. If two shippers exist with the same name, they'll be grouped in the same line, rather than listed separately. The query could also be done with a RIGHT JOIN to include suppliers with no orders.

The query should group by ShipperID, which is the primary key:

SELECT Shippers.CompanyName, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID;

Alas, this query isn't valid. SQL Server will complain with:

"Column 'Shippers.CompanyName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."

As the error states, CompanyName cannot be used like that. We can always replace CompanyName by ShipperID in the SELECT clause, like we did in the GROUP BY clause.

SELECT Shippers.ShipperID, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID;

But the fussy manager we're turning the query results in to doesn't care about each shipper's database identifier. No, we actually need to include the name.

One of the most frequently suggested solutions is the use of an aggregation functions, even though we know it is not needed, as there is only one CompanyName for each ShipperID. This leaves us with:

SELECT MAX(Shippers.CompanyName), COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID;

Here, we used MAX(), but we could also have used MIN(), or even AVG() if the field were numerical. Ultimately, we might even have used a random aggregation function. This ambiguity is what turns me away from this solution.

There is a solution involving functional dependencies. If the database supports these, it will realize that CompanyName depends on ShipperID and thus the second query that resulted in an error would now work. PostgreSQL supports this from version 9.1 onward. But I'm working with SQL Server so this is no good for me.

MySQL is famous for not reporting errors and returning random values, when using a column in the SELECT clause that is not in the GROUP BY clause, but since version 5.7.5, it respects functional dependencies.

Another frequently suggested solution is to add the CompanyName to the GROUP BY:

SELECT Shippers.CompanyName, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID, Shippers.CompanyName;

This works, but in a more complex query dealing with more tables it becomes hard to understand which GROUP BY columns are actually needed for grouping and which are there just to appease the database gods.

By the same token, if we want to extract an additional field, like say the Phone, we have to add it to the SELECT clause and the GROUP BY clause:

SELECT Shippers.CompanyName, Shippers.Phone, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID, Shippers.CompanyName, Shippers.Phone;

This seems like the wrong approach. I've recently had to add a couple of columns to several similarly structured queries and it felt bad to contaminate the GROUP BY with additional unnecessary columns.

If the problem is mixing the CompanyName column in the SELECT and the GROUP BY, it can be solved by separating the GROUP BY into a sub-query with the desired aggregations and do whatever JOINs are necessary to obtain the desired details.

SELECT Shippers.CompanyName, NumberOfOrders
  FROM (
      SELECT ShipVia, COUNT(Orders.OrderID) AS NumberOfOrders
      FROM Orders
      GROUP BY ShipVia
  ) AS OrdersSummary
  LEFT JOIN Shippers ON OrdersSummary.ShipVia = Shippers.ShipperID

This way, if we want to add the Phone, we just need to do it in the SELECT clause:

SELECT Shippers.CompanyName, Shippers.Phone, NumberOfOrders
  FROM (
      SELECT ShipVia, COUNT(Orders.OrderID) AS NumberOfOrders
      FROM Orders
      GROUP BY ShipVia
  ) AS OrdersSummary
  LEFT JOIN Shippers ON OrdersSummary.ShipVia = Shippers.ShipperID

My goal was to discuss several possible approaches to GROUP BY queries and highlight one I don't see recommended often enough on search engines or sites like Stackoverflow, among others.

An empty and long hall at Premium Minds' office. There's a small lounge, with couches and a small table with a plant on it, and multiple doors and windows to the various development teams' working areas.

You're welcome to drop by anytime!

Av. 5 de Outubro 125 Galerias 1050-052 Lisboa
Check on Google Maps
Premium Minds on Google Maps
Back to top