# SQL Group By

> There's a SQL query idiom I see a lot, which I'd like to discuss. Using the classic Northwind database example, let's look at the two tables Shippers and Orders

**Date:** 2019-04-16
**Author:** Filipe Roque
**Tags:** Sql, Bug-hunting

There's a SQL query idiom I see a lot, which I'd like to discuss. Using the classic [Northwind](https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/sql/linq/downloading-sample-databases) database example, let's look at the two tables Shippers and Orders:

```sql
CREATE TABLE Shippers(
  ShipperID int IDENTITY(1,1) NOT NULL,
  CompanyName nvarchar(40) NOT NULL,
  Phone nvarchar(24) NULL,
  PRIMARY KEY (ShipperID)
)

CREATE TABLE Orders(
  OrderID int IDENTITY(1,1) NOT NULL,
  CustomerID nchar(5) NULL,
  EmployeeID int NULL,
  OrderDate datetime NULL,
  RequiredDate datetime NULL,
  ShippedDate datetime NULL,
  ShipVia int NULL,
  Freight money NULL,
  ShipName nvarchar(40) NULL,
  ShipAddress nvarchar(60) NULL,
  ShipCity nvarchar(15) NULL,
  ShipRegion nvarchar(15) NULL,
  ShipPostalCode nvarchar(10) NULL,
  ShipCountry nvarchar(15) NULL,
  PRIMARY KEY (OrderID ),
  FOREIGN KEY (ShipVia) REFERENCES Shippers(ShipperID)
)
```

If we want a listing of the number of orders for each shipper, [w3schools.com](https://www.w3schools.com/sql/sql_groupby.asp) tells us we can use:

```sql
SELECT Shippers.CompanyName, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY CompanyName;
```

But the `GROUP BY` is done by name. If two shippers exist with the same name, they'll be grouped in the same line, rather than listed separately. The query could also be done with a `RIGHT JOIN` to include suppliers with no orders.

The query should group by `ShipperID`, which is the primary key:

```sql
SELECT Shippers.CompanyName, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID;
```

Alas, this query isn't valid. SQL Server will complain with:

```bash
"Column 'Shippers.CompanyName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."
```

As the error states, `CompanyName` cannot be used like that. We can always replace `CompanyName` by `ShipperID` in the `SELECT` clause, like we did in the `GROUP BY` clause.

```sql
SELECT Shippers.ShipperID, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID;
```

But the fussy manager we're turning the query results in to doesn't care about each shipper's database identifier. No, we actually need to include the name.

One of the most frequently suggested solutions is the use of an aggregation functions, even though we know it is not needed, as there is only one `CompanyName` for each `ShipperID`. This leaves us with:

```sql
SELECT MAX(Shippers.CompanyName), COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID;
```

Here, we used `MAX()`, but we could also have used `MIN()`, or even `AVG()` if the field were numerical. Ultimately, we might even have used a [random aggregation function](https://wiki.postgresql.org/wiki/Aggregate_Random). This ambiguity is what turns me away from this solution.

There is a solution involving [functional dependencies](https://dzone.com/articles/sql-group-by-and-functional-dependencies-a-very-us). If the database supports these, it will realize that CompanyName depends on `ShipperID` and thus the second query that resulted in an error would now work. PostgreSQL supports this from version [9.1](https://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1#SQL_and_PL.2FPgSQL_features) onward. But I'm working with SQL Server so this is no good for me.

MySQL is [famous](http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html) for not reporting errors and returning random values, when using a column in the `SELECT` clause that is not in the `GROUP BY` clause, but since version `5.7.5`, it respects [functional dependencies](http://rpbouman.blogspot.com/2014/09/mysql-575-group-by-respects-functional.html).

Another frequently suggested solution is to add the `CompanyName` to the `GROUP BY`:

```sql
SELECT Shippers.CompanyName, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID, Shippers.CompanyName;
```

This works, but in a more complex query dealing with more tables it becomes hard to understand which `GROUP BY` columns are actually needed for grouping and which are there just to appease the database gods.

By the same token, if we want to extract an additional field, like say the `Phone`, we have to add it to the `SELECT` clause and the `GROUP BY` clause:

```sql
SELECT Shippers.CompanyName, Shippers.Phone, COUNT(Orders.OrderID) AS NumberOfOrders
  FROM Orders
  LEFT JOIN Shippers ON Orders.ShipVia = Shippers.ShipperID
  GROUP BY ShipperID, Shippers.CompanyName, Shippers.Phone;
```

This seems like the wrong approach. I've recently had to add a couple of columns to several similarly structured queries and it felt bad to contaminate the `GROUP BY` with additional unnecessary columns.

If the problem is mixing the `CompanyName` column in the `SELECT` and the `GROUP BY`, it can be solved by separating the `GROUP BY` into a sub-query with the desired aggregations and do whatever `JOIN`s are necessary to obtain the desired details.

```sql
SELECT Shippers.CompanyName, NumberOfOrders
  FROM (
      SELECT ShipVia, COUNT(Orders.OrderID) AS NumberOfOrders
      FROM Orders
      GROUP BY ShipVia
  ) AS OrdersSummary
  LEFT JOIN Shippers ON OrdersSummary.ShipVia = Shippers.ShipperID
```

This way, if we want to add the `Phone`, we just need to do it in the `SELECT` clause:

```sql
SELECT Shippers.CompanyName, Shippers.Phone, NumberOfOrders
  FROM (
      SELECT ShipVia, COUNT(Orders.OrderID) AS NumberOfOrders
      FROM Orders
      GROUP BY ShipVia
  ) AS OrdersSummary
  LEFT JOIN Shippers ON OrdersSummary.ShipVia = Shippers.ShipperID
```

My goal was to discuss several possible approaches to `GROUP BY` queries and highlight one I don't see recommended often enough on search engines or sites like [Stackoverflow](https://stackoverflow.com/questions/13999817/reason-for-column-is-invalid-in-the-select-list-because-it-is-not-contained-in-e), among others.
