Counting Groups with Window Functions: #SQLNewBlogger

I looked at row_number() in a previous post. Now I want to build on that and do some counting of rows with COUNT() and the OVER clause. I’ll show how this differs a bit from a normal aggregate.

Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers. This is also part of a series on Window Functions.

The Raw Data

Let’s use AdventureWorks as a sample database. In the Sales.SalesOrderHeader table there are lots of customer orders, each with a Customer ID. I’ll limit my results to these customers, which has a spread of orders:

WHERE soh.CustomerID IN (11011, 11015, 11019, 11012)

We’ll use this WHERE clause to help us decide

Counting Orders

If I wanted to count the number of orders each customer hasplaced, I can use this code:

SELECT
   soh.CustomerID
, COUNT (*) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
   soh.CustomerID IN ( 11011, 11015, 11019, 11012 )
GROUP BY soh.CustomerID;

I see these results:

2024-10_0175

We can see that each of these customers has different numbers of orders.  If I want a simple summary, this is the best code. However, what if I want to add other columns?

Supposed I want to add not only the customer ID, but also the shipdate. I’m wondering when orders were shipped. If I use the code above, I’d need to add this in the column list, and the group by. If I do this, I get this code:

SELECT
   soh.CustomerID
, soh.ShipDate
, COUNT (*) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
   soh.CustomerID IN ( 11011, 11015, 11019, 11012 )
GROUP BY
   soh.CustomerID
, soh.ShipDate;

And this result:

2024-10_0176

The results are cut off, but they repeat with a 1 for each row. The front end can sum these, but it’s easy to make a mistake here. Especially if there are filters.

With an OVER() clause, my aggregate changes a bit. I would use this code, where I am paritioning, or essentially grouping, the data by the customer ID. I don’t have a group by, so I don’t need to add other fields in two places. Here is the code:

SELECT
   soh.CustomerID
, soh.ShipDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
   soh.CustomerID IN ( 11011, 11015, 11019, 11012 );

And the results.

2024-10_0177

I have the count repeated, but if I am showing this in the front end, I can hide that column in a table, but I can also access the total count from any row.

Differences with Group By

Let’s say I decide to add something else, like the account number and the SalesOrderNumber. If I do this in a GROUP BY, I need to add this to the column list and the group by, as shown here:

SELECT
   soh.CustomerID
, soh.AccountNumber
, soh.SalesOrderNumber
, soh.ShipDate
, COUNT (*) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
   soh.CustomerID IN ( 11011, 11015, 11019, 11012 )
GROUP BY
   soh.CustomerID
, soh.ShipDate
, soh.AccountNumber
, soh.SalesOrderNumber;
GO

The change the Window function is as shown:

SELECT
   soh.CustomerID
, soh.AccountNumber
, soh.SalesOrderNumber
, soh.ShipDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
   soh.CustomerID IN ( 11011, 11015, 11019, 11012 );

Maintenance is easy, and what’s more, I don’t have to worry about ordering my group by in different ways. If I do care about grouping, then I can alter the partitioning, but often I just want to add other columns without making the code more complex in a second place.

The results for the group by have the fields, but all 1s in the count again.

2024-10_0179

The window function has the counts for each customer.

2024-10_0178

Small things, but those points of maintenance can be annoying and they can cause problems. For complex data, the other fields might not group well, or the order of grouping might change what is shown.

There are more things to be concerned about here, but one of the big things might be a running total, where I count the orders over time. I’d like results like this:

2024-10_0180

My window function code is:

SELECT
   soh.CustomerID
, soh.AccountNumber
, soh.SalesOrderNumber
, soh.ShipDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
   soh.CustomerID IN ( 11011, 11015, 11019, 11012 );

The regular code with group by is this:

select soh1.customerid, soh1.AccountNumber, soh1.SalesOrderNumber, soh1.ShipDate
, count(*) 'running_total'
from sales.salesorderheader soh1
inner join sales.salesorderheader soh2 on soh2.salesorderid <= soh1.salesorderid
                                    and soh2.customerid = soh1.customerid
group by soh1.customerid, soh1.AccountNumber, soh1.SalesOrderNumber, soh1.ShipDate
order by soh1.customerid, soh1.ShipDate

Not bad, but the big difference is in resources used. The statistics IO below shows the difference. Can you guess which is which?

2024-10_0183

Hint, the one using less reads is the OVER() code.

SQL New Blogger

Setting up a good scenario is a little tricky, and that took time. I messed with a few data sets that helped explain this to me, and you, in a way that made sense. I have a few other scenarios, which I will write about because an OVER() isn’t magic and it might not be the right choice.

Those will be other blogs.

This is a good way to showcase that you understand part of how this works. There’s a whole series to be written on different aspects of how the OVER() clause works. Spending 15-30 minutes each time you experiment helps showcase your knowledge and learning to employers.

That’s what employers need: experimenting and learning.

Posted in Blog | Tagged , , , | 1 Comment

The Cloud Security Problem

Your management gets a great demo from a cloud vendor and decides that the organization needs to implement the new service/application/etc. quickly. Your team tries to comply, furiously learning and experimenting with integrations, software changes, infrastructure configuration, and more. Things get deployed are working. Clients and management are happy with the new capabilities and you breathe a sigh of relief.

After a bit of time there’s a security issue and all of a sudden there’s blame pouring down on everyone. The vendor takes a hit because it’s a public security problem, but the reality might be that your organization didn’t completely understand how to configure strong security. The public doesn’t blame your organization, but internally your team don’t know how to make changes to ensure future security.

That’s a bit of what happened with the Snowflake customer hacks, and a good description of some of the issues is in this piece from Joey D’Antoni. Snowflake didn’t necessarily have bad security, but they allowed customers to have bad security and also limited the options for customers to implement stronger security.

I think about this all the time as I look at the challenges of security that many organizations face. In my decades of working in different situations, the one thing I know is that pressure to move fast often creates security shortcuts. Much of the early security problems with SQL Server could easily be traced to a) allowing installs with no password, b) developers not understanding the security model and granting sysadmin (or using sa) in their applications, and c) management agreeing that this was OK because a deadline needed to be met.

In all these situations, workers had the best of intentions to go fix things later, but there was rarely the time or energy to do so. As someone who forced apps to change away from sa in a large org, and required separate accounts and passwords for servers, I can tell you no one liked me and I got a lot of pressure to leave things alone. That is until I could explain the risks to managers with security people present. Even then there was no shortage of people who wanted me to assume the risk of apps using sa and let things be.

Cloud security is pretty good from the major vendors. Most of the smaller co-location facilities I’ve worked with in the past also had good security. It was the clients that caused problems, often because of a lack of knowledge or the desire to hurry.

Joey says Entra is a great system. I agree, and I love the SSO capabilities I’ve seen implemented. I also know that trying to set things up and configure them has been difficult for me, and I think I can be pretty sharp about lots of technology. Joey has patiently answered many of my ignorant questions because I didn’t understand how some part of Entra (AAD) works. I think lots of tech people don’t understand this and don’ t necessarily have a Joey to call on.

Most of us working in technology need more security education, better habits, and the patience to implement strong security. That includes management. At the same time, I wish that the solutions out there were easier to understand, or maybe better explained for those of us who need to use some portion of the authorization and authentication systems in our software.

Unfortunately, security software is still software and those vendors push out changes and updates quickly as well, leaving education and documentation to the user. This is a bad situation that continually arises with regular issues in many organizations.

Steve Jones

Listen to the podcast at Libsyn, Spotify, or iTunes.

Note, podcasts are only available for a limited time online.

Posted in Editorial | Tagged | 2 Comments

Monday Monitor Tips: Projecting Disk Space

One of the things that many DBAs struggle with is managing space across an estate. There might be one or two servers that you watch closely, or that are a constant problem, but it’s easy to run out of space on other systems.

Redgate Monitor will alert you to low space, but do you want to keep asking SAN admins for more space every week as new systems run out of space? Or do you want to project a number for the year and get that budgeted? Learn how Redgate Monitor can help

This is part of a series of posts on Redgate Monitor. Click to see the other posts

An Estate View of Storage

The estate view of disk usage shows all your disks totaled up and the projections for growth. This looks like the image below on monitor.red-gate.com. This doesn’t seem useful, but I’ve had storage admins come to me in the past asking about budgeting for the next year. How much space will I use?

2024-09_0139

The answer is often I don’t know. I’d have to take a very wild guess for them. Now, I’d use this, and maybe pad it slightly, but across lots of systems, we can probably guess on the growth that will be needed. Any one system might not be predictable, but across lots, averages work out.

Here I can see we’re using 25TB (out of 87TB). Definitely be careful showing this to storage admins, as they often want those to be closer, ubt I can see I’m projected to use 91TB in a year, so I’d need more space. The graph above shows me running out of space in Sept/Oct of 2025.

Below this I see details from machines and their disks. These show in the same groups I see on other screens. While I can’t easily filter here, I can search from a browser, or export this information. That’s useful for checking on individual systems.

2024-09_0149

The other nice thing is I can sort the columns, and if I do this by time until full, I can see the disks that might give me problems soon.

2024-09_0154

None of the test systems are likely to be problematic soon, but certainly someone needs to budget for next year.

Summary

This is a simple screen, but one that I wish I’d have had in the past. It gives me a lot of information that I rarely need, but information that always requires a lot of effort to compile. What disks are being used by my database servers and what percentage of space is in use.

In a DBA team, this is the type of data I’d want to glance at every quarter and then make adjustments to prevents issues. Proactive DBAs want a screen just like this.

Redgate Monitor is a world class monitoring solution for your database estate. Download a trial today and see how it can help you manage your estate more efficiently.

Posted in Blog | Tagged , , , | Comments Off on Monday Monitor Tips: Projecting Disk Space

Databases for Executives

There’s an article at Forbes about the Five Things Business Leaders Should Know about Databases. Disclosure, it’s by my boss, but I think it’s still a good read. These are points we’ve learned from research and work with customers and prospects at Redgate Software. These points come from you, as well as from executives with whom we work, but there are so many people in organizations who don’t think about the complexity of data, so it’s a good one to pass along.

The five things are (if you don’t want to read): data is growing, getting more complex, there are multiple database platforms in most estates, teams struggle (duh), and data is a business issue. Most of us know about the fourth one, often because we may feel overloaded with work. We might also feel a lot of stress in trying to keep up with not only the workload but also trying to learn more to support the ever-growing variety of systems it seems our employer wants to put into production. I regularly talk with customers whose developers keep wanting to try out a new, shiny database platform in the cloud (or add new features from their existing platforms).

Wait, not try out. They’ve already deployed some production data there and now want other developers or operations people to work with their system.  Often lots of the staff isn’t familiar with the platform or feature, even the people who decided to implement it. Is anyone familiar with that situation?

The digital transformation and importance of software isn’t lost on most executives, but I find far too many that don’t place the same importance on the data that powers their software or the database platforms that support all software. Data is important, and ensuring it is available to software, protected, secured, and available is critical. That also means that the process of developing and managing the database portion of your software matters.

I think Database DevOps is important, but it’s not a panacea to buy a product. I’d love to show you what Redgate Software can do here, but the main thing I stress to clients is that you need to get your database development into the modern era and match what software developers do. That means a smooth process, with version control, that ensures you deploy changes in a way that doesn’t impact customers.

This doesn’t mean just go faster. This means bringing along data modeling, good architecture, and performance testing. Those are the same things we’ve been doing (or should have been doing) for decades, but we need to ensure these are still a part of whatever process we choose and included in any automation we implement.

Pass the article along to your management, and be sure they understand that all these points are important. Particularly the fourth point because if your staff isn’t supported and trained, the rest of the business will suffer.

Steve Jones

Listen to the podcast at Libsyn, Spotify, or iTunes.

Note, podcasts are only available for a limited time online.

Posted in Editorial | Tagged | Comments Off on Databases for Executives