Offloading Database Backups to Secondary Replicas (Should I or Shouldn’t I?)

With Availability Groups, there has long been a common recommendation across the industry to offload your database backups to a secondary replica. It sounds great in theory. I know people that have done this. I also know people that “thought” they were doing this. But the question is, should you be doing this?

Backup Types available on Secondary Replicas

First, let’s talk about the backup types available on secondary replicas. You will have access to take copy-only full backups or regular log backups (no copy-only here). You do not have the ability to take differential backups, or concurrent backups across replicas. That last part means that if full backups are running on one replica, you cannot simultaneously take log backups on a different replica. This can be a deal breaker for many if you’re using differential backups, as those must be taken on your primary replica, along with the full backups they are based on. This is because the secondary database is read-only, and the backup job does not have access to update the differential bitmap within the database.
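To make that concrete, here is a minimal sketch of the two backup types a secondary replica will accept; the database name and backup paths are placeholders:

    -- Copy-only full backup: the only full backup type allowed on a secondary replica
    BACKUP DATABASE [YourDatabase]
        TO DISK = N'X:\Backups\YourDatabase_COPY_ONLY.bak'
        WITH COPY_ONLY, CHECKSUM, COMPRESSION;

    -- Regular log backup: allowed on a secondary replica (copy-only log backups are not)
    BACKUP LOG [YourDatabase]
        TO DISK = N'X:\Backups\YourDatabase.trn'
        WITH CHECKSUM, COMPRESSION;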

Controlling where your backups run

When you configure Availability Groups, the wizard presents you with the option to choose where your backups will run. Personally, I wish Microsoft would remove this from the wizard, as I feel it adds a layer of confusion around how this actually works. The confusion comes in around how this setting is actually applied. One would think that if you set this value, your backups will always run on the replica of your choice. That is not true at all. Outside of the Microsoft Maintenance Plans for backups, the only way this setting is ever applied is if your backup jobs explicitly call the system function sys.fn_hadr_backup_is_preferred_replica ( ‘dbname’ ) and take action based on the result. I’ve seen it far too many times where someone says “Oh yeah, my backups are running on my secondary”, only to find out all they did was configure the setting in the Availability Group wizard, but they were using a custom backup job that made no reference to the backup preference function.
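For illustration, here is a minimal sketch of the check a custom backup job would need to run on each replica; the database name and backup path are placeholders:

    -- Only back up on the replica that the AG's backup preference points at
    IF sys.fn_hadr_backup_is_preferred_replica(N'YourDatabase') = 1
    BEGIN
        BACKUP DATABASE [YourDatabase]
            TO DISK = N'X:\Backups\YourDatabase_COPY_ONLY.bak'
            WITH COPY_ONLY, CHECKSUM, COMPRESSION;
    END
    ELSE
    BEGIN
        PRINT 'Not the preferred backup replica for YourDatabase - skipping.';
    END

Scheduled on every replica, this job runs everywhere, but only the preferred replica actually takes the backup.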

Log backups and Secondary Replicas

Another concept many people fail to realize is that regardless of which replica you take a log backup on, it becomes part of the log chain that will be required in the event you need to restore your transaction logs. This is due to the log chain being maintained across all replicas in an Availability Group. This can be a blessing and a curse. The good news: if you take log backups on a secondary replica, your primary replica becomes aware of them, and will be able to truncate its transaction log as those backups are completed on the secondary replica. In theory, this sounds great.

I once found a SQL Server Availability Group running a 3rd party backup tool. This tool performed various backup types based on the current state of the database and the backup set. The backup solution was taking full, diff and log backups on the primary replica. However, there was a diff backup job that ran once a day on the secondary replica, presumably for the event the replica was ever to become the primary. Every night it would run, the backup solution would see that it could not take a diff backup on a secondary, and would decide on the fly to take a log backup instead. Fast forward to the next time a restore was needed. That random log backup that the diff backup job was taking on the secondary replica was now needed to fully complete the restore.

Integrity Checks

One might ask, “but what do integrity checks have to do with offloading backups to a secondary replica?”. Well, for starters, you should be taking your backups wherever you are running your integrity checks. Honestly, you should be running integrity checks on all replicas, but that’s a topic for a different day. Let’s suppose for a moment you’re running integrity checks on your primary replica and taking backups on your secondary replica. Your databases run for months, and corruption occurs on the secondary replica during that time. But you have no idea corruption has occurred there, because you’re only running integrity checks on your primary replica. Then one day, someone drops a table in production. The database is in an Availability Group, so that table is now gone on all replicas. Your only option to get it back is to restore that corrupted backup from your secondary replica.

This by itself does not mean you shouldn’t run backups on your secondary replicas. It just means you need to be aware that if you do run them there, you must make sure you are also running integrity checks there.
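If you do back up on a secondary, something as simple as the following, scheduled on that same replica, closes that gap; the database name is a placeholder:

    -- Run the integrity check on the same replica that produces the backups
    DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;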

Backup Preference Options

Ok, so let’s assume you’ve taken everything above into account and you’re referencing sys.fn_hadr_backup_is_preferred_replica ( ‘dbname’ ) in your backup jobs. Which backup preference do you choose? Your options are Prefer Secondary, Secondary Only, Primary and Any Replica. Most of these are pretty self-explanatory. The one I really want to focus on here is Secondary Only. This option can be truly risky. Whereas Prefer Secondary can fall back to the primary replica if no secondary replica is available to take a backup, with Secondary Only your backups will simply not run if there is no secondary replica available to run a backup. If you still choose to run backups on a secondary replica, I would avoid this option entirely.
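For reference, the preference itself is just a property of the Availability Group, and you can view or change it outside the wizard; a small sketch, with the AG name as a placeholder:

    -- View the current backup preference for each Availability Group
    SELECT name, automated_backup_preference_desc
    FROM sys.availability_groups;

    -- Change it; valid values are PRIMARY, SECONDARY_ONLY, SECONDARY (prefer secondary) and NONE (any replica)
    ALTER AVAILABILITY GROUP [YourAg]
        SET (AUTOMATED_BACKUP_PREFERENCE = SECONDARY);

Remember, though, this setting only matters if your backup jobs actually check sys.fn_hadr_backup_is_preferred_replica.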

So, should you take backups on your secondary replicas?

My opinion is No! If the reasons above were not enough to convince you of the same, let me add some additional context. What’s the number one way to lose your job as a DBA? Data loss! So, let’s say you’re running database backups on your secondary replicas. There are requirements to be met before a backup can run on a secondary replica. The secondary must be able to communicate with the primary replica, and the database must be in a SYNCHRONIZED or SYNCHRONIZING state. If those conditions are not met, your backup will not run. Any latency in your data synchronization must also be added to your RPO. If synchronization is behind by 5 minutes, and you restore a backup from 15 minutes ago expecting to meet a 15 minute RPO, you may not realize that you’re really only restoring data from 20 minutes ago.
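A quick way to see where a replica stands before trusting a backup taken there is to check the built-in DMVs; a sketch:

    -- Synchronization state and how far each secondary database is behind
    SELECT  ag.name                  AS ag_name,
            ar.replica_server_name,
            DB_NAME(drs.database_id) AS database_name,
            drs.synchronization_state_desc,
            drs.log_send_queue_size, -- KB of log not yet sent to the secondary
            drs.redo_queue_size      -- KB of log not yet redone on the secondary
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
    JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id;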

Oh, right now is probably a good time to call out that synchronization can still fall behind, even when replicas are in synchronous commit! Be aware of this, and do not just assume you are 100% protected by using synchronous commit.

Worse, say your full backups run overnight, once a week, at the same time maintenance is occurring. After all, we typically perform maintenance on our secondary replicas, thinking there is no impact to production. For the past three weeks, a combination of Windows and SQL Server patches were applied during maintenance windows, and the replica was rebooted while the backups were running. You only keep two weeks of backups online, and it’ll take some time to get older backups from offsite storage. I’ll just say, that’s not a situation I would want to be in.

Bottom line, if you want to confirm you have valid backups, you must take them on your primary replica and test them often. If your primary replica does not have enough resources to run backups concurrently with an active workload, you have a resource issue, not a backup preference issue.

Imposter Syndrome

I started my blog to push the boundaries of what I believe to be true. I’ve been very successful throughout my career as a DBA, regardless of the role I’ve held. It’s something that’s always come easy to me, and most concepts just clicked from day one. Time after time, a problem would arise, and I’d manage to solve it.

However, I realized in late 2021 my knowledge was no longer being challenged in the way I would like through traditional methods. If I am going to get better at what I do, I need to take risks. I need to put my thoughts out in a public forum where those smarter than me could challenge my ideas and challenge me to be better.

Imposter syndrome. The idea that once I step out from behind my mask, I may find that I’m not nearly as good of a DBA as I once thought I was. I’m going to be challenged by those with decades more experience than me. The “real” DBAs. But that’s ok with me. If I want to get better, I need to be willing to be wrong and learn from it.

I’ve committed to three things for the new year: read three blog posts a day from respected sources, write one blog post per week, and start engaging on social media.

I don’t know where this journey will take me by the end of 2022. But I’m ready to take off the mask and find out. My challenge to you. Step out of your comfort zone. Find something that pushes you to be better at what you enjoy. If we all continue to stay in our comfort zones, we’ll all continue to be average.

Lastly, if you see me post something you believe to be inaccurate, call me out and let’s correct it. I welcome that opportunity. In fact, I encourage it; that’s why I’m here.

How large can a single data file grow?

To be honest, this is not something I ever thought I needed to know. Until one day, I found a database where transactions were failing due to ‘PRIMARY’ filegroup is full errors. So, I did what any DBA would have done, and checked if auto-growth was enabled. It was. So, why was the file no longer growing?
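For anyone curious, that check is a quick query against sys.database_files, run in the context of the affected database:

    -- growth = 0 means auto-growth is disabled; is_percent_growth shows how growth is expressed
    SELECT name, type_desc, size, growth, is_percent_growth, max_size
    FROM sys.database_files;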

The answer wasn’t immediately clear, but I had a good idea what the issue was.

If we look in ‘sys.master_files’, we can see that ‘size’ refers to the number of 8k pages in the data file and is stored as a data type of INT.

If we do the math, the maximum value of an INT is 2,147,483,647, which means a single data file can contain at most 2,147,483,647 8k pages. That’s a lot of pages, but how much space would that be on disk?

Let’s do a little more math. 2,147,483,647 x 8 equals 17,179,869,176 KB. If we break that down further, it’s 16TB. So, there’s your answer, a data file cannot grow past 16TB.
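You can let SQL Server do the same math, and then compare your current data files against that ceiling; a simple sketch:

    -- The INT page limit expressed in KB, GB and TB
    SELECT CAST(2147483647 AS bigint) * 8                             AS max_kb,
           CAST(2147483647 AS bigint) * 8 / 1024.0 / 1024.0           AS max_gb,
           CAST(2147483647 AS bigint) * 8 / 1024.0 / 1024.0 / 1024.0  AS max_tb;

    -- Current size of every data file, in GB, so anything creeping toward 16TB stands out
    SELECT DB_NAME(database_id) AS database_name,
           name,
           CAST(size AS bigint) * 8 / 1024 / 1024 AS size_gb
    FROM sys.master_files
    WHERE type_desc = 'ROWS'
    ORDER BY size_gb DESC;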

Any ideas how big my data file was? You guessed it… 16TB!

That’s a lot of space, and hopefully you don’t have to manage a data file that size. But be aware that if you have a data file nearing that limit and it’s the only data file configured, new writes to the database will fail once the file hits this size. If you’re in this scenario, you’ll want to go ahead and add an additional data file before you hit this limit.
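Adding a second file is straightforward; a sketch, with the database name, logical name, and path all placeholders:

    -- Add a second data file so writes can continue as the first file nears 16TB
    ALTER DATABASE [YourDatabase]
        ADD FILE (
            NAME       = N'YourDatabase_Data2',
            FILENAME   = N'X:\Data\YourDatabase_Data2.ndf',
            SIZE       = 64GB,
            FILEGROWTH = 8GB
        ) TO FILEGROUP [PRIMARY];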

What I’m looking forward to in SQL Server 2022

It’s been several weeks since Microsoft started releasing details on SQL Server 2022. Here are two of the new features I’m most interested in.

Parameter Sensitive Plan Optimization

Parameter Sniffing has been plaguing SQL Server for years, and it sounds like Microsoft is finally putting some real effort into fixing the problem. Traditionally, when a query is executed, SQL Server will build and cache one plan. It does this based on the first set of parameters used during the initial execution. All subsequent executions of the same query, regardless of the parameters, will use the exact same execution plan. This is not always ideal, as one set of parameters may perform better with an index seek, while another may perform better with an index scan. Starting in SQL Server 2022, Microsoft is planning to solve this problem with Parameter Sensitive Plan Optimization. From what they’ve demonstrated, this would work by having SQL Server cache multiple plans for the same query and decide which one is more appropriate for a given set of parameters. There are not many details available yet, and several questions are yet to be answered, such as how many plans might get cached per query. But in general, I’m still excited to see how this performs in the real world.
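If you haven’t run into parameter sniffing before, here’s the classic shape of the problem; the table and procedure names are made up purely for illustration:

    -- Hypothetical procedure: the first execution determines the cached plan shape
    CREATE OR ALTER PROCEDURE dbo.GetOrdersByCustomer
        @CustomerID int
    AS
    BEGIN
        SELECT OrderID, OrderDate, TotalDue
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID;
    END;
    GO

    -- If customer 42 matches a handful of rows, a seek-shaped plan is cached;
    -- a later call for a customer with millions of rows reuses that same plan.
    EXEC dbo.GetOrdersByCustomer @CustomerID = 42;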

Bi-directional Failover between SQL Server 2022 and Azure SQL Managed Instance

This is another intriguing feature. For years, I’ve been cautious about moving too aggressively into Azure SQL. Aside from operational cost over time, my reasoning was largely related to the fact that once you had your data there, it was much more difficult to get it back out. This is because Azure SQL is always running the newest release of SQL Server, and downgrading to an on-premises SQL Server version was not fully supported. There were ways around it, but this may be the most streamlined way thus far. This will also be great news for those looking to run SQL Server on-premises, but who are not able to manage a full DR region. With SQL Server 2022, you’ll now be able to add an Azure SQL Managed Instance as a failover target. Once you enable this, Microsoft will set up a Distributed Availability Group behind the scenes between your on-premises instance and your Azure SQL Managed Instance. After setup is complete, you’ll then be able to freely fail back and forth between Azure and on-premises.

What about you, what feature are you most looking forward to in SQL Server 2022?

Your first priority as a Platform DBA

You’ve taken a new job! You’re now responsible for 50+ SQL Server instances, maybe more. What do you plan to do in your first 30, 60 or 90 days? Make sure backups are running? Make sure consistency checks are running? Implement maintenance routines, if none already exist? Those are all great ideas and things you should consider doing sooner rather than later. However, there is one thing you’ll want to do first that is far more important than those or any other typical database administration task you might be thinking of.

Many of us have been there. You think you’re monitoring your entire SQL estate. You think you have all your backups properly configured. Then one day, you get a call: an application is offline. But wait, you’ve never even heard of this application. You have no idea what SQL Server is running behind it. Finally, with some help from a senior developer, you track down the SQL Server. You log on. Then, it hits you. You realize backups have not run successfully for months and a database is in a suspect state.

Now, that’s not a situation I wish upon anyone. But let’s face it, we often inherit a mess when taking over a new SQL Server environment. However, the phrase “I didn’t know about that server” is not typically going to take you very far in this industry, nor with your boss. After all, as a DBA, it’s your job to know where all your SQL Servers are.

So, what’s your first priority as a DBA? You’ve probably guessed it by now. The first thing you’ll want to do is get an accurate list of your SQL Server inventory, preferably using automated collection with a regular refresh. Because, let’s face it, people love to go rogue and randomly install SQL Server instances for some side project they’re working on, without telling anyone. Then suddenly it becomes your problem one day when it breaks.

So, before you run off and start changing configuration options, changing backup schedules, tuning queries, or whatever that shiny object is sitting in front of you, I want you to go get an accurate list of your inventory. One that you know you can trust. Sure, this is probably the least interesting part of our job. But it’s a must, and it’ll pay dividends in the end.

It’s not going to be easy. People are going to try to sidetrack you left and right with new shiny objects. Everyone always thinks their problem is the top priority, and sometimes it might be. That’s for you to assess on a case-by-case basis. But if you’re not diligent, you’ll get 6, 12 or even 18 months into your new job, and still not have an accurate inventory list.

Once you finally have an accurate inventory list, you can start tackling your tech debt. Because at this point, you can trust that once you start reconfiguring backups, integrity checks, or monitoring, that you’ve covered your entire SQL estate. At this point, the chances of finding a random SQL Server sitting under someone’s desk are slim.

There are many tools out there to help you find and collect your SQL inventory, and I won’t go into detail about which one to use or how to use it. Here are a few common ones, in no particular order; a small sketch of the kind of detail worth capturing per instance follows the list. Most importantly, pick one you feel comfortable with and get started.

  • dbatools Find-DbaInstance
  • Idera SQL Discovery
  • Microsoft MAP Toolkit
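Whatever does the discovery, the goal is a regularly refreshed record of basic facts per instance; a minimal, hypothetical example of the properties worth recording:

    -- Basic per-instance facts worth storing in an inventory table
    SELECT  SERVERPROPERTY('MachineName')    AS machine_name,
            SERVERPROPERTY('InstanceName')   AS instance_name,
            SERVERPROPERTY('ProductVersion') AS product_version,
            SERVERPROPERTY('ProductLevel')   AS product_level,
            SERVERPROPERTY('Edition')        AS edition,
            SERVERPROPERTY('IsHadrEnabled')  AS is_hadr_enabled,
            SERVERPROPERTY('IsClustered')    AS is_clustered;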

Please leave a comment with the tool of your choice, or let me know if there is one you think should make the list.

Lastly, where do you feel inventory collection falls on a DBA’s priorities list?

Stop using the OR operator in UPDATES

Something I see regularly is developers trying to write stored procedures to service multiple update scenarios. They typically accomplish this by adding multiple input parameters to the stored procedure and then using an OR operator in the update statement to match whichever parameter a value was provided for. In many scenarios this works okay and no one ever notices the performance hit this type of stored procedure takes. That is, until the table being updated grows to millions of rows or there are joins involved. Suddenly that update that used to take less than one second is now taking 10-20 seconds.

For my examples, I’ll be using the AdventureWorks 2019 OLTP database. This can be downloaded from Microsoft for anyone interested in trying this themselves. First, we’re going to give Janice a few extra hours of vacation time. Below is the statement we’ll use.
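The original statement isn’t reproduced here, so the following is a reconstruction of the pattern being described against HumanResources.Employee; the LoginID value used for Janice is an assumption:

    DECLARE @LoginID          nvarchar(256) = N'adventure-works\janice0', -- assumed value for Janice
            @NationalIDNumber nvarchar(15)  = NULL;

    -- One statement trying to service two lookup scenarios with an OR
    UPDATE HumanResources.Employee
    SET    VacationHours = VacationHours + 8
    WHERE  LoginID = @LoginID
       OR  NationalIDNumber = @NationalIDNumber;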

Now, before we execute that, let’s check what indexes are on the table.
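There are several ways to check; a quick sketch against the catalog views:

    -- List the indexes on HumanResources.Employee and their key columns
    SELECT  i.name AS index_name,
            i.type_desc,
            c.name AS key_column
    FROM sys.indexes AS i
    JOIN sys.index_columns AS ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
    JOIN sys.columns       AS c  ON c.object_id  = ic.object_id AND c.column_id = ic.column_id
    WHERE i.object_id = OBJECT_ID(N'HumanResources.Employee')
    ORDER BY i.name, ic.key_ordinal;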

We can see there is a NONCLUSTERED index on LoginID, and another on NationalIDNumber. Now, how is SQL Server going to use these indexes when we attempt to update Janice’s vacation hours? After all, we’re providing a value for LoginID, right?

Wow, we actually did a CLUSTERED index scan instead of a NONCLUSTERED seek, and read 290 rows before we found the 1 row we were looking for. Again, these are not significant numbers given the size of the database we’re testing in. But these numbers can escalate quickly in a much larger database.

But why didn’t SQL Server use our NONCLUSTERED index? Consider this: we don’t have an index with a single key that covers both columns in our WHERE clause. Further, because of the OR operator, SQL Server has to account for rows where either column matches, so neither index alone can find all qualifying rows. So, the best case scenario would be for SQL Server to do an Index Seek on both NONCLUSTERED indexes and join the results. However, that would be pretty costly, so SQL Server decided it would rather do a CLUSTERED index scan in this scenario.

Now, do we think the result will be any different if we pass in NationalIDNumber instead? Let’s try.
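Again a reconstruction, with the parameters swapped; the NationalIDNumber value is just a placeholder:

    DECLARE @LoginID          nvarchar(256) = NULL,
            @NationalIDNumber nvarchar(15)  = N'123456789'; -- placeholder value

    UPDATE HumanResources.Employee
    SET    VacationHours = VacationHours + 8
    WHERE  LoginID = @LoginID
       OR  NationalIDNumber = @NationalIDNumber;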

No, we get the same execution plan. Now, the estimates are still good in this plan. SQL Server expected 1 row and found 1 row. SQL Server only had to read 290 rows before finding that 1 row.

Now, let’s add another layer of complexity that I’ve seen a few times. Developers think that by splitting the WHERE clause into two ANDed predicates, each with its own OR operator embedded in the middle, they can avoid the above issues. This code usually looks like this.
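The exact code isn’t preserved here, but the pattern described is the familiar catch-all shape, roughly:

    -- "Catch-all" variant: each predicate is short-circuited with an IS NULL check
    UPDATE HumanResources.Employee
    SET    VacationHours = VacationHours + 8
    WHERE  (LoginID = @LoginID OR @LoginID IS NULL)
      AND  (NationalIDNumber = @NationalIDNumber OR @NationalIDNumber IS NULL);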

So, if we execute this statement, does SQL Server have a better understanding of which index it can use?

No, the plan actually gets worse. Now SQL Server is expecting to find 55 rows and only finding 1. This may not seem like a big deal, but SQL Server reserves memory based on the estimated row count before it begins to execute the query. In this case, SQL Server is setting aside space for the expected 55 rows, instead of just enough space for the one row it will actually find.

Now, let’s try this update a different way. We’re going to write an IF statement to separate out the possible combinations of parameters that may be populated at execution time. Here is what that update statement would look like.
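Again a reconstruction of the shape being described, splitting each parameter combination into its own statement:

    IF @LoginID IS NOT NULL
    BEGIN
        UPDATE HumanResources.Employee
        SET    VacationHours = VacationHours + 8
        WHERE  LoginID = @LoginID;
    END
    ELSE IF @NationalIDNumber IS NOT NULL
    BEGIN
        UPDATE HumanResources.Employee
        SET    VacationHours = VacationHours + 8
        WHERE  NationalIDNumber = @NationalIDNumber;
    END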

Now, how will this statement perform?

Wow, this actually produced the Index Seek we’ve been looking for. Better yet, SQL Server expected to read 1 row and actually read only 1 row. It doesn’t get much better than that. By writing our update statement this way, we’ve allowed SQL Server to build an execution plan for the exact parameters we populated. If we were to populate NationalIDNumber instead of LoginID, SQL Server would build a plan for that update statement separate from the one for LoginID. Now, I’ll avoid, for now, the topic of having our developers build specific stored procedures for each of these scenarios. That topic is out of scope for the points being considered here.

In summary, you can use an OR operator and get acceptable results most of the time, and often no one ever sees a performance issue using it. However, it’s best to understand what SQL Server is doing when you decide to use an OR operator. Let’s try to avoid them, and prevent future bottlenecks from popping up in our code.