Fix a SQL Database

Fix a SQL Database

A true story

Several years ago, before we moved our Production and Development systems over to Kubernetes running on Azure, we used a well-known hosting provider in the UK. Initially, we used Virtual Windows Machines on Windows Hyper-V Server, but the performance was never good enough.

Even though we paid for a guaranteed level of service, there were times of the day when our application tanked. Our CPU, memory, disk, and network were little more than idling along, but still, the performance was terrible. An hour or two later, the application was snappy once more. Traffic hadn’t changed, we hadn’t done anything, it just started performing as it should.

So, we knew we were not receiving the service we were paying for, so we moved from the VMs to physical systems instead. After the move, we never again had performance problems with the same applications, traffic volumes or databases.

All was well until one of our databases started having problems one day, and, long story short, the database became corrupted and went offline. This only happened on one database – all others were fine. We thought we were to blame and must have done something wrong somewhere. After all, SQL Server is rock solid.

The problem was we hadn’t done anything new, and the same software version was being used in exactly the same way on all of the other systems. So, we didn’t understand what we had done wrong.

We took our recent – and not-so-recent – code changes apart, looking for problems, but we couldn’t see anything suggesting a problem.

Whilst part of the team continued looking for the problem, we had to get the system back online. Fortunately, we had backups (you have backups, right?) – so we restored that database, and all was well again.

The team never found the cause of the problem. We also ran ‘DBCC checkdb’ commands every day, and the problem did not reoccur. In truth, we forgot about it and put it down to a weird one-off.

However, several months later, the same thing happened. This time, we found the problem when we ran the ‘DBCC checkdb’ command. We picked it up early, and the database did not go offline. So, this time, we dug into the problem differently. We checked the Event Logs and found that the SQL server had created a Crash Dump, stating that the problem was a memory issue.

At this point, we contacted the hosting provider and had them look into the server hardware. They found nothing. All of the hardware tests they ran gave the server a clean bill of health.

The system with the fault had had a big increase in traffic since the first problem occurred, so we had to schedule some downtime with the customer before we could fix things. Again, we fixed the database, and the system continued working. We continued, in vain, to find the problem.

We had the same problem a week later and contacted the hosting provider again. This time, they checked their logs – not their system logs, but the logs of the work their people moving around in the data centre – and found that every problem instance corresponded with their team working in that data centre area.

They eventually took the server out of service and checked it properly. This time, they found that the motherboard had a faulty memory socket, and as people walked past the server, the vibrations from their footsteps caused the RAM stick to wobble. Each time, this caused a memory issue in SQL, which created a crash dump and caused a problem in the database.

They simply swapped the motherboard out, and we never had a problem with it again.

What SQL statement did we run?

As we worked through various scenarios to determine what had happened, we ran these commands: –

To reset an offline database

EXEC sp_resetstatus [YourDatabase];

To check a database for corruption

DBCC checkdb([YourDatabase])

You can also run the above command like this to only have it show problem areas: –

DBCC CHECKDB ('WintrixPolicywise') with no_infomsgs

Hopefully, this will report no errors, but the next best result is that it says there is corruption that can be fixed by running the command with REPAIR_REBUILD.

To fix some database errors on indexes

ALTER DATABASE [YourDatabase] SET EMERGENCY
ALTER DATABASE [YourDatabase] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC checkdb([YourDatabase], REPAIR_REBUILD)
ALTER DATABASE [YourDatabase] SET MULTI_USER

To fix database errors but potentially lose data

If the initial command comes back stating you might lose data, you run the command like this: –

ALTER DATABASE [YourDatabase] SET EMERGENCY
ALTER DATABASE [YourDatabase] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC checkdb([YourDatabase], REPAIR_ALLOW_DATA_LOSS)
ALTER DATABASE [YourDatabase] SET MULTI_USER

Unfortunately, this may result in data loss; determining what is lost isn’t easy. Details for that sort of problem are beyond the scope of this article.


Stephen

Hi, my name is Stephen Finchett. I have been a software engineer for over 30 years and worked on complex, business critical, multi-user systems for all of my career. For the last 15 years, I have been concentrating on web based solutions using the Microsoft Stack including ASP.Net, C#, TypeScript, SQL Server and running everything at scale within Kubernetes.