Hardening your server

spamless
June 14

Stupid Grub Tricks

This post is not about writing or writers. It’s for the tech weenies among you. It’s particularly aimed at those who do server administration.

We had a server outage last week. It looked entirely nasty. I could not get the server to boot. I was close to pressing the proverbial ‘self-destruct’ button and spending the next few days trying to rebuild from backups. God knows how that would have gone.

It turned out the culprit was a buggy kernel upgrade that was part of a normal Canonical (Ubuntu) release cycle for Ubuntu 20.04. Here is the bug report. It affected servers running containers. (That’s how Discourse works.) Affected servers won’t come back up from reboot or will do so only sporadically in an unstable state.

Okay, this isn’t a tech site, but if this site had gone bye-bye, you might have been interested in why. More important, I have a solution to up the ante on server robustness and keep this from happening again.

N.B.: I guarantee nothing. You are free to read along and you are free to try my suggestions, but I take no responsibility for your results or your site! I am offering this proposed fix for free, but don’t come crying if you try it out and you brick your server! I’m not your server admin. Okay?

Let’s harden the server against failure from buggy new kernels. We’ll do this by making one light customization to the grub boot loader. (You need sudo rights to do this. Back up your server first!!! Make sure you know how to restore!)

This is what worked for me as someone with some ability to wield a Linux command line and navigate a server console.

The idea is to instruct your server, whenever it gets rebooted in unattended mode, to boot by default to the previous kernel. But if you are present and upgrading the kernel actively, you can instruct it to boot to the new kernel.

Servers don’t generally get rebooted all that often. They’re not like your home machine that you might shut down every night. They reboot, or get rebooted, when certain rare events or upgrades are invoked. If your server reboots in auto-upgrade mode, or if there is a crash or downtime for other reasons and you’re not around, you don’t necessarily want the server to come up under a new kernel you’re not sure will work.

On the other hand, if you’re right there watching, you might want to boot to the new kernel. If your machine boots and runs normally, terrific. Let it run. There will typically be weeks, months, or perhaps even years before another reboot.

If you try to boot to the new kernel and it fails, well, it will likely eventually rescue itself and reboot – give it up to about 15 minutes – or alternatively, you can go to the cloud console at your server provider and restart it, possibly via a full power-down cycle. If you’ve invoked this safety measure, then when it restarts it will boot into the previous kernel once more. The emergency will be abated.

We’re going to edit a text file called /etc/default/grub. This has to do with the heart of how your computer knows to boot up, so please be careful. Here are the first ten lines of mine before any changes:

spamless@discourse:~$ head /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

The settings in this file are used to generate a second file called /boot/grub/grub.cfg. That happens when you update the configuration for grub, your system’s boot loader. (See the very first line of /etc/default/grub above for how you will do that.)

The line in /etc/default/grub that says GRUB_DEFAULT=0 tells your server to boot into the first menu entry in that /boot/grub/grub.cfg file. As indicated above, that file gets refreshed whenever your system runs update-grub or when you run it manually. While you and I typically start counting with the number 1, computers typically start their count with the number 0. That’s why that number is a zero and not a one.

There are very likely more menu choices in your grub.cfg file; unless somebody’s done some pretty unusual and bizarre editing to your system’s boot mechanisms, there will be. As stated, they’re called menu entries, and they are usually grouped in divisions called submenus. Here are all lines with the word “menu” in them in my Discourse server’s grub.cfg file:

I’ve highlighted menu line "0" and menu line "1>2". (The greater-than sign is the syntax used to designate submenus: the 1 is the submenu itself, which is the second item at the top level, and the 2 is the entry inside it. The highlighted menu entry in the submenu area is the third choice. But remember, we started counting with the number zero. That’s why it’s entry No. "1>2" to the grub system.) The third entry under the first (and only) submenu section is the one that will tell my server to boot to the previous kernel, which is still on my system. (Check yours!)

For me, 5.13.0-1034-oracle is the newest kernel on my system. You can see that string in my screenshot, in the entries above the second highlighted line. The highlighted line that says 5.13.0-1028-oracle is the previous kernel. (And of course, it’s still installed on my system.)
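
If you would rather not count entries by eye, the indexing described above can be worked out with a short script. This is only a sketch: list_grub_entries is a made-up helper name, and it assumes the layout update-grub generates on Ubuntu, where top-level entries start in the first column and entries inside a submenu are indented. It only reads the file and changes nothing.

```sh
# Print each boot entry in a grub.cfg next to the index GRUB uses for it.
# Assumption: top-level "menuentry"/"submenu" lines start in column 0 and
# entries inside a submenu are indented, as in the file Ubuntu generates.
list_grub_entries() {
    awk -v q="'" '
        function show(idx, line,    parts) {
            split(line, parts, q)      # the title sits between the first two single quotes
            printf "%s\t%s\n", idx, parts[2]
        }
        BEGIN                    { top = -1 }
        /^(menuentry|submenu) /  { top++; inner = -1; show(top, $0) }
        /^[ \t]+menuentry /      { inner++; show(top ">" inner, $0) }
    ' "${1:-/boot/grub/grub.cfg}"
}

# Usage: list_grub_entries                 (reads /boot/grub/grub.cfg)
#        list_grub_entries /path/to/copy   (reads a copy instead)
```

Each output line is an index, a tab, and the entry title. The index in the first column is what you would hand to grub-reboot or put in GRUB_DEFAULT, after checking it against the file yourself.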

You are responsible for checking your system and confirming which boot entry contains the previous kernel. You are responsible for checking your system and confirming that the menu entry you are interested in points to a kernel that works with your configuration and is still on your system!
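
One way to confirm that a kernel is still on the system is to look for its image in /boot, where Ubuntu keeps them as vmlinuz-<version>. A minimal sketch, with list_installed_kernels being a hypothetical helper name rather than a standard tool:

```sh
# Print the version string of every kernel image found in a directory.
# $1: directory to scan (defaults to /boot)
list_installed_kernels() {
    for k in "${1:-/boot}"/vmlinuz-*; do
        [ -e "$k" ] || continue              # nothing matched the pattern
        printf '%s\n' "${k##*/vmlinuz-}"     # strip the path and the "vmlinuz-" prefix
    done
}

# Usage: list_installed_kernels
```

The version in the menu entry you plan to use should appear in this list.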

Before you make any permanent changes, you might want to test your server for a successful reboot into the previous kernel. You can do this by rebooting with a special command that is part of the grub package and which should be on your system. This is the same command you will use later to boot manually to the current kernel instead of the older one you’ll be setting as default. So, make sure the program is present on your system. It’s called grub-reboot. It will typically be found as /usr/sbin/grub-reboot. You can read the manual page for it if you’d like.
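
A quick way to check that the program is there is sketched below. find_tool is a hypothetical helper name. It looks on your PATH first and then in /usr/sbin, since that directory is not always on a non-root user’s PATH.

```sh
# Print where a tool lives, looking on PATH first and then in /usr/sbin.
# Fails (non-zero exit status) if the tool is not found in either place.
find_tool() {
    command -v "$1" 2>/dev/null && return 0
    [ -x "/usr/sbin/$1" ] && { printf '%s\n' "/usr/sbin/$1"; return 0; }
    return 1
}

# Usage: find_tool grub-reboot || echo "grub-reboot not found"
```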

Good, let’s reboot to make this test. Where I designate “1>2” below, you should use what will be the right submenu and menu item for your own system. In the majority of cases, your system will be configured similarly to mine. But you must check. If the syntax is wrong, your server could hang in an unbootable state. Should that happen, you should (a) wait up to about 15 minutes to see if it corrects itself; or, if it doesn’t, (b) log onto your cloud server console and force a manual reboot. (A full, power-down reboot might be necessary.)

Here’s the command to run for the reboot test:

sudo grub-reboot "1>2"; sudo reboot now

Here I’m going to presume your system successfully came back up from the reboot. We want to check to confirm which kernel is running. Here’s how to do that:

spamless@discourse:~$ uname -r
5.13.0-1028-oracle

Good, that’s what I wanted: that’s my prior kernel. My current kernel, which I was running before the reboot, is 5.13.0-1034-oracle.

Okay, now we come to the dangerous part of this tutorial. Use your preferred system editor, with appropriate sudo permissions, to edit the /etc/default/grub file. Back it up first! Edit the line that says GRUB_DEFAULT=0, and change the 0 to "1>2" (if your system is sufficiently similar to mine) or to whatever the submenu group and menu entry are for you. Keep the quotation marks on the line: the file is read as a shell script, and an unquoted > would be taken as a redirection. I will aver that they could be single-quotes or doubles for this circumstance; I used doubles.
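
If you prefer to script the backup and the edit rather than do them by hand, something along these lines might work. It is a sketch, not a tested tool: set_grub_default is a made-up name, the .bak suffix is an arbitrary choice, and it relies on GNU sed’s -i option, which Ubuntu provides. It only edits the file; update-grub still has to be run afterwards.

```sh
# Back up a grub defaults file, then rewrite its GRUB_DEFAULT line.
# $1: new value, such as 1>2    $2: file to edit (defaults to /etc/default/grub)
# Assumes GNU sed (for -i). The value is written inside double quotes.
set_grub_default() {
    grub_file="${2:-/etc/default/grub}"
    cp -p "$grub_file" "$grub_file.bak" || return 1    # stop if the backup fails
    sed -i "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT=\"$1\"|" "$grub_file"
}

# Usage (from a root shell): set_grub_default "1>2"
```

Trying it on a copy of the file first, by passing the copy’s path as the second argument, lets you inspect the result before touching the real thing.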

Save the changes. Run the command to update grub:

sudo update-grub

Once that completes without error, you can reboot your computer in the normal way, presumably using sudo reboot now.

After it’s rebooted successfully, check the kernel version once more using uname -r. It should still show your older kernel. If it does, you’re almost there!
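
If you want that check in a form a script can act on, a small comparison like the one below is enough. kernel_is is a hypothetical helper name, and the version string in the usage line is just the one from my system.

```sh
# Succeed if the running kernel release equals the version given as $1.
kernel_is() {
    [ "$(uname -r)" = "$1" ]
}

# Usage: kernel_is 5.13.0-1028-oracle && echo "still on the previous kernel"
```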

Now you can reboot one last time and force the system to use the newest kernel. That’s the one you were running before you first rebooted according to this tutorial. To do that, pass 0 to grub-reboot; that’s the first menu entry in your grub.cfg file:

sudo grub-reboot 0; sudo reboot now

Your system ought to reboot and be running the newer kernel again. Test it:

spamless@discourse:~$ uname -r
5.13.0-1034-oracle

Hooray! Task completed.

Whenever you want to boot into the most recent kernel that’s installed on your system – this will presumably be after you’ve installed a new kernel and you are there to ensure it works – you can run that sudo grub-reboot 0; sudo reboot now command. Otherwise, if your system does an unattended kernel upgrade or if it reboots unexpectedly, it will reboot into the previous kernel, which is one you know to be safe to run your server with.

If you appreciate this tutorial, be sure to give a like or otherwise let me know. And if you want to join our tiny but growing community of expat or isolated writers and friends, you are most welcome to do so!

/dr