The infrastructure decisions that quietly decide your uptime

Almost nobody chooses their uptime on purpose. They choose it by accident, two years earlier, when someone clicked through a hosting signup at 11pm before a launch and never touched the settings again.

That's the pattern we see most. A business gets set up once, it works, and the whole thing goes quiet in everyone's mind. Then on some random Tuesday the SSL certificate expires, or the disk fills to 100%, or a dependency that hadn't been patched since 2023 gets exploited. The site goes down. And it almost always goes down at the worst possible time, because the worst possible time is also the busiest time, which is exactly when the disk was going to fill or the traffic was going to spike.

We've cleaned up enough of these to have opinions. The decisions that actually decide whether you stay up are unglamorous. None of them are exciting to buy. All of them are cheap compared to a bad outage. So let's go through the ones that matter.

You can't fix what you can't see

The first real decision is whether anything is watching.

Most small businesses find out they're down the same way their customers do: someone emails to say the site's broken, or sales just go flat and nobody knows why for six hours. If your first signal is a human complaint, you've already lost the early window where a problem is small and easy.

Monitoring is the cheapest reliability you can buy. We're talking about a few dollars a month for an uptime check that pings your site every minute from a couple of locations and texts you when it stops answering. Add a check on the SSL cert expiry date, because expired certs cause a shocking number of "the whole site is broken" panics and they're 100% predictable. Add disk space alerts at 80% so you get a week of warning instead of a dead server.
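To make that concrete, here's a rough sketch of those three checks using only Python's standard library. It's an illustration, not a replacement for a hosted monitoring service: the function names, the 21-day certificate warning window, and the idea of running it from a scheduler on a separate machine are all our assumptions. Only the 80% disk threshold comes from the advice above.

```python
import shutil
import socket
import ssl
import time
import urllib.request
from typing import List, Optional

DISK_ALERT_FRACTION = 0.80  # warn at 80% full, per the advice above
CERT_WARN_DAYS = 21         # assumed warning window; pick your own


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """True if the URL answers without an HTTP error or a network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read the notAfter field from the certificate the host presents.

    If the certificate has already expired, the TLS handshake itself fails
    with an ssl.SSLError, which is also worth alerting on.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def cert_days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days until a notAfter timestamp such as 'Jan  1 00:00:00 2030 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def disk_fraction_used(path: str = "/") -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def alerts(up: bool, days_left: float, disk_used: float) -> List[str]:
    """Turn the three raw readings into human-readable alert messages."""
    found = []
    if not up:
        found.append("site is not answering")
    if days_left < CERT_WARN_DAYS:
        found.append(f"certificate expires in {days_left:.0f} days")
    if disk_used >= DISK_ALERT_FRACTION:
        found.append(f"disk is {disk_used:.0%} full")
    return found
```

The uptime and certificate checks only help if they run somewhere other than the server they're watching. The disk check has to run on the server itself. Sending the messages by text or email is left out here because that depends on whichever alerting service you use.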

Here's the honest part. Monitoring doesn't prevent anything. It just changes when you learn. The difference between finding out when the site is 2% degraded and finding out when it's fully down is usually the difference between a quiet fix and a public incident. That gap is worth paying for.

A backup you've never restored is not a backup

I want to be blunt about this one. If you have never done a test restore, you do not have backups. You have hope, stored offsite.

We've watched it happen. The backup job had been "running" for a year. Green checkmarks every night. Then the day someone actually needed it, the restore failed because the job had been silently backing up an empty directory after a path changed. Nobody noticed, because nobody ever looked. The checkmark was lying and there was no way to know without trying.

So the decision isn't "do we have backups." Almost everyone says yes. The decision is "when did we last restore one and time how long it took." Pick a date this quarter. Actually pull a backup, stand it up somewhere, confirm the data's real and current. Write down how many hours it took, because that number is your real recovery time, and it's almost always longer than people guess.
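A restore drill can start very small. The sketch below assumes your backup is a tar archive of files, which may not match your setup, and the function name and the two-day freshness limit are ours. It extracts the archive into a scratch directory, counts what came out, checks how old the newest file is, and times the whole thing. It would have caught the empty-directory backup described above.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def restore_drill(archive_path: str, max_age_days: float = 2.0) -> dict:
    """Extract a tar backup into a scratch directory and report what came out."""
    started = time.monotonic()
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path) as tar:
            # Only extract archives you created yourself. On Python 3.12+
            # consider passing filter="data" for safer extraction.
            tar.extractall(scratch)
        files = [p for p in Path(scratch).rglob("*") if p.is_file()]
        total_bytes = sum(p.stat().st_size for p in files)
        newest_mtime = max((p.stat().st_mtime for p in files), default=0.0)
    seconds = time.monotonic() - started

    # An archive with no files gets an infinite age so it can never pass.
    age_days = (time.time() - newest_mtime) / 86400 if files else float("inf")
    return {
        "files": len(files),
        "bytes": total_bytes,
        "newest_file_age_days": age_days,
        "restore_seconds": seconds,
        "ok": bool(files) and total_bytes > 0 and age_days <= max_age_days,
    }
```

This only proves the files come back. For a database or a full application, the real drill still means starting it up from the restored copy and looking at the data, and the number you write down is how long that took, not just the extraction time.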

A couple of specifics we hold ourselves to: keep more than one backup generation, because ransomware and bad deploys both love to overwrite your most recent good copy. Keep one copy somewhere physically separate from the live system. And test the restore on a schedule, not the day you're panicking.
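Those three rules are simple enough to check automatically. Here's a minimal sketch, assuming you can list your backups with a timestamp and a location label. The Backup class, the location names, and the 92-day limit standing in for "once a quarter" are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Backup:
    taken_at: datetime
    location: str  # hypothetical labels, e.g. "same-server" or "offsite-bucket"


def policy_problems(
    backups: List[Backup],
    live_location: str,
    last_restore_test: datetime,
    now: datetime,
    min_generations: int = 2,                             # "more than one generation"
    max_between_drills: timedelta = timedelta(days=92),   # assumed: roughly a quarter
) -> List[str]:
    """Return the backup rules that are currently being broken, if any."""
    problems = []
    if len(backups) < min_generations:
        problems.append(f"only {len(backups)} backup generation(s) kept")
    if not any(b.location != live_location for b in backups):
        problems.append("no copy stored away from the live system")
    if now - last_restore_test > max_between_drills:
        problems.append("restore has not been tested within the allowed window")
    return problems
```

An empty list means the rules hold today. Anything else is a to-do item you'd rather find on a calm day.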

Patching is a discipline, not an event

Software rots in place. The server you set up two years ago and never touched is running libraries that are two years out of date, and every month that goes by adds newly published vulnerabilities that anyone can look up.

Nobody wants to patch, because patching occasionally breaks things, and a thing that's working feels safer left alone. I get it. But "left alone" is how you end up with a server you're afraid to touch, which is the most dangerous server you can own. The longer you wait, the bigger and scarier each update gets, until patching feels like open-heart surgery instead of a routine.

The fix is rhythm. Small, frequent updates on a staging copy first, then production on a known cadence. Automate the security patches that are safe to automate. Reserve human judgment for the major version jumps. You don't need to be on the absolute latest of everything. You need to not be two years behind on the things attackers actually target.
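One way to picture that split between automated and human-reviewed updates is a small triage rule. This sketch assumes your packages use three-part semantic version numbers (major.minor.patch), which not every project does, and the category names are ours rather than any tool's.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def triage(installed: str, available: str, is_security_fix: bool) -> str:
    """Decide how an available update should be handled."""
    old, new = parse_version(installed), parse_version(available)
    if new <= old:
        return "up to date"
    if new[0] > old[0]:
        # Major version jumps can change behavior, so a person decides.
        return "human review"
    if is_security_fix:
        # Same major version and a security fix: safe enough to automate,
        # still going through the staging copy first.
        return "auto-apply via staging"
    return "next scheduled window"
```

For example, triage("2.4.1", "2.4.3", True) lands in "auto-apply via staging", while triage("2.4.1", "3.0.0", True) goes to "human review" even though it's a security fix, because the major version changed.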

Single points of failure are decisions too

Every setup has a piece that, if it dies, takes everything with it. One server. One database. One person who knows the password. The question is whether you know where yours is.
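Finding yours can be as simple as writing the inventory down and counting. In this sketch every component name and entry is made up for illustration. The only rule is that anything with one provider, or one person, gets flagged.

```python
from typing import Dict, List


def single_points_of_failure(inventory: Dict[str, List[str]]) -> List[str]:
    """Return the components that depend on exactly one thing (or nothing)."""
    return sorted(name for name, providers in inventory.items() if len(providers) <= 1)


# Hypothetical inventory: each component maps to whatever can currently provide it.
inventory = {
    "web server": ["vm-1"],
    "database": ["db-primary", "db-replica"],
    "dns": ["provider-a"],
    "admin password": ["Sam"],
    "backups": ["nightly-local", "weekly-offsite"],
}

flagged = single_points_of_failure(inventory)
# flagged == ["admin password", "dns", "web server"]
```

The list isn't a shopping list. It's just the map you need before deciding which of those are worth paying to fix.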

You don't need to eliminate every single point of failure. Redundancy costs money, and for a lot of small businesses, full redundancy is overkill for the actual stakes. A brochure site that loses an hour once a year is annoying, not fatal, and paying for 99.99% on that is wasted money. Be honest about what an hour of downtime actually costs you, then buy reliability to match. Not everything needs four nines.
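To put numbers on "buy reliability to match," it helps to turn an uptime percentage into hours and then into money. The sketch below does that arithmetic. The function names are ours, and the dollar figures in the usage note are invented for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (1 - uptime_percent / 100) * period_hours


def upgrade_pays_off(
    cost_per_hour_down: float,
    current_uptime: float,
    target_uptime: float,
    yearly_price: float,
) -> bool:
    """True if the downtime avoided is worth more than the upgrade costs per year."""
    hours_saved = downtime_hours(current_uptime) - downtime_hours(target_uptime)
    return hours_saved * cost_per_hour_down > yearly_price
```

Three nines (99.9%) works out to about 8.76 hours of downtime a year, and four nines (99.99%) to about 53 minutes. Going from one to the other saves roughly 7.9 hours a year. If an hour offline costs a brochure site $50, that's about $394 a year, so a $2,400-a-year redundancy bill doesn't pay off. If an hour offline costs a checkout $2,000, the same 7.9 hours is worth about $15,768, and the same bill is an easy yes.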

But some things do. If your checkout, your booking system, or your core app going down means real lost revenue or lost trust, that's where you spend. The decision is matching the spend to the stakes, not buying uptime everywhere or nowhere.

What this looks like when it's done right

A manufacturing client came to us running everything on aging on-premise servers in a closet. It mostly worked, until it didn't, and "didn't" was happening enough that they were living around 94% uptime. That sounds close to fine. It's roughly 43 hours of downtime a month (6% of a 720-hour month), and they felt every one.

We moved them to a monitored hybrid-cloud setup: keep the latency-sensitive pieces close, push the rest to cloud infrastructure that scales and gets patched on a cadence, with backups that actually get test-restored. Uptime went from 94% to 99.9%. Their infrastructure cost dropped 45%, mostly from not maintaining hardware they'd outgrown. And deployments got about 80% faster, because a sane setup is a setup you're not afraid to ship to. None of that came from anything fancy. It came from doing the boring parts on purpose.

This is the core of our managed hosting and uptime work, and for businesses stuck on hardware they've outgrown, it usually starts with a cloud migration that takes the closet out of the equation.

You don't need a full-time ops hire for any of this

That's the part people get wrong. They assume reliability means hiring someone senior and expensive, so they do nothing instead.

The truth is most of this is setup-once-then-maintain. Monitoring, automated patching, a tested backup routine, and a clear picture of your single points of failure. None of it needs a person sitting and watching screens. It needs the decisions made well at the start and reviewed a few times a year. That's a few hours a quarter, not a salary.

Common questions

How much downtime is actually acceptable? Depends entirely on what's down. For a marketing site, a few hours a year is fine and chasing 99.99% is a waste. For anything that takes payments or bookings, that math flips fast, because the lost revenue per hour gets real. Start by estimating what one hour offline costs you, then size your spending to that number rather than to a vanity uptime figure.

We already have backups. Isn't that enough? Only if you've restored one recently. We say this a lot because it keeps being true: an untested backup is a guess. Schedule a real restore this quarter, time it, and confirm the data is current. If it works, great, now you know your recovery time. If it doesn't, you just found out on a calm day instead of a catastrophic one.

What's the single highest-value thing to do first? Turn on monitoring, today. It's cheap, it takes an afternoon, and it converts every future problem from a surprise into an early warning. Everything else, backups, patching, removing single points of failure, is easier once you can actually see what your infrastructure is doing.

If you've set up your hosting once and haven't looked since, that's worth a second look before something forces one. We're happy to take a quick pass at your setup and tell you honestly where the weak points are, no pressure either way. You can get in touch here and we'll have a real conversation about it.