Handling The Outage Today : ActiveInbox Blog

Hi all,

It’s been a few weeks since I posted an update (more on why in a mo), but I wanted to write this partly to give me a little perspective, so we can improve our emergency processes in the future; but mostly to say sorry.

Even though we technically had no warning of the Gmail change, I still feel an immense tightening of the chest imagining everyone sat there wondering why ActiveInbox isn’t loading, and the frustration that causes.

And perhaps more so that after 10 years, I still haven’t anticipated and safeguarded against every conceivable way that Gmail can break ActiveInbox. Hopefully this post-mortem will take us another step closer.

What Caused The Outage?

A small pleasantry in one of our minor features – the ability to show your name against your calendars when picking a due date – relied on a piece of data buried deep in Gmail.

Today, Gmail made a small update that broke that request for your name, which set off a series of escalating events that tripped up ActiveInbox and stopped it loading.

How Was It Detected?

Around 10am, we started getting the first notifications that two people couldn’t load ActiveInbox. Past experience means we react very quickly to these types of issues, as they’re often a “canary in the coal mine” sign that Gmail is changing, and we have a limited window before many people are affected.

As a consequence, by 11am I was doing a screenshare with Dale in the UK (one of our oldest customers), who very kindly let me run my diagnostics tools on his Gmail to find the problem.

How Quickly Was Everyone Informed?

While I was talking to Dale, Lisa tweeted that we were aware of the problem, and began responding to everyone who emailed in. The Get Satisfaction post that Dale had started became our official channel around 2pm.

How Long Did The Fix Take?

As a team we stopped everything to tackle this. The actual fix took about 2 hours, and was published to the Chrome Web Store as soon as we were done.

However, frustratingly, in recent months Chrome has slowed down our release of updates from 30 minutes to 24 hours. This has been the biggest toll on our responsiveness.

Is There A Workaround In The Meantime?

Joeri Cohen found that going back to the old Gmail would work (because it didn’t include the damaging Gmail change). Very kindly, that info was shared on the forum thread – thank you Joeri!

A more basic solution was to access your Gmail tasks via labels, because that’s how ActiveInbox works (it tries to store as much data as it can entirely within Gmail).

E.g. for your Low Priority items, look for the label “!Low Priority”. Or for items due today, look for the label ZD/20180710 (10th July 2018).
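
If you want to work out the label for any other date, here is a minimal sketch. The function name is hypothetical, and the "ZD/" prefix plus year-month-day layout are assumptions inferred only from the single example above, so treat it as an illustration rather than a description of how ActiveInbox builds its labels.

```typescript
// Hypothetical helper: builds a due-date label name in the style of the
// example above ("ZD/20180710" for 10th July 2018).
// ASSUMPTION: the "ZD/" prefix and YYYYMMDD layout are inferred from that
// one example, not taken from ActiveInbox's own code.
function dueDateLabel(year: number, month: number, day: number): string {
  // Left-pad a number with zeros to a fixed width (width of 4 or less).
  const pad = (n: number, width: number): string => ("0000" + n).slice(-width);
  return "ZD/" + pad(year, 4) + pad(month, 2) + pad(day, 2);
}

// Label to look for in Gmail for items due on 10th July 2018.
console.log(dueDateLabel(2018, 7, 10));
```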

How Could We Handle It Better In The Future?

We’ve had time to reflect on how we could improve to reduce the chance of this happening in the future. (As engineers, we never say never – but we want an extremely high likelihood of perfect running.)

In terms of raw development speed, I don’t think we could have actually fixed it any faster than we did, and I’m immensely grateful to Dale.

The bottleneck at present is in getting updates distributed. To reduce this, we’re going to try to minimise the causes of our emergency responses:

We’re going to adopt a new technology that will make Gmail UI changes less likely to impact us (InboxSDK, for those wondering).
We’re going to refine our coding process so that in team reviews, we look for and isolate any piece of code dependent upon Gmail data, so that if Gmail changes, the breakage won’t bring down the entire app (as happened today).
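
For the developers reading, the idea of isolating code that depends on Gmail data might look roughly like the sketch below. Every name in it (withFallback, readNameFromGmail, calendarOwnerName) is hypothetical and made up for illustration; it is not our actual code, just one way a non-essential lookup could fail without taking the rest of the app with it.

```typescript
// Hypothetical wrapper: runs a Gmail-dependent lookup and returns a fallback
// if the lookup throws or yields nothing, so that a non-essential feature
// cannot stop the rest of the app from loading.
function withFallback<T>(
  lookup: () => T | null | undefined,
  fallback: T,
  onError?: (err: unknown) => void
): T {
  try {
    const value = lookup();
    return value === null || value === undefined ? fallback : value;
  } catch (err) {
    // Report the failure for later diagnosis, but keep going.
    if (onError) onError(err);
    return fallback;
  }
}

// Stand-in for reading a piece of data buried in Gmail that has stopped
// working after a Gmail update (ASSUMPTION: invented for this example).
function readNameFromGmail(): string {
  throw new Error("Gmail data not found");
}

// The calendar picker simply shows no name; nothing else is affected.
const calendarOwnerName: string = withFallback<string>(
  readNameFromGmail,
  "",
  (err) => console.warn("Could not read name from Gmail:", err)
);
console.log("Name shown against calendars: '" + calendarOwnerName + "'");
```

The design choice here is that the fallback is decided at the call site, so each feature states up front what it does when Gmail’s data is missing.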

Anything Else?

You may also notice I’ve been a little quiet for the last 5 weeks. It’s because, after the major Gmail change of a few months ago, we’re still dealing with the aftershocks, and I’ve had to go back to coding to help out the rest of the team.

The good news is, as a consequence of what we’ve been working on, another major improvement to the ActiveInbox code is about to begin testing. It will include:

Faster loading, with a much more sophisticated cache system.
More robust, even handling periods offline.
The restoration of the ability to add or update tasks, and notes, while you’re composing an email.
A more robust approach to diagnosing why any problems occur. We’ll now be able to ‘replay’ any issues much more easily, so that when things do go wrong – as they sometimes must – we can fix things much faster.

This was written by Andy Mitchell