Last Friday the SmartQA site blinked out: inaccessible, "socially distanced", to use the modern terminology! This is a story of how some simple choices made by software developers while implementing an automated workflow can bring down a business, especially when the humans in support decide to become inaccessible too.
Let me tell you the story. The site smartqa.org became inaccessible last Friday, and after a few minutes I discovered that the site was not down, but unreachable. That is when my tryst with support started. Telephone calls, chats and emails went unanswered, and after five days of relentless pursuit the problem was sorted out without any help from support. So, what was the issue and what can we learn from this?
Well, the issue seemed to be that the domain expired on Friday despite having been renewed many weeks earlier. The renewal process seems to have been botched. This is a process that is completely automated, without any human intervention. What went wrong? After five days of being at it, I was given to understand that my current domain registrar had possibly shifted its business partnership for buying domains to a different bulk domain provider. This required the new bulk domain provider to authenticate every domain owner with the current registrar. So they sent an email to each domain owner (I guess, as I got one), which was supposed to be responded to by a certain date. In my case, the email seemed to have found its way into the ‘read’ folder somehow, and therefore I did not respond by the given date. So, on the date of domain expiry, the site went blink. All because of ONE email that I did not respond to! The email that somehow did not show up in my inbox. This email was never resent when no response was received. So I, as a customer, never knew about this, and my business stopped.
All because of ONE EMAIL THAT BROKE THE WORKFLOW of automated renewal! Do you know what the renewal costs? About Rs 1000 ($12)!
Just imagine if this had been an online business. $12 shuts down the business! All because a developer made a choice: assuming that a critical action in an automated workflow gets done, and never contemplating, "What if it is not done? How can I ensure that it is indeed done?" In these times, with businesses becoming fully digital, these kinds of simple choices can break a business.
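To make the point concrete, here is a minimal sketch of what a more defensive workflow step might look like. I do not know how the registrar's system is actually built, so everything here is my assumption for illustration: the `PendingConfirmation` record, the `next_action` function, the three-day reminder interval and the two-day escalation window are all hypothetical. The idea is simply that an unconfirmed critical step should keep nudging the customer and finally hand over to a human, rather than silently waiting for the deadline to pass.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class PendingConfirmation:
    """A critical step that needs a reply from the customer (hypothetical model)."""
    owner_email: str
    deadline: date
    confirmed: bool = False
    reminders_sent: List[date] = field(default_factory=list)


def next_action(item: PendingConfirmation, today: date,
                remind_every_days: int = 3,
                escalate_days_before: int = 2) -> str:
    """Decide what the workflow should do today for this step.

    The intervals are illustrative defaults, not values from any real registrar.
    """
    if item.confirmed:
        return "proceed"

    # Close to the deadline with no reply: stop relying on email alone.
    days_left = (item.deadline - today).days
    if days_left <= escalate_days_before:
        return "escalate_to_human"

    # No reminder yet, or the last one is old enough: send another.
    last = item.reminders_sent[-1] if item.reminders_sent else None
    if last is None or (today - last).days >= remind_every_days:
        return "send_reminder"

    return "wait"
```

With something like this in place, a single unread email would trigger a resend a few days later and, failing that, a call from a human before the expiry date, instead of taking the site down.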
In my case, I pursued the problem relentlessly, by analysing and by talking to a lot of people, and finally a Good Samaritan helped me nail the problem and then, poof, the solution happened. We all know that a problem is a problem until the solution happens. And in most cases, the solution is simple!
On a lighter note, with support going into quarantine and the site socially distanced, I went into the ICU 🙂 A happy Covid19 story this turned out to be, in the end.
“A typical accident takes seven consecutive errors”, wrote Malcolm Gladwell in his book “Outliers”. In the chapter “The Ethnic Theory of Plane Crashes”, he analyses airplane disasters and says that it is a series of small errors that results in a catastrophe. The other example he cites is the famous accident at Three Mile Island (the nuclear station disaster in 1979). You may want to read a nice article that I wrote on this, “Seven consecutive errors = A Catastrophe”.
When you are building large systems that transform others’ businesses, stay defensive. Don’t assume that every action will be done, be it by a human or by another system. Any one of these can break the chain, and the business.
—
Two weeks ago, I gave a keynote talk titled “Be a flow. Test Brilliantly.” at Tribal Qonf, an online test conference. “Good testing is a great combination of intellect, techniques, heuristics, process & technology. What does it take to do brilliant testing? Going beyond the intellect, into the deep, a state of flow, immersing into the act.” Here is a crisp FOUR minute version of this as a SmartBites video. CLICK HERE to watch.