
Behind the scenes
Black Friday Week: our servers are ready to rumble
by Noah Waldner
On Black Friday, one team made a name for itself. BlackJack played for high stakes, worked hard and rose from defeat to become the big winner of the night. A look at the work of the Site Reliability Engineers.
"Cyber Monday was easy," says René Zweifel, Team Leader of Site Reliability Engineering at Digitec Galaxus, "but on Black Friday we were a little shaky for a moment."
Because it was in the night leading into Friday that a year's work had to prove itself. In the interview on Tuesday morning, after it's all over, René is proud. He and his team - "and the entire engineering team in general" - have done a great job. The statistics are impressive.
"We think that's good. Because after Black Friday 2017, we had to say to ourselves: 'no... just no' and then get behind the books."
Saturday, 25 November 2017: Engineering has just suffered a heavy defeat. Digitec.ch and Galaxus.ch were offline for a good 2 hours and 40 minutes, despite pretty much every system on the servers that wasn't absolutely essential for shopping being shut down. Not in one stretch, though: "One minute we were online, the next offline." The result: nobody had any fun. You didn't get your deals, product management didn't get their sales and engineering - everyone agreed - didn't do its job.
This situation was one of the reasons why Team BlackJack was created. Team leader René Zweifel founded the new team and looked for people to help him with the new mission: Site Reliability Engineering. From then on, their job was to ensure that digitec and Galaxus remained online, come what may.
"After the Black Friday thing, it was quite a big task," says René.
But he and his five teammates didn't give up. Sure, the Redis cache system saved the 2017 edition of Black Friday, but that wasn't enough for BlackJack. They didn't want to leave anything to chance.
"The infrastructure would have had to be completely replaced in many places," says René.
Switches, routers and all the other network elements would have had to be thrown in the bin, as would the network clusters. A dedicated network would have had to be set up. And so on. That would have cost an infinite amount of money.
The alternative: moving to the cloud.
"That only costs 'almost an infinite amount of money', so it's cheaper than having your own infrastructure," says René and laughs. René has a genuine laugh, infectious and honest. The bearded young man with the short hair thinks for a moment and then says: "That was probably a story, I tell you."
He skips the details and says: "... The process was completed at the end of May 2018. Then we had a beer."
And before the beer came the realisation that the systems are now effectively infinitely scalable and therefore Black Friday-proof. In theory, at least.
To minimise the load on the servers despite the new infrastructure, BlackJack and the online shop teams worked on an isomorphic front end. In other words, some of the code that digitec and Galaxus interpret and compute now runs on your computer rather than on the server. Node.js and GraphQL reduced the number of requests sent to the monolith. "Well, I have to admit: BlackJack only set the challenge. It was always other teams who did the implementing," says René and laughs.
"The entire shop is not yet isomorphic. Only the parts that are important for days like Black Friday."
The implementation started in May. Black Friday 2018 was just around the corner. René and his team entered the final phase of the development year: load testing. The new system passed one test after another. Still, BlackJack worked according to the "bomb and optimise" principle and tweaked the setup here and there.
For Black Friday itself, BlackJack scaled the system up further. The load balancers were increased from four to six, and the shop servers ran on 30 processors with 16 cores each instead of eight octa-core processors. The Kubernetes clusters were also massively scaled up, as were many other components.
"At midnight, we were running at 600% system performance," says René. There is pride in his voice. But that wasn't just BlackJack. Because on Friday night, as in the previous year, engineers from all teams are present and on call. René himself is assigned to the second shift, starting at 7 a.m., but sits at home in front of a laptop and observes the situation. "They might need me."
Many engineers do the same. As soon as the battle log is updated with the latest information from the engineering war room, emails and text messages arrive in which night-owl engineers offer their help and advice. It is a masterclass in collaboration.
But shortly after midnight it becomes clear: no stress.
"We haven't even come close to utilising the 600%," says René proudly. This is despite the fact that the website is receiving more traffic than ever. Users are hitting the website with orders and comments, but the servers are holding up.
On the day after Cyber Monday, René sits in a red T-shirt in an armchair in the lounge on Pfingstweidstrasse. He is relaxed and enjoys talking about the engineers. The congratulatory emails from the executive board have done their bit. But René doesn't want to rest on his laurels. Neither does his team.
BlackJack isn't chalking this up as a one-hundred-per-cent success. René is particularly bothered by the four minutes of downtime.
"That's still simply too much, but we can easily halve it," he says.
When asked, he says that two of the four minutes were down to a tool called Queue-it. The tool promises to set up a kind of "digital waiting room". But the thing failed across the board.
"We were impressed. But not for the reason the developers would have liked," says René, a grin crossing his face, "we were impressed by how quickly Queue-it brought us to our knees. It took less than three seconds."
On Cyber Monday, Queue-it was no longer used. Combined with the drop in hits, this meant that the pages were never offline. He calls that a success. However, he admits that the failure may have been down to an error in the Queue-it implementation rather than the tool itself. The investigation is still ongoing.
That leaves two minutes that he and the rest of the BlackJacks still have to sort out. These are down to the failover - the fallback to an earlier configuration - at 00:57, when the engineers reset a database to an earlier state. The reason: too many PlayStation and AirPod purchases at the same time. Fortunately, the data lost in the failover wasn't confirmed purchase data, but data from users who repeatedly tried to buy from different devices. The overloaded database caused splash damage elsewhere, and for a few seconds hardly anything was available. The failover then resolved the situation for the rest of the night.
René looks to the future with BlackJack. There's a lot to do. Four minutes may not sound like much, but it will take months of work to eliminate them. And BlackJack, René is sure, can do it.
Team BlackJack is also looking for reinforcements.
Journalist. Author. Hacker. A storyteller searching for boundaries, secrets and taboos – putting the world to paper. Not because I can but because I can’t not.