Behind the scenes

Black Friday: How a risky experiment saved the day

Dominik Bärlocher
29.11.2017
Translation: machine translated

Last Friday, Digitec Galaxus AG was in a state of emergency throughout Switzerland. The biggest concern was whether the servers could cope with the onslaught. Software engineer Enes Poyraz was on the front line. He talks about a day when the engineers put all their eggs in one basket.

17 seconds. That's how long it took digitec.ch to go offline for the first time last Friday, internationally known as Black Friday. The reason: too many user requests. Our company's servers collapsed after 17 seconds under the load generated by you and other customers, because too many people wanted to take advantage of the special offers on the international sales day.

"We actually thought that the servers would stay down for longer," says Junior Software Engineer Enes Poyraz.

The engineer recounts a day on which he and his team deliberately made the company unproductive and the engineers saved your deal with an act of desperation.

Black Friday is 17 seconds old before our servers give up

Enes is present the second the servers go down for the first time. This is because a troop of engineers - all teams are named after James Bond films - is on stand-by every year for Black Friday: in the new offices on Förrlibuckstrasse, working from home, or somewhere out there on laptops with mobile internet. They are waiting for the servers to give in, ready to step in wherever they can.

But at midnight they reach their limits. The otherwise proud engineers have to take a hard knock. The men and women, for whom even a downtime of a few seconds is too long, only manage to get the servers back online after a good two hours. Even then, the shop is not stable: the site keeps going offline again, though only very briefly. Only a livestream from Digital Marketing remains active.

"It's a bit of a hassle for customers, but it's expected behaviour on the engineering side," says Enes. He shrugs his shoulders. Sure, it's unpleasant and the engineers try to minimise these times. But when a nation hammers a website, it can happen.

The second wave

It gets quiet. After 2 a.m., the inhabitants of Switzerland are asleep, having snagged their bargains. Later in the day, I hear from the Zurich shop that one person has been assigned solely to processing smartphone subscriptions, because last Friday there was a 50% discount on every smartphone when taking out a plan. The plans can also be purchased online.

But Enes doesn't notice any of this. He goes home shortly before 3 a.m. and sleeps for a few restless hours. His mobile lies next to him with the ringer on loud, because a call could come at any time telling him the servers are down, he is needed and he has to get back to the office immediately. But the call never comes. Enes sleeps. Other engineers do the same, including Team Leader Software Engineering Raphael Renaud, who is actually the one on call. Raphael's phone does ring at 5 in the morning: a call from Wohlen, asking why there are so few orders in the system. The central warehouse in Wohlen is also in a state of emergency on Black Friday, running at full capacity and full speed.

At around 9 a.m., Enes is back at his desk and describes himself as "well-rested". The servers are holding. Barely.

"The load increased continuously over the course of the morning," says Enes, "and we realised that if it continued like this, we wouldn't be able to make it through the day."

That is out of the question. As there are now more engineers in the offices than the previous night's on-call detachment, they can split up. One team dedicates itself to the internal systems: everything that can be switched off is switched off. At 9.36 a.m., an email from Chief Information Officer Oliver Herren informs digitec employees that internal tools such as employee time tracking and some functions in the shop backend are being switched off to conserve server resources. Everything that is hosted locally and can be switched off is switched off.

Employees in the offices at the Zurich headquarters swallow hard. We are an online shop. Our website is our capital, the place that pays our bills and brings you your latest gadgets. We are worried. Especially because the editorial team can more or less cease operations: the entire magazine is banished from the front page. The images you click on eat up too many resources.

But it's all in vain: at almost exactly 12 noon, the servers give up again. It's not as bad as the night before, the site comes and goes by the second, but it's still too unstable for a good shopping experience.

The engineers forget all about lunch and set about getting the site back online.

An experiment saves the day

While Oliver and a team of engineers from all engineering departments take services offline, Enes and two other engineers are busy hatching a plan: what to do if taking the internal services offline is not enough?

The three engineers are tasked with finding a solution in case all else fails.

"It doesn't get any more out-of-the-box than that," says Enes. He is a little proud to have been part of the team that saved the day. The usually quiet man suddenly speaks a little louder.

The solution is called Redis, a cache system that the engineers have been keeping an eye on for a while. Enes is one of those who have already tested applications on it. A cache is nothing more than a data store that keeps the responses to frequently made requests and can therefore serve them faster. An example: if you want to open the Black Mobile page and thousands of others want to do the same, the database behind the website no longer has to be queried every time. The cache has already stored a version of the page and serves it to you. This frees up resources for the rest of the purchasing process.
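To make the idea concrete: the article doesn't show any of digitec's code, but a cache-aside lookup of this kind could look roughly like the following sketch in Python with the redis-py client. The host, key name, TTL and render function are invented for illustration.

```python
import redis

# Illustrative only: host, key name, TTL and the render function are made up.
r = redis.Redis(host="localhost", port=6379)

def render_black_mobile_page() -> str:
    # Placeholder for the expensive part: database queries, templating, etc.
    return "<html>...Black Mobile deals...</html>"

def get_black_mobile_page() -> str:
    cached = r.get("page:black-mobile")
    if cached is not None:
        # Cache hit: serve the stored copy, no database work needed.
        return cached.decode("utf-8")
    # Cache miss: build the page once, then store it for the next visitors.
    page = render_black_mobile_page()
    r.setex("page:black-mobile", 60, page)  # keep it for 60 seconds
    return page
```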

"Sure, we have a cache solution that runs smoothly in normal everyday use," says Enes. But every time a server goes down, the caches have to be recalculated. Even worse: each of our servers calculates the cache locally. In other words, every time a server goes down, the cache has to be recalculated locally on the server machine. This eats up resources that are actually needed by the customer. Nobody in the company thinks about readers of the magazine any more. The Marketing department has long since laid down its arms. Product management is wondering whether their stocks are sufficient. Shop employees are faced with long queues. The engineers, however, are faced with a nation of shoppers and are not even thinking about giving up.

"I haven't tried Redis on a large scale yet," he says, "but I spent two days testing it out." That was never enough to entrust the system with the largest online shop in Switzerland. When Enes looks at his two-week-old notes, he can't help but grin.

I'm sure there are several areas of application here at ...
Notes by Enes Poyraz

The engineers decide to take Redis live without testing it on a demo system. A risky endeavour. Normally, before a company puts software into a live system, it is put through its paces by internal departments and often by external parties as well. After all, just because software sounds like exactly what a system needs according to the manufacturer's marketing material doesn't mean it will deliver the promised added value. Sometimes using it only makes things worse.

"What do you think, will it be good?" Enes is asked.

The engineer nods.

Redis takes over

From then on, everything happens quickly. According to Enes, Redis is "damn quick and easy", so the server is up and running within 20 minutes. Whereas digitec's old cache system runs locally on each server, Redis is centralised on one server and serves the requests of all the others. In other words, every server writes to the shared Redis cache, making the solution more scalable and reducing the load on the servers that read from it.

"To ensure that we don't start a completely reckless #yolo campaign, we have also set up a SwitchBit," says Enes. A small team from IT Operations had to agree to constantly monitor the servers so that the shop is not completely shut down. This is because IT Ops is already busy monitoring server performance all day long. According to Enes, they are the ones who had his team's back for the experiment with Redis.

The problem comes with the go-live, which is about as uncoordinated as it gets. Enes wants to launch Redis on a managed Microsoft server so that the internal load is not too great.

"Unfortunately, Redis is running on port 6380, which is closed here," he says. He makes an emergency request to open the port, but this is not possible because this port is blocked by Microsoft's server. But Enes has learnt one thing: he always has a second plan up his sleeve.

"At the same time, I tried to get Redis running on Google's cloud," he says. But even that was more difficult than expected. With the help of senior software engineer Michal Nebes, he manages to get a Redis cluster up and running.

At 4 pm it's time to get serious. A short code review. According to Senior Software Engineer Boško Stupar, the code is good to go, and he measures the data's round trip, i.e. the time it takes for a request to travel to the server and for the response to return to the user's computer.

  • Normal: 50 milliseconds to 500 milliseconds. Too slow
  • Redis, local test system: 8 to 9 milliseconds
  • Redis, productive: 16 to 19 milliseconds
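How exactly the round trip was measured isn't described; one simple way to approximate it, sketched here with a placeholder host and test key, is to time a series of small reads and average them.

```python
import time
import redis

# Rough round-trip measurement: write once, then time a series of reads.
# Host, port and the test key are placeholders, not digitec's real setup.
r = redis.Redis(host="localhost", port=6379)
r.set("rtt:test", "x")

samples = []
for _ in range(100):
    start = time.perf_counter()
    r.get("rtt:test")
    samples.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"average round trip: {sum(samples) / len(samples):.1f} ms")
```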

Redis goes online. The evening's sales are placed in the hands of an untested system with minimal safeguards.

Anxious seconds.

Redis reads requests, builds up a cache.

The load on the servers decreases noticeably. The shop stabilises.

The engineers breathe a sigh of relief.

Boško Stupar later wrote on Facebook:

Black Friday survived. That was the greatest day of my career! So stressful, so hard, so rewarding. For nerds: we implemented two-stage caching with a centralised Redis farm. Open heart surgery without painkillers. PS: I hate Black Friday.
Boško Stupar
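The article doesn't spell out what "two-stage caching with a centralised Redis farm" looks like in code. A common construction, sketched here purely as an assumption, puts a tiny in-process cache in front of the shared Redis instance: most hits never leave the web server, the rest are answered by the central cache, and only the leftovers reach the database.

```python
import time
import redis

# Hypothetical two-stage cache: stage 1 is a small in-process dictionary,
# stage 2 is the centralised Redis instance shared by all web servers.
# Names, TTLs and the loader function are illustrative, not digitec's code.
redis_client = redis.Redis(host="localhost", port=6379)
local_cache: dict[str, tuple[float, str]] = {}  # key -> (expiry, value)
LOCAL_TTL = 5   # seconds a value may be served from process memory
REDIS_TTL = 60  # seconds a value lives in the shared cache

def get(key: str, load_from_database) -> str:
    now = time.time()
    # Stage 1: in-process cache, no network round trip at all.
    hit = local_cache.get(key)
    if hit and hit[0] > now:
        return hit[1]
    # Stage 2: shared Redis cache, one fast network round trip.
    cached = redis_client.get(key)
    if cached is not None:
        value = cached.decode("utf-8")
    else:
        # Last resort: the expensive database/rendering path.
        value = load_from_database(key)
        redis_client.setex(key, REDIS_TTL, value)
    local_cache[key] = (now + LOCAL_TTL, value)
    return value
```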

The engineers meet up after sunset for a beer on the fifth floor of Pfingstweidstrasse, one eye always on the shop. It's not exactly a lively party, but they relax.

At 6.55 pm, Enes gives the all-clear via WhatsApp: the servers are running normally, under normal load.

Meanwhile, people are queuing in the shop, taking out smartphone plans and picking up orders. Store Manager Adrian Maier stands at the shop entrance in Zurich and keeps customers informed about waiting times. Twenty minutes, forty minutes. He is tired, just like the crew behind the tills.

The engineers are halfway through their day, but the shop employees are the last ones who have to hold out. At 8 pm, the day is also over at the tills: no more plans, no more deliveries, no more questions.

Quo vadis, Redis?

The Redis system remains online over the weekend until Monday evening so that it can also handle Cyber Monday. The engineers are proud that the shop stays stable on Monday. But then Redis has done its job and the cluster is taken offline again.

"Because when it's all said and done, we're still talking about a largely untested system here," says Enes.

The engineers have a long list of questions to answer before they can permanently connect Redis to the network in good conscience. Many of them sound something like this: "Why does Redis do $thing?"

Now that the situation has calmed down again, the engineers can turn their attention to these questions. Because if Black Friday has shown them anything, it's that Redis has potential. It would be a shame not to utilise it.

Journalist. Author. Hacker. A storyteller searching for boundaries, secrets and taboos – putting the world to paper. Not because I can but because I can’t not.
