Ganesh Prasannah
January 1, 2022 • 10 minutes to read
On the runway to launch, working at a startup means putting out one fire after another.
You squash a bug when a user reports it. Speed up the product when someone complains. Roll out a feature when there’s a definite demand.
It’s so important to keep up the momentum and build what users want.
But once you spot that sweet product-market fit, you can no longer put off crossing the metaphorical bridge.
Beta testing is over. Users like and want your product. Now it’s time to level up, with proactive action instead of just reaction.
Of course, it’s impossible to get everything right in the first rollout. But being caught on the back foot is worse than making a slightly imperfect first impression.
To ready ourselves for the launch, we at OSlash launched a mission—internally named Project Apollo.
“One small step for OSlash, a giant leap for productivity”
We wanted to make sure that when we open the gates to the public, we avoid nasty surprises and that nobody dies. Of course, that’s asking a lot, but a team can try.
Our objectives were clear: understand how much load the platform can serve, see everything happening inside the system, and be ready to respond when things break.
To meet these objectives, we divided the engineering team into three squads, picked at random.
Note: The squads worked towards the success of the mission on top of their everyday tasks. It was not an easy ask, yet one they couldn’t have handled better.
The objective of the first squad, the Performance Squad, was to answer the question “How many users can our platform serve today?” With that answer, we gained two new abilities: identifying performance degradations, and planning capacity for expected traffic surges (like, well, a product launch!).
For most engineering systems, the usual metrics for quantifying performance are throughput (how many operations complete per minute) and latency (how long each one takes).
To get these numbers straight onto our dashboard, the performance squad broke performance down into parts and measured each one.
To do that, the squad employed Firebase Performance Monitoring, which tells us exactly how our users are experiencing our product.
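As an illustration, here is a minimal sketch of recording a custom trace with the Firebase JS SDK (v9 modular API); the trace name and the resolveShortcut helper are hypothetical stand-ins, not OSlash’s actual code:

```typescript
import { initializeApp } from "firebase/app";
import { getPerformance, trace } from "firebase/performance";

const app = initializeApp({ /* your Firebase web config */ });
const perf = getPerformance(app);

// Hypothetical stand-in for the app's real shortcut-resolution logic.
async function resolveShortcut(shortcut: string): Promise<string> {
  return `https://example.com/${shortcut}`;
}

// Wrap the operation in a custom trace; its duration shows up
// automatically in the Firebase Performance dashboard.
async function resolveShortcutTraced(shortcut: string): Promise<string> {
  const t = trace(perf, "shortcut_resolution");
  t.start();
  try {
    return await resolveShortcut(shortcut);
  } finally {
    t.stop();
  }
}
```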
The squad complemented Firebase with Sentry Performance Tracing to delve further into the exact user experience. Sentry reports the amount of time spent in different parts of our product. Say a user requests a shortcut such as o/roadmap: Sentry breaks down where the time goes at each step of serving that request.
We split every touchpoint in the product into smaller chunks known as transactions, so we can easily calculate how much time each transaction takes to complete. Any transaction that takes more than 1.2 seconds to complete leads to user misery. Identifying such transactions gave us a fine-grained analysis of all the potential bottlenecks.
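For a flavour of what this looks like in code, here is a minimal sketch using the Sentry Node SDK’s transaction API as it worked around the time of writing; the operation names and spans are illustrative, not our actual instrumentation:

```typescript
import * as Sentry from "@sentry/node";
import "@sentry/tracing"; // side-effect import that enables startTransaction on older SDKs

Sentry.init({ dsn: process.env.SENTRY_DSN, tracesSampleRate: 1.0 });

async function handleShortcutRequest(shortcut: string) {
  // One transaction per touchpoint; Sentry times it end to end.
  const transaction = Sentry.startTransaction({
    op: "shortcut.resolve", // illustrative operation name
    name: `GET o/${shortcut}`,
  });

  // Child spans break the transaction into measurable parts.
  const lookup = transaction.startChild({ op: "db.query", description: "look up destination URL" });
  // ... fetch the destination for the shortcut ...
  lookup.finish();

  const redirect = transaction.startChild({ op: "http.redirect", description: "issue the redirect" });
  // ... send the user on their way ...
  redirect.finish();

  // The total duration is what gets compared against the 1.2 s misery threshold.
  transaction.finish();
}
```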
The bottlenecks were identified using Sentry’s User Misery score, which draws on the following:
Transactions per minute: How many times a given operation occurs in a minute
Latency: Measured as P50, P95, and P99
P50 - 50% of operations finish within this time (the median, not the average)
P95 - 95% of operations finish within this time; the slowest 5% take longer
P99 - 99% of operations finish within this time; this captures the worst of the tail
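To make those percentile definitions concrete, here is a tiny sketch (nearest-rank method) of how they fall out of a batch of latency samples; real monitoring tools compute them continuously over time windows:

```typescript
// Nearest-rank percentile over a batch of latency samples (in ms).
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [80, 90, 95, 120, 180, 230, 260, 310, 400, 1400];
console.log(percentile(latencies, 50)); // 180 ms: half the requests are faster
console.log(percentile(latencies, 95)); // 1400 ms: only the slowest 5% sit above this
console.log(percentile(latencies, 99)); // 1400 ms: the tail that an average would hide
```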
Another tool we set up to monitor all our Lambda functions, including the time taken to complete each transaction, is Lumigo.
By tagging every production release appropriately, we are able to ensure that performance doesn’t degrade over time.
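One common way to do this, shown here with the Sentry SDK, is to stamp every deploy with a release identifier at initialization so that regressions can be pinned to the deploy that introduced them; the release naming scheme below is a hypothetical example, not a prescribed format:

```typescript
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Hypothetical scheme: tag each prod deploy with the git SHA that produced it.
  release: `oslash@${process.env.GIT_SHA}`,
  environment: "production",
  tracesSampleRate: 0.2, // sample a fraction of transactions in production
});
```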
Once your product is launched, the best-case scenario is a sudden spike in traffic. To be sure we don’t falter at this crucial juncture, the squad helped us answer two crucial questions: how much traffic can we serve today, and where does the system break first?
Load Testing with Vegeta
Vegeta is an open-source HTTP load-testing tool that lets teams simulate heavy traffic. If you are gearing up for a big launch or PR push, it is highly recommended to simulate every condition and make sure the product does not break anywhere.
With Vegeta, we were able to figure out how long the most frequent transactions took under peak traffic conditions.
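As a rough sketch of what such a run looks like, here is a small Node script that drives the Vegeta CLI and reads back its JSON report; the target URL is a made-up example, and it assumes vegeta is installed and on the PATH:

```typescript
import { execSync } from "node:child_process";

// Hypothetical endpoint standing in for a real, frequently hit transaction.
const targets = "GET https://api.example.com/resolve?shortcut=roadmap\n";

// Hammer the endpoint at 100 req/s for 30 s, then emit a JSON report.
const report = execSync(
  "vegeta attack -rate=100 -duration=30s | vegeta report -type=json",
  { input: targets }
);

// Field names follow vegeta's JSON report; latencies are in nanoseconds.
const stats = JSON.parse(report.toString());
console.log("p95 latency (ms):", stats.latencies["95th"] / 1e6);
console.log("success rate:", stats.success);
```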
With help from Sentry and the data obtained from Vegeta, the squad made it possible for the engineering team to immediately fix the worst bottlenecks, which instantly made the product faster for all users.
Also, by linking all issues in Linear to Sentry, we were able to ensure that fixed issues don’t resurface in production.
Observability means being able to quickly find out anything you want to know about a system. In practice, it is a collection of systems (error trackers, uptime monitors, log aggregators, tracers) that together give a bird’s-eye view of all system components at any given point in time. Observability is also different from monitoring: monitoring tells you when errors you already know about (but haven’t fixed yet) happen again, while observability gives you more real-time information and helps you predict faults.
We wanted the observability system to be our single source of truth for all platform events (errors, alerts, system and app metrics) and to help us jump quickly to a particular flow or transaction when debugging and fixing issues.
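In practice, that means every captured event carries enough context to point back at the flow it came from. Here is a minimal sketch with the Sentry SDK; the tag names are illustrative, not our actual schema:

```typescript
import * as Sentry from "@sentry/node";

function recordFailure(err: Error, shortcut: string, userId: string) {
  Sentry.withScope((scope) => {
    scope.setTag("flow", "shortcut.resolve"); // which flow broke
    scope.setTag("shortcut", shortcut);       // which shortcut was requested
    scope.setUser({ id: userId });            // who was affected
    Sentry.captureException(err);             // now the alert is traceable to the flow
  });
}
```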
Product and engineering teams came together to identify important metrics for both teams and build an interactive dashboard together.
To get the dashboard built super fast, the Observability Squad used Retool.
With Retool, we’ve unearthed some great insights into the product. Each time a crucial number sees a sudden spike or an expected uptick, our hearts collectively skip a beat.
Incident management is a set of definitions and rules that answer questions like: How severe is an issue? Who responds to it, and when? How do we communicate it to users?
To make sure there is a process in place that can answer every question, the Incident Squad created playbooks that follow a carefully laid-out set of steps.
To make the whole process seamless, the incident squad ended up trying out a bunch of incident-management tools, such as PagerDuty, Opsgenie, VictorOps, and incident.io.
In our personal experience, incident.io ticked all the boxes we were looking for.
After classifying all issues by severity, the incident squad went on to describe how each class of issue would be communicated to users, who would be on call, and how that person would be monetarily compensated for the extra hours put in.
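For a sense of what those playbook rules can look like when written down, here is an illustrative sketch; the severity levels, response times, and channels are hypothetical examples, not OSlash’s actual policy:

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3";

interface SeverityPolicy {
  description: string;
  pageOnCall: boolean;              // wake up the on-call engineer?
  notifyUsers: boolean;             // post to the public status page?
  acknowledgeWithinMinutes: number; // how fast someone must respond
}

// Hypothetical classification; a real playbook would be tuned to the product.
const policies: Record<Severity, SeverityPolicy> = {
  SEV1: { description: "Product down or data at risk", pageOnCall: true, notifyUsers: true, acknowledgeWithinMinutes: 5 },
  SEV2: { description: "Major feature degraded", pageOnCall: true, notifyUsers: true, acknowledgeWithinMinutes: 30 },
  SEV3: { description: "Minor bug with a workaround", pageOnCall: false, notifyUsers: false, acknowledgeWithinMinutes: 240 },
};
```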
If you are looking at building your own version of Project Apollo, here are a few key tips that might prove helpful:
The entire mission took us two weeks to complete. In hindsight, to lessen the burden on the already stressed engineering team, we could have earmarked a couple more weeks for the activity.
We hope you found value in our experience. We wish to find you next to us as we travel...