Revisiting our implementation of driver authentication
Every software engineer has stared at a block of code and asked themselves:
> What the hell is this?
Only to realize it was their own code, which they wrote last month.
The word “spaghetti,” even out of context, may give you flashbacks of a few such systems you’re personally familiar with. Bottomless method invocations, out-of-date documentation, and no unit tests, oh my!
These problems compound one another, making such systems intimidating and tough for engineers to maintain or investigate.
Hey! My name is Zack and I’m a software engineer at Samsara. We’re a connected operations company that collects IoT data from sensors, cameras, and gateways. A large element of our fleet offering is the Samsara Driver mobile app. It’s built on React Native and designed for drivers to track their working hours, message their dispatchers, coordinate their deliveries, and more.
In addition to supporting these features, it also faces many technical challenges: it must operate while offline, run on very low-spec devices, and do a whole lot of math. Since its conception in 2016, it has come to be used by tens of thousands of customers, some of whom have thousands of users in their organization. While the app is supported by a dedicated mobile infrastructure team, its familiar React environment has facilitated development by dozens of feature engineers who were able to hop in, write their feature, and continue on.
🏃♀️ Transient engineering
However, because these contributors were often only passing through, features could be implemented without adhering to any standard patterns.
In 2019 the mobile team decided it was time to rewrite the whole app. Discussing the entire Samsara Driver rewrite would touch on hundreds of decisions, challenges, and lessons, and it deserves its own blog post. Instead, I would like to focus on a smaller piece: driver authentication.
Auth was initially in charge of simply signing a driver in and out, but it had grown responsible for much more. Many features depend on reliable setup and teardown, and for transient engineers, the authentication flow was the most tempting place to attach their functionality. Over time, this meant complexity increased while ownership and documentation became diluted.
🔥 Understanding the problem
By the time of the rewrite, the auth process had accrued several issues, the sum of which nobody dared touch. For example, it:
Used all of Redux, Flux, and component state. Together, these resulted in multiple sources of truth, multiple implementations for similar infrastructure, and confusing patterns for future development.
Could be initiated from many different places, with different optional parameters. Signin and signout were also spread across several different files, methods, and components.
Had accrued far too many responsibilities: handshaking with the backend, checking for app updates, conditionally showing a medley of different alerts/prompts, populating initial state, setting up or tearing down push notifications, and handling offline edge cases, to name a few. These are all valid things to do on auth, but over time the functionality became jumbled, making it very unwieldy.
Had only tiny shreds of documentation and no clear owner(s).
Had very little testing.
Was the source of several strange customer issues, including hard-to-reproduce race conditions and unintentional side effects brought on by other changes.
These issues together made further development intimidating and troublesome. Making even minor changes required extreme care and extensive testing, lest we cause cascading failures. We had lost faith in our own auth process.
We needed a solution which could restore our faith and lead us down a straight path, one which meant the mobile team would never have to loop back to revisit this dilemma again.
📝 Defining the shape of a great solution
Cleaning up the code so that it’s more readable, becoming the owner, and writing docs would all be pretty easy wins, but they were only short-term fixes. Entropy always prevails: documentation slowly becomes outdated, and the scope of the system grows and fractures. In a year, someone would have to untangle it all over again. Engineers naturally pursue their goals along the path of least resistance, and because we hadn’t created any guide rails, we had watched auth fall into chaos. We needed guide rails that made the lowest-energy approach also the “correct” approach.
In the spirit of rewriting the whole app, auth also needed a fresh start. But first we needed to answer an important question: what criteria does our ideal solution need to meet to be successful and stay successful?
Readability: Any dev unfamiliar with auth should be able to understand, edit, and add code to the system with little explanation. Code should be visible, not buried in methods across different files.
Testability: When making code changes, it’s important to be able to minimally test your changes without worrying about supplementary functionality or infrastructure. Our old auth approach was so jumbled it would take hundreds of lines of setup to test even a small change.
Modularity: The overall solution should not become more confusing as it gains responsibility. This is especially important for auth, since it’s a natural entry point for a lot of features. It’s also extremely handy to be able to remove functionality without making surgical cuts in several different spots, whether in response to an outage, slowly rolling out a feature, or just debugging an issue.
While brainstorming a solution we prioritized modularity, since separation of tasks naturally facilitates both readability and testability. We wanted a system which made it extremely difficult to accidentally step outside the boundary of a feature into the boundary of another. Separation needed to be naturally and strongly encouraged.
We turned auth into a modular queue of tasks, each of which corresponds to a particular Redux action. Redux proved extremely handy for this system: it does an excellent job of separating state from the logic that modifies it, it’s familiar to our engineers, who use it quite often, and it’s remarkably powerful (if you don’t know much about Redux, its landing page is extremely informative). A component initiating signin would, using Redux, dispatch the first startSignin action, kicking off the queue of tasks. Each task in turn dispatches the appropriate action and then moves on to the next task, until there are no tasks left in the list.
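The queue itself can be sketched in a few lines. This is a minimal sketch in plain TypeScript rather than real Redux, so that it runs on its own. `createAuthQueue`, the `AuthTask` shape, the action type strings, and the handler map are all hypothetical names chosen for illustration. Only `startSignin`, `executeNextAuthTask`, and `signinTasks` come from this post.

```typescript
// Hypothetical shapes; in the real app these would be Redux actions.
type AuthAction = { type: string };
type AuthDispatch = (action: AuthAction) => void;
// Each task is identified by the action that starts it.
type AuthTask = { name: string; startAction: AuthAction };

const START_SIGNIN = "auth/startSignin";
const EXECUTE_NEXT_AUTH_TASK = "auth/executeNextAuthTask";

function createAuthQueue(
  tasks: AuthTask[],
  handlers: Record<string, (dispatch: AuthDispatch) => void>,
) {
  let nextIndex = 0;
  const log: string[] = []; // every dispatched action type, in order

  const dispatch: AuthDispatch = (action) => {
    log.push(action.type);
    if (action.type === START_SIGNIN) {
      // Reset the cursor and kick off the first task.
      nextIndex = 0;
      dispatch({ type: EXECUTE_NEXT_AUTH_TASK });
    } else if (action.type === EXECUTE_NEXT_AUTH_TASK) {
      // Start the next task, if any remain in the list.
      const task = tasks[nextIndex];
      nextIndex += 1;
      if (task) dispatch(task.startAction);
    } else {
      // Any other action is routed to the task logic registered for it.
      const handler = handlers[action.type];
      if (handler) handler(dispatch);
    }
  };

  return { dispatch, log };
}

const signinTasks: AuthTask[] = [
  { name: "handshake", startAction: { type: "auth/handshake" } },
  { name: "registerPushToken", startAction: { type: "push/register" } },
];

const authQueue = createAuthQueue(signinTasks, {
  "auth/handshake": (dispatch) => dispatch({ type: EXECUTE_NEXT_AUTH_TASK }),
  "push/register": (dispatch) => dispatch({ type: EXECUTE_NEXT_AUTH_TASK }),
});

authQueue.dispatch({ type: START_SIGNIN });
// authQueue.log now reads: startSignin, executeNextAuthTask, handshake,
// executeNextAuthTask, push/register, executeNextAuthTask
```

The component that starts signin only knows about `startSignin`. Everything else is driven by the order of `signinTasks`.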
One responsibility of auth is to register the device’s push token, a device identifier which allows us to send notifications. Push token registration as a component of the signin flow looks something like this:
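A minimal sketch of such a task follows. It is written in plain TypeScript, not taken from the Samsara codebase. The `registerPushToken` and `pushTokenRegistered` action names, and the `getPushToken` and `sendTokenToBackend` helpers, are assumptions made for illustration. Only `executeNextAuthTask` is named in this post.

```typescript
// Hypothetical actions for this one task, plus the shared "advance" action.
type AuthAction =
  | { type: "registerPushToken" }
  | { type: "pushTokenRegistered"; token: string }
  | { type: "executeNextAuthTask" };
type AuthDispatch = (action: AuthAction) => void;

// Hypothetical dependencies, injected so the task is easy to test in isolation.
interface PushTokenDeps {
  getPushToken: () => string | null; // e.g. read from the OS push service
  sendTokenToBackend: (token: string) => void;
}

function registerPushTokenTask(dispatch: AuthDispatch, deps: PushTokenDeps): void {
  const token = deps.getPushToken();
  if (token !== null) {
    deps.sendTokenToBackend(token);
    dispatch({ type: "pushTokenRegistered", token });
  }
  // Always hand control back to the queue, even when there is no token.
  dispatch({ type: "executeNextAuthTask" });
}

// Usage with a recording dispatch and fake dependencies.
const dispatched: AuthAction[] = [];
const sentTokens: string[] = [];
registerPushTokenTask((action) => dispatched.push(action), {
  getPushToken: () => "abc123",
  sendTokenToBackend: (token) => sentTokens.push(token),
});
// dispatched: pushTokenRegistered (token "abc123"), then executeNextAuthTask
```

Because the task only talks to `dispatch` and its injected helpers, a test needs nothing but a recording function and two fakes.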
After finishing a task, the code must dispatch executeNextAuthTask, which tells the system to continue with the next task. Every auth task must eventually dispatch this action.
This queue works much like crossing items off a to-do list (signinTasks). The list can be started at any time. You then go down it sequentially, starting tasks and finishing them off one by one. Each task requires a different initial action, may involve sub-tasks, and completes on a different cadence. Because executeNextAuthTask can be dispatched at any time, a task can also hand control to the next one while its own work is still in progress. Running tasks asynchronously like this lets the queue proceed more quickly without needing to wait, whereas a synchronous task holds up everything behind it and could stop the application from showing anything before it was ready. In the same way you don’t need to stare at the washing machine while doing laundry, downloading an update shouldn’t block registering a push token.
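The difference between a task that waits and one that does not can be sketched as follows. This is an illustration under stated assumptions, not the app's code. `Step`, `runTasks`, and `timeline` are hypothetical names, calling `next()` stands in for dispatching `executeNextAuthTask`, and the `pendingWork` array simulates background work (a network download, for instance) so the example stays synchronous and deterministic.

```typescript
type Step = (next: () => void) => void;

const timeline: string[] = [];
// Stand-in for real async work: callbacks queued here run later, when drained.
const pendingWork: Array<() => void> = [];

// Non-blocking task: start the download, then advance the queue right away.
const downloadUpdate: Step = (next) => {
  timeline.push("download started");
  pendingWork.push(() => {
    timeline.push("download finished");
  });
  next(); // plays the role of dispatching executeNextAuthTask
};

// Blocking task: only advances once its own work is done.
const registerPushToken: Step = (next) => {
  timeline.push("push token registered");
  next();
};

function runTasks(tasks: Step[]): void {
  let index = 0;
  const next = (): void => {
    const task = tasks[index];
    index += 1;
    if (task) task(next);
    else timeline.push("signin complete");
  };
  next();
}

runTasks([downloadUpdate, registerPushToken]);
// Simulate the background work completing after the queue has finished.
for (const work of pendingWork.splice(0)) work();
// timeline: download started, push token registered, signin complete,
// download finished
```

The push token is registered and signin completes before the download finishes, which is the laundry analogy in code form.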
While tasks could come in hundreds of different flavors, the infrastructure which keeps everything organized and separate is the queue/list.
Engineers had a single entry point for all signin-related tasks and could trace functionality from there.
🧐 An ideal system?
Does this solution meet our criteria? Does it fit the shape we defined?
Readability: Task definitions are structured to encourage detailed documentation that describes the goals of that task, all actions dispatched by it, and how those actions eventually dispatch executeNextAuthTask. This makes it very easy for others to learn about the responsibilities and inner workings of a task.
Testability: Task logic is isolated, so an engineer only needs to consider what that specific task does to be able to test it properly. Separate infrastructure tests ensure the system progresses through tasks as expected.
Modularity: Tasks are defined in a single list, which we can easily add to and remove from. Developers could conceivably add hundreds of tasks to our list and the act of editing, testing, or removing a task would remain just as simple.
This modularity paid off in one particularly tough outage — because the issue was isolated to a single task we could quickly remove it and roll out a temporary solution.
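Since every task lives in one list, taking a task out of the flow is a one-line change. The simplest form is deleting the entry. The sketch below shows a flag-based variant instead, which is an assumption on my part and not a description of how the outage was actually handled. The task names, the `AuthTask` shape, and `activeTasks` are hypothetical.

```typescript
type AuthTask = { name: string; startAction: { type: string } };

const signinTasks: AuthTask[] = [
  { name: "handshake", startAction: { type: "auth/handshake" } },
  { name: "checkForAppUpdate", startAction: { type: "update/check" } },
  { name: "registerPushToken", startAction: { type: "push/register" } },
];

// Hypothetical kill switch: names of tasks to skip, e.g. from remote config.
function activeTasks(tasks: AuthTask[], disabled: ReadonlySet<string>): AuthTask[] {
  return tasks.filter((task) => !disabled.has(task.name));
}

// During an incident, drop one task without touching the others.
const duringOutage = activeTasks(signinTasks, new Set(["checkForAppUpdate"]));
const activeNames = duringOutage.map((task) => task.name);
// activeNames: ["handshake", "registerPushToken"]
```

The original list is left untouched, so re-enabling the task is just a matter of clearing the flag.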
Since the initial implementation of this project in September 2019, very little has changed about the infrastructure. When glancing back over auth code, the primary difference our mobile team has noticed is how much its responsibilities have grown, and how well this new system has supported this growth. The number of tasks has almost doubled, and feature engineers have been able to contribute within this framework with minimal input from the mobile team.