Chamber 🏰 of Tech Secrets #1
Exploring and mitigating the risks of tribal knowledge on building and running software
The chamber of tech secrets is officially open! In this inaugural edition, we’ll explore Secret #1 which is about Tribal Knowledge and how it shows up in both build and buy scenarios. I have some advice to offer for both enterprises and product builders.
I’d be remiss if I jumped into a topic without acknowledging that it has been a crazy weekend in technology and finance with the collapse of Silicon Valley Bank. I know this is causing tons of stress on lots of people. My best wishes to all those who are stressed over retrieving cash they were depending on for their business or those trying to figure out how to make payroll next week.
Fighting Tribal Knowledge
Our Chick-fil-A EA Principles: Build vs Buy post triggered some great questions. One that surprised me was a question about tribal knowledge, and whether it is more likely to emerge and cause problems in build situations. This matters because every organization wants to be able to 1) operate its important business systems and 2) continue to build in the future. They do not want human points of failure to prevent that.
Tribal knowledge can exist in both build and buy scenarios. Here’s why:
Any time we build or buy, we assume responsibility for several buckets of knowledge…
Product knowledge: what the product is supposed to do, why it was created, and where it is headed in the future.
Architecture knowledge: how the discrete parts of the solution and all of its supporting pieces fit together. This could include cloud infrastructure, data and CI/CD pipelines, microservices orchestration, messaging services, databases, metrics / log collection and analysis, etc.
Implementation knowledge: what the code is supposed to do and how it works. I’ll put project structure and programming / testing approaches in this bucket, too.
Operational knowledge: what issues can / have occurred and how we respond to them.
Enterprise Architecture knowledge: how the application / product fits into the larger enterprise.
When we build, these are fairly obvious. When we buy a solution, we still need a product owner who accounts for much of this knowledge: product knowledge, operational knowledge, and enterprise architecture knowledge. If there are custom integrations in play, architecture knowledge and implementation knowledge are likely back on the table too.
In my experience, build scenarios have larger teams, which potentially reduces turnover risk. Buy scenarios often have a single person who manages them completely (helps scale, yay! increases risk, boo!).
In both build and buy cases, we are exposed to potential tribal knowledge risks stemming from:
Solution Complexity—the solution may be difficult to understand and therefore support at any of the aforementioned levels of knowledge.
Context Deficit—the lack of knowledge about how things were supposed to work, why things were built this way, or why we even have this product in the first place.
Hero Culture—things look better than they are because a hero keeps swooping in to save the day whenever things go wrong, and they become a part of the actual architecture.
Employee Turnover—knowledge egress, context egress, or hero egress
Thoughts for Builders and Buyers
To solve some of the challenges associated with Tribal Knowledge, I’d suggest a few practices:
MVD: That’s minimum viable documentation. Document the non-obvious and why it’s important, and do so in the right places (in code, git issues, wiki, ADRs, etc.). But don’t document too much. Documentation comes with a cost of upkeep and maintenance. Just like every line of code is a liability, so is every sentence. Just enough is enough. Have you ever searched and found internal documentation that was out of date and therefore both useless and confusing? 🙋♂️
Use the Code: Well-written code can serve as a minimalist sort of documentation. Include minimal commenting, primarily documenting the “why” when less intuitive things are happening. Tests are critical too. I can’t count the number of times I have quickly figured out how something worked by running test cases and adding debug breakpoints along the way. Infrastructure as code applies here as well.
Cross-train: Kill the “hero culture” by making sure that there are always several people who are capable of doing what’s important. Those who CAN DO should teach. Move from hero to force multiplier. Don’t just solve the hard problems and be celebrated; teach others how you thought through them. Think about how you built your foundation and send others on the journey.
Standardize: Baking best practices into platforms and having some baseline standards means that things get done in consistent ways and are therefore easier to understand by others in the future. No need to learn a new CI/CD product or branching strategy to jump in and help out when all your teams do it more or less the same. The more the “right thing” can be the “easy thing” by making concerns into platforms, the better. It also means less standards documentation (see above) to maintain.
Build a culture of sharing: Intentionally invest in the people around you through learning gatherings, coaching, and mentoring. At Chick-fil-A, we used to have an Architecture Forum where people could bring problems and brainstorm solutions together. Sadly, we outgrew that model. We are about to restart a monthly learning environment inspired by my friend Bryan Finster’s DevOps Days program at Walmart.
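To make the “Use the Code” practice a bit more concrete, here is a minimal Python sketch of what a “why” comment plus tests-as-documentation could look like. Everything in it is hypothetical and invented for illustration: the function name, the batch limit of 50, and the downstream API behavior described in the comment are assumptions, not a description of any real system.

```python
import unittest

# Hypothetical example: the function, the constant, and the downstream
# behavior described in the comments are all invented for illustration.
MAX_BATCH_SIZE = 50


def split_into_batches(order_ids, batch_size=MAX_BATCH_SIZE):
    """Split order IDs into batches for a (hypothetical) downstream API."""
    # WHY: in this made-up scenario the downstream API silently drops
    # everything past 50 items per request instead of returning an error.
    # The cap is deliberate, so callers cannot raise it by accident.
    size = min(batch_size, MAX_BATCH_SIZE)
    return [order_ids[i:i + size] for i in range(0, len(order_ids), size)]


class SplitIntoBatchesTest(unittest.TestCase):
    """Tests double as executable documentation of the intended behavior."""

    def test_never_exceeds_downstream_limit(self):
        # Even if a caller asks for 500 per batch, the cap of 50 wins.
        batches = split_into_batches(list(range(120)), batch_size=500)
        self.assertEqual([len(b) for b in batches], [50, 50, 20])

    def test_empty_input_yields_no_batches(self):
        self.assertEqual(split_into_batches([]), [])
```

The comment records the one thing a newcomer could not infer from reading the code (why the cap exists), and the test names describe the expected behavior in plain language. Running the file with `python -m unittest` and stepping through a test in a debugger is one way for someone new to pick up that context without needing the original author.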
These are fairly obvious points, so why doesn’t everyone do them?
Scarcity of resources. These things are intuitive, but require execution. People are time-constrained and generally respond to incentives, so make sure that you have a culture that rewards the behaviors you want to see. When was the last time someone was publicly recognized for writing an extremely concise and useful document or highlighted for writing a masterful comment in their code? When was the last time someone was publicly honored for stepping in and restoring a critical production system in the middle of a holiday weekend while taking a short break from snorkeling with sea turtles in the Galapagos Islands while they were on vacation with their family? Be careful what you celebrate: if you reward tribal knowledge and hero culture, that’s what you’ll get.
Secrets from the Edge
I enjoyed reading this, and it got me thinking more clearly about a lot of additional ideas I have about the edge and how it could manifest in the future. I’ll save my thoughts for a future “Secret”.
“Edge Native” is a new term to me. We considered most of these principles in designing Chick-fil-A’s Edge solution (trying to be as similar to cloud as possible while acknowledging our numerous constraints). It is nice to see a more formal comparison between cloud and edge, acknowledging where it makes sense to build similarly and differently.
I had some fun discussions this week about the future of Chick-fil-A’s Edge infrastructure and how we’ll accommodate some of the machine learning workloads that appear to be in our near future. There are some interesting challenges to account for and I look forward to unpacking them later down the road when we have a field-deployed solution.
If you enjoyed this and found it helpful, make sure to subscribe for future updates. I’d also be honored if you shared on Twitter or LinkedIn (or both), especially if you have any disagreements. I’m here to learn. Have a great week!