Engineering Excellence

Why This Matters at the Director/VP Level

When you were an engineering manager, engineering excellence meant making sure your team wrote good code, reviewed PRs thoroughly, and kept test coverage high. At the Director/VP level, the game changes completely. You are no longer ensuring excellence on one team -- you are responsible for raising the bar across dozens of teams, hundreds of engineers, and potentially millions of lines of code.

The challenge is that you cannot be in every code review. You cannot attend every design meeting. You cannot personally evaluate every architectural decision. So the question becomes: how do you create the conditions, systems, and culture where engineering excellence happens consistently, even when you are not in the room?

That is what this chapter is about.


Setting Engineering Principles and Standards

Principles vs. Rules

There is an important distinction between principles and rules, and getting this wrong causes real damage.

Rules are specific and prescriptive: "All services must use PostgreSQL." "Every PR requires two approvals." "Test coverage must be above 80 percent."

Principles are directional and require judgment: "Choose boring technology unless there is a compelling reason not to." "Optimize for readability over cleverness." "Design for failure."

At the Director/VP level, you should lean heavily toward principles. Here is why: rules do not scale across diverse teams working on different problems. A rule that makes perfect sense for your backend platform team might be actively harmful for your data science team. Principles give teams a shared framework for making decisions while preserving the autonomy they need to do their best work.

That said, there are areas where you do need standards -- hard rules that apply everywhere. These tend to be around:

  • Security -- Authentication, authorization, secrets management, data handling
  • Reliability -- SLOs, incident response, on-call expectations
  • Compliance -- Regulatory requirements that are not optional
  • Interoperability -- API contracts, data formats, shared infrastructure

The art is knowing which category a given decision falls into. When in doubt, start with a principle and only graduate to a standard when you have seen enough evidence that unguided judgment is leading to problems.

How to Create Principles That Stick

I have seen dozens of engineering principles documents. Most of them are forgotten within weeks of being published. The ones that actually stick share a few characteristics:

  1. They are short. Five to seven principles, each explainable in a sentence or two. If your principles document is ten pages long, nobody will remember it.

  2. They reflect real tradeoffs. "Write good code" is not a principle because nobody advocates for the opposite. "Optimize for delivery speed over architectural purity" is a principle because it tells you what to prioritize when two good things conflict.

  3. They are developed collaboratively. Principles imposed from above get eye-rolls. Principles developed with input from senior engineers across the organization get buy-in. Run workshops. Solicit drafts. Debate in public.

  4. They are referenced in real decisions. The test of whether a principle is alive is whether people cite it in design reviews, architectural discussions, and retrospectives. If nobody ever says "this aligns with our principle of X," the principle is dead.

  5. They evolve. Review your principles annually. As your organization matures, your tradeoffs change. A startup principle of "move fast and break things" should evolve as the product gains customers who depend on its reliability.

Here is an example set that I have seen work well for a growth-stage company:

  • Own what you build. Teams are responsible for the full lifecycle of their services, from design through operation.
  • Choose boring technology. Default to proven, well-understood tools. Novel technology requires an explicit case for why the risk is worth it.
  • Design for failure. Assume every dependency will fail. Build systems that degrade gracefully.
  • Prefer simple over clever. The engineer who maintains this code in two years should be able to understand it quickly.
  • Make reversible decisions quickly. Reserve heavy process for decisions that are hard to undo.
  • Instrument everything. If you cannot observe it, you cannot operate it.

Best Practices Playbooks

Playbooks are the operational layer beneath your principles. They codify how your organization does common things -- not just what you believe, but how you execute.

What Deserves a Playbook

Not everything needs a playbook. Focus on activities that:

  • Happen frequently across many teams
  • Have a significant impact when done poorly
  • Benefit from consistency across the organization
  • Are hard for new engineers to figure out on their own

Good candidates for playbooks:

  • How to launch a new service -- Infrastructure provisioning, observability setup, runbook creation, security review, load testing
  • How to handle an incident -- Detection, escalation, communication, mitigation, retrospective
  • How to conduct a design review -- When a review is needed, what the document should cover, who reviews, how decisions are recorded
  • How to do a production deployment -- Feature flags, canary rollouts, rollback procedures, monitoring
  • How to deprecate a system -- Migration planning, communication, timeline, support during transition
  • How to onboard to a new team -- First week, first month, key contacts, codebase orientation

Writing Playbooks That People Actually Use

The key insight is that playbooks should be living documents maintained by the people who use them, not static documents written by leadership and imposed on teams.

Here is the pattern that works:

  1. Start with the best existing practice. Find the team that does this thing best. Document what they do.
  2. Circulate for feedback. Other teams will have improvements, edge cases, and objections. Incorporate them.
  3. Publish with clear ownership. Every playbook needs an owner -- a person or team responsible for keeping it current.
  4. Review quarterly. Set a calendar reminder. If a playbook has not been updated in six months, it is probably stale.
  5. Make them discoverable. A playbook nobody can find is useless. Centralize them in a well-known location (internal wiki, docs site, etc.).

One anti-pattern to watch for: playbooks that are so detailed they become rigid procedures. A good playbook provides guidance and captures lessons learned. It does not remove all judgment from the process. Leave room for teams to adapt the playbook to their specific context.
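
The quarterly review step is easy to automate in a small way. Here is a minimal sketch, assuming each playbook records an owner and a last-updated date; the Playbook fields, the 182-day threshold (my approximation of six months), and the catalog entries are all hypothetical, not a description of any particular wiki or docs tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: roughly six months, matching the rule of thumb above.
STALE_AFTER = timedelta(days=182)


@dataclass
class Playbook:
    title: str
    owner: str          # person or team responsible for keeping it current
    last_updated: date


def stale_playbooks(playbooks, today):
    """Return playbooks that have not been updated within STALE_AFTER."""
    return [p for p in playbooks if today - p.last_updated > STALE_AFTER]


# Hypothetical catalog entries, for illustration only.
catalog = [
    Playbook("Launching a new service", "platform-team", date(2024, 1, 10)),
    Playbook("Handling an incident", "sre-team", date(2024, 8, 1)),
]

for p in stale_playbooks(catalog, today=date(2024, 9, 1)):
    print(f"{p.title}: last updated {p.last_updated}, ping {p.owner}")
```

A report like this is a prompt for the owner to take a look, not proof that the content is wrong. A playbook can be old and still accurate.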


Tech Radar: Adopt, Trial, Assess, Hold

The Tech Radar is one of the most useful frameworks I have seen for managing technology choices at scale. Popularized by ThoughtWorks, it categorizes technologies into four rings:

  • Adopt -- Proven, recommended, use by default. These are your standard tools and technologies.
  • Trial -- Promising, approved for use in non-critical contexts. Teams can experiment with these.
  • Assess -- Interesting, worth investigating. Not yet approved for production use, but teams are encouraged to evaluate them.
  • Hold -- Do not start new projects with these. Existing usage is tolerated but should be migrated away from over time.

Why This Matters

Without a tech radar (or something like it), organizations tend to drift into one of two failure modes:

Failure mode 1: Technology anarchy. Every team picks whatever they want. You end up with five different message queues, three different web frameworks, and two different container orchestration systems. Operational complexity goes through the roof. Knowledge sharing becomes impossible. Your platform team cannot support anything because there are too many things to support.

Failure mode 2: Technology dictatorship. A central architecture team mandates every technology choice. Teams lose autonomy. Innovation stalls. Engineers feel disempowered. The mandated choices are often wrong for specific use cases because one-size-fits-all does not work in a diverse engineering organization.

The tech radar threads the needle: it provides clear defaults while creating structured pathways for innovation and evolution.

Running a Tech Radar Process

Here is how I have seen this work well in practice:

  1. Form a tech radar committee. This should be a rotating group of senior engineers and architects from across the organization, not a fixed central team. Rotation prevents the committee from becoming a bottleneck or an ivory tower.

  2. Quarterly review cadence. Once a quarter, the committee reviews proposals to move technologies between rings. Any engineer can submit a proposal.

  3. Evidence-based decisions. Moving a technology from Assess to Trial requires a written evaluation. Moving from Trial to Adopt requires production experience and a recommendation from the team that trialed it. Moving to Hold requires a migration plan.

  4. Publish and communicate. The radar should be visible to the entire engineering organization. When changes are made, explain the reasoning.

  5. Connect to architecture reviews. When teams propose new architectures, the tech radar is the starting point for technology selection. Deviations from "Adopt" technologies require justification.

One thing to watch out for: the tech radar should not be a way to freeze technology choices forever. If your Adopt ring never changes, you are not evolving. The goal is managed evolution, not stasis.
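
The evidence requirements in step 3 can be written down as data, which makes proposals easier to check before the committee meets. This is a sketch under stated assumptions: the evidence labels are names I made up for the requirements described above, and treating any move not listed there as undefined is my own simplification.

```python
from enum import Enum


class Ring(Enum):
    ASSESS = "assess"
    TRIAL = "trial"
    ADOPT = "adopt"
    HOLD = "hold"


# Evidence labels are hypothetical names for the requirements described above.
REQUIRED_EVIDENCE = {
    (Ring.ASSESS, Ring.TRIAL): {"written_evaluation"},
    (Ring.TRIAL, Ring.ADOPT): {"production_experience", "team_recommendation"},
}


def missing_evidence(current, target, evidence):
    """Return the evidence still needed before the move can be approved."""
    if target is Ring.HOLD:
        required = {"migration_plan"}  # any move to Hold needs a migration plan
    else:
        required = REQUIRED_EVIDENCE.get((current, target))
        if required is None:
            # Assumption: moves not described in the process are rejected.
            raise ValueError(f"no defined path from {current.value} to {target.value}")
    return required - set(evidence)


print(missing_evidence(Ring.TRIAL, Ring.ADOPT, {"production_experience"}))
# prints {'team_recommendation'}
```

An empty result means the proposal is ready for discussion. The committee still makes the call; the check only confirms that the paperwork exists.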


Raising the Bar Across Teams

This is perhaps the hardest challenge at the Director/VP level: how do you consistently raise the quality bar across an entire organization?

Identify and Spread Best Practices

The best practices already exist somewhere in your organization. Your job is to find them, validate them, and help them spread.

Practical mechanisms:

  • Engineering quality reviews. Periodically review a representative sample of PRs, design docs, and incident retrospectives from different teams. Not to judge, but to identify patterns -- both positive and negative.
  • Cross-team pairing. When one team is excellent at something (testing, observability, documentation), create opportunities for engineers from other teams to pair with them.
  • Internal case studies. When a team does something particularly well, ask them to write up what they did and why. Share it broadly.
  • Blameless retrospectives on quality failures. When a major bug, outage, or security issue occurs, the retrospective should not just identify what went wrong -- it should examine what systemic conditions allowed the problem to occur and what changes would prevent similar issues across all teams.

The Role of Staff+ Engineers

Your Staff, Principal, and Distinguished Engineers are your primary lever for raising the bar. They should be:

  • Setting technical direction through architecture documents and design reviews
  • Mentoring senior engineers on their teams and across teams
  • Identifying systemic technical problems and proposing solutions
  • Modeling the engineering practices you want to see

If your Staff+ engineers are heads-down writing code on a single team and not influencing the broader organization, you are underutilizing them. Work with them to define their scope of impact and ensure they have time for cross-cutting work.

Using Metrics Wisely

Metrics can help you understand quality trends, but they can also be gamed and misused. Here is my guidance:

Useful metrics (when tracked as trends, not targets):

  • Deployment frequency and lead time
  • Change failure rate
  • Mean time to recovery from incidents
  • Time spent on unplanned work vs. planned work
  • Age and volume of critical bugs

Dangerous metrics (when used as targets):

  • Lines of code
  • Number of PRs
  • Test coverage percentage (as an absolute number rather than a trend)
  • Story points completed

The difference is that useful metrics measure outcomes (how fast, how reliably, how sustainably you are delivering value), while dangerous metrics measure activity (how much stuff you produced). Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure.
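
To make the outcome metrics concrete, here is a rough sketch of how three of them could be derived from deployment records. The Deployment record shape is hypothetical, and measuring recovery from the moment of the failed deployment is a simplification; many teams measure from when the incident was detected instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    """Summarize one period of deployments into three outcome metrics."""
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_week": len(deployments) / (period_days / 7),
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Hypothetical week of deployments, for illustration only.
start = datetime(2024, 3, 4, 9, 0)
sample = [
    Deployment(start),
    Deployment(start + timedelta(days=1), failed=True,
               restored_at=start + timedelta(days=1, minutes=45)),
    Deployment(start + timedelta(days=3)),
    Deployment(start + timedelta(days=6)),
]
print(delivery_metrics(sample, period_days=7))
```

Run this per period and compare the results over time. A single snapshot says little, which is the point of tracking trends rather than targets.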


Architecture Review Boards

Architecture reviews are essential for maintaining technical coherence across a large engineering organization. But they can also become bottlenecks, ivory towers, or rubber stamp exercises if not designed carefully.

Designing Reviews That Add Value

The best architecture review processes I have seen share these characteristics:

  1. Clear criteria for when a review is needed. Not every change needs a review. Define triggers: new services, new external dependencies, changes affecting multiple teams, significant data model changes, security-sensitive changes. If everything requires a review, the process will collapse under its own weight.

  2. Lightweight format. A one-page design document that covers: the problem, proposed solution, alternatives considered, risks, and rollback plan. Not a 30-page specification.

  3. The right reviewers. Include people who will be affected by the decision, people with relevant expertise, and at least one person from outside the proposing team. Rotate reviewers to spread knowledge.

  4. Asynchronous by default, synchronous when needed. Most reviews can happen asynchronously in a shared document. Schedule a meeting only when there is genuine disagreement or when the decision is particularly consequential.

  5. Decision records. Document what was decided, why, and what alternatives were rejected. These records are gold for future engineers who wonder "why did we build it this way?"

  6. Time-boxed. Reviews should complete within a defined timeframe (e.g., one week for async, one meeting for sync). If a review drags on, it is usually a sign that the problem is not well-defined or that there are unresolved disagreements that need to be surfaced explicitly.
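
The triggers in item 1 work best when they are explicit enough that a team can answer "do we need a review?" without asking anyone. A minimal sketch, assuming a proposal is described by a handful of yes/no facts; the ChangeProposal fields are hypothetical and simply mirror the triggers listed above.

```python
from dataclasses import dataclass


@dataclass
class ChangeProposal:
    title: str
    adds_new_service: bool = False
    adds_external_dependency: bool = False
    teams_affected: int = 1
    changes_data_model_significantly: bool = False
    security_sensitive: bool = False


def review_triggers(change):
    """Return the reasons this change needs an architecture review (empty if none)."""
    checks = [
        (change.adds_new_service, "new service"),
        (change.adds_external_dependency, "new external dependency"),
        (change.teams_affected > 1, "affects multiple teams"),
        (change.changes_data_model_significantly, "significant data model change"),
        (change.security_sensitive, "security-sensitive change"),
    ]
    return [reason for triggered, reason in checks if triggered]


# Hypothetical proposal, for illustration only.
proposal = ChangeProposal("Add payments webhook",
                          adds_external_dependency=True, teams_affected=2)
print(review_triggers(proposal))
# prints ['new external dependency', 'affects multiple teams']
```

Returning the reasons rather than a bare yes/no gives reviewers a starting point and makes it obvious when a change needs no review at all.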

Common Architecture Review Anti-Patterns

  • The approval committee. Reviews become a gate where a small group of senior people approve or reject proposals, creating a bottleneck and disempowering teams. Reviews should be collaborative, not adversarial.
  • Review theater. Reviews happen but decisions are not actually influenced by the feedback. This wastes everyone's time and teaches people that the process is performative.
  • Scope creep. A review of a specific architectural decision turns into a debate about the team's entire technical strategy. Stay focused on the decision at hand.
  • Premature standardization. The review board mandates a solution before the problem space is well-understood. Sometimes the right answer is "try two approaches and see which works better."

Internal Tech Talks and Knowledge Sharing

Knowledge sharing is the circulatory system of engineering excellence. When knowledge flows freely, teams learn from each other's mistakes, best practices spread organically, and engineers grow faster.

Formats That Work

  • Weekly tech talks. 30-45 minutes, one presenter, open to the whole engineering org. Topics range from deep dives on specific systems to lessons from incidents to introductions of new technologies. Record them for people who cannot attend live.

  • Architecture show-and-tell. Monthly sessions where teams present significant architectural decisions they have made recently. This is particularly valuable for spreading awareness of how different parts of the system work.

  • Incident review presentations. When a significant incident occurs, present the retrospective findings to the broader org. Focus on what everyone can learn, not on who made a mistake.

  • Reading groups. Small groups that read and discuss technical papers, books, or blog posts together. These tend to be self-organizing and low-overhead.

  • Internal blog or engineering newsletter. A written channel for sharing technical insights, project updates, and lessons learned. This scales better than synchronous presentations for large organizations.

  • Guilds or communities of practice. Cross-team groups organized around a shared interest (e.g., frontend engineering, data engineering, security). They meet regularly to share knowledge, align on practices, and identify common problems.

Making Knowledge Sharing Sustainable

The biggest challenge with knowledge sharing programs is sustaining them over time. Here is what I have seen work:

  • Make it easy to present. Reduce the barrier. A 15-minute talk with no slides is fine. Not everything needs to be a polished conference presentation.
  • Rotate the organizing responsibility. Each team takes a turn organizing the weekly tech talk. This distributes the work and ensures diverse topics.
  • Celebrate sharing. Recognize people who contribute to knowledge sharing in performance reviews and public forums. Make it clear that teaching others is valued.
  • Protect the time. If knowledge sharing sessions are constantly cancelled or deprioritized, the message is clear: it does not matter. Put it on the calendar and defend it.
  • Connect it to real problems. The best tech talks come from real work: "Here is a problem we faced, here is how we solved it, here is what we learned." Abstract theoretical talks have their place, but practical war stories resonate more.

Balancing Autonomy and Consistency

This is the central tension of engineering excellence at scale, and there is no perfect answer. Let me share a framework for thinking about it.

The Spectrum

Imagine a spectrum from full autonomy ("every team decides everything independently") to full consistency ("every team does everything the same way").

Neither extreme works. Full autonomy leads to fragmentation, duplicated effort, and operational chaos. Full consistency leads to rigidity, slow decision-making, and teams that cannot optimize for their specific context.

Where to Be Consistent

Consistency matters most where:

  • The cost of inconsistency is high. Security practices, incident response, data privacy -- these need to be consistent because a weak link anywhere puts the whole organization at risk.
  • Teams need to interoperate. API standards, data formats, deployment pipelines -- consistency here reduces friction at team boundaries.
  • Knowledge needs to be portable. If every team uses a different web framework, an engineer moving between teams has to start from scratch. Common technology choices make internal mobility easier.

Where to Allow Autonomy

Autonomy matters most where:

  • Teams face different problems. A team building real-time data pipelines has different technical needs than a team building a CRUD web application. Forcing them to use the same tools is counterproductive.
  • Innovation is needed. If you mandate every technology choice, you will never discover better approaches. Teams need space to experiment.
  • Ownership drives quality. Teams that choose their own tools and approaches feel more ownership over the results. This drives higher quality and faster problem-solving.

The Practical Approach

Here is how I think about it in practice:

  1. Infrastructure and platform: high consistency. Everyone uses the same deployment pipeline, monitoring stack, and CI/CD system. The platform team provides this as a service.
  2. Security and compliance: mandatory consistency. Non-negotiable standards with automated enforcement where possible.
  3. Application-level technology choices: guided autonomy. The tech radar provides defaults and boundaries. Teams can deviate with justification.
  4. Team-level practices: high autonomy. How a team runs sprints, how they do code reviews, how they organize their codebase -- these are team decisions. Offer guidance through playbooks, but do not mandate.

Real-World Examples

Example 1: The Tech Radar That Prevented a Crisis

A VP of Engineering at a fintech company introduced a tech radar process after discovering that teams had adopted four different databases, three different message queues, and two different container orchestration platforms. The operational burden on their small infrastructure team was unsustainable.

Through the tech radar process, they consolidated to two databases (one relational, one document store), one message queue, and one container orchestration platform. Technologies in the "Hold" ring were given 18-month migration timelines.

The process was not painless -- some teams felt their autonomy was being restricted. But the VP was transparent about the tradeoff: "We are a 150-person engineering team that cannot afford to operate like we are 1,500 people." Within a year, operational incidents decreased by 30 percent, and the infrastructure team went from firefighting to actually building new capabilities.

Example 2: Architecture Reviews That Engineers Actually Valued

A Director of Engineering inherited an architecture review process that engineers universally dreaded. Reviews took weeks, feedback was often contradictory, and the review board was seen as an adversarial gate rather than a collaborative resource.

She redesigned the process with three changes:

  1. Reduced scope. Only decisions meeting specific criteria (new services, new external dependencies, cross-team impact) required review. This cut the number of reviews by 60 percent.
  2. Changed the format. Instead of a formal presentation to a panel, the proposing team shared a one-page document asynchronously. Reviewers had three days to comment. A synchronous meeting was scheduled only if there were unresolved disagreements.
  3. Reframed the purpose. She explicitly stated that reviews were about improving proposals, not approving or rejecting them. The review board's job was to ask good questions and share relevant experience, not to make the decision.

Within two quarters, satisfaction with the architecture review process went from 2.3/5 to 4.1/5 in the engineering survey. More importantly, teams started requesting reviews voluntarily for decisions that did not meet the mandatory criteria, because they found the feedback genuinely helpful.

Example 3: Spreading Excellence Through Guilds

A VP noticed that frontend engineering quality varied wildly across teams. Some teams had excellent testing practices, accessible UIs, and fast page loads. Others had brittle frontends with poor user experience.

Rather than mandating standards from the top, she sponsored a Frontend Guild -- a cross-team group of frontend engineers who met biweekly. The guild:

  • Developed shared component libraries that any team could use
  • Created a frontend playbook with testing strategies, accessibility guidelines, and performance budgets
  • Ran monthly "frontend clinic" sessions where teams could bring their thorny UI problems for collaborative problem-solving
  • Identified and championed new tools (which then went through the tech radar process)

Over the course of a year, the quality gap between the best and worst frontend teams narrowed significantly. And because the standards came from the engineers themselves rather than from management, adoption was high and resistance was low.


Common Mistakes

  1. Mandating without explaining. "Use technology X" without "because of reasons Y and Z" breeds resentment and workarounds. Always explain the reasoning behind standards.

  2. Setting standards you do not enforce. A standard that is widely ignored is worse than no standard at all. It teaches people that standards are optional. If you cannot or will not enforce it, do not call it a standard.

  3. Over-engineering the process. If your architecture review process has more steps than a NASA launch sequence, something has gone wrong. Start simple and add complexity only where you have evidence it is needed.

  4. Ignoring the gap between stated and actual practices. Your engineering principles say "test-driven development." Your actual codebase has 15 percent test coverage. You have a credibility problem. Either change the principle or invest in closing the gap.

  5. Confusing consistency with excellence. Being consistently mediocre is not the goal. Consistency is a means to the end of making excellence sustainable and scalable.

  6. Neglecting the social infrastructure. Tech talks, guilds, and knowledge sharing might feel like "soft" programs, but they are the mechanism by which excellence spreads. Cutting them to "focus on delivery" is short-sighted.

  7. Not involving Staff+ engineers. If your most senior technical people are not driving engineering excellence, who is? Make it an explicit part of their role.

  8. Setting and forgetting. Principles, playbooks, and tech radars need regular review and updating. The technology landscape changes. Your organization's needs change. Your standards should evolve accordingly.

  9. Treating engineering excellence as a separate initiative. It is not a program or a project. It is how you operate. It should be woven into your hiring, your promotions, your planning, and your daily work.


Business Value

Engineering excellence is not about technical aesthetics. It has direct, measurable business impact.

Speed of delivery. Organizations with strong engineering practices ship faster because they spend less time on rework, debugging, and firefighting. When your deployment pipeline is solid, your testing practices are mature, and your architecture is well-understood, you can go from idea to production in days instead of weeks.

Reliability and customer trust. Customers do not care about your architecture, but they care deeply about whether your product works. Engineering excellence -- proper testing, observability, incident response, and operational practices -- is what keeps your product reliable. And reliability is trust. Every outage erodes it. Every smooth day builds it.

Talent attraction and retention. Great engineers want to work with other great engineers on well-run codebases using modern practices. If your engineering organization has a reputation for excellence, recruiting becomes dramatically easier and cheaper. If it has a reputation for chaos, no amount of salary will attract the best people.

Sustainable pace. Organizations without engineering discipline accumulate technical debt that eventually slows them to a crawl. What started as "moving fast" becomes "moving fast toward a wall." Engineering excellence is what allows you to maintain velocity over years, not just months.

Reduced operational costs. When your systems are well-designed, well-tested, and well-monitored, you spend less on infrastructure (because you are not over-provisioning to compensate for uncertainty), less on incidents (because there are fewer of them and they are resolved faster), and less on manual toil (because you have automated the repetitive work).

Innovation capacity. Teams that are drowning in tech debt, operational burden, and quality issues have no capacity for innovation. Engineering excellence frees up the time and mental energy that teams need to explore new ideas, experiment with new approaches, and build the next generation of your product.

The compounding effect is significant. If a 10 percent improvement in deployment frequency, a 10 percent reduction in change failure rate, and a 10 percent improvement in engineer productivity each act as independent multipliers on delivery, the combined gain is about 33 percent rather than 30, because the factors multiply instead of adding. That one-time difference is modest; the larger effect comes from such gains repeating year after year. Over two to three years, the difference between an organization that invests in engineering excellence and one that does not becomes enormous.
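
A quick check of the arithmetic, as a sketch. It assumes each 10 percent improvement acts as an independent multiplier on overall delivery, and, for the multi-year figure, that the same combined gain repeats every year. Both are simplifying assumptions of mine, not measured results.

```python
def compounded_gain(rate, periods):
    """Total fractional gain after applying the same rate `periods` times."""
    return (1 + rate) ** periods - 1


# Three independent 10 percent improvements multiplied together.
one_year = compounded_gain(0.10, 3)
print(f"combined gain in one year: {one_year:.1%}")   # 33.1%

# Assumption: the same combined gain is achieved again in each of three years.
three_years = compounded_gain(one_year, 3)
print(f"gain after three years: {three_years:.1%}")   # 135.8%
```

The single-year gap between adding and multiplying is only about three points. The sustained, repeated improvement is what produces the large difference.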


Common Pitfalls

  • Mandating standards without explaining the reasoning. Telling teams to "use technology X" without explaining why breeds resentment, workarounds, and loss of trust in leadership decisions.
  • Setting standards you do not enforce. A standard that is widely ignored is worse than no standard at all because it teaches engineers that standards are optional.
  • Confusing consistency with excellence. Being consistently mediocre is not the goal. Consistency is a means to make excellence sustainable and scalable, not an end in itself.
  • Neglecting knowledge sharing infrastructure. Cutting tech talks, guilds, and cross-team pairing to "focus on delivery" is short-sighted because these are the mechanisms through which best practices spread organically.
  • Not involving Staff+ engineers in cross-cutting work. If your most senior technical people are heads-down coding on a single team, you are underutilizing your primary lever for raising the bar across the organization.
  • Setting principles and forgetting them. Engineering principles, playbooks, and tech radars need regular review as the technology landscape and organizational needs evolve. Stale guidance erodes credibility.

Key Takeaways

  • Lead with principles, not rules. Principles scale better and preserve the autonomy teams need.
  • Use the tech radar framework to balance consistency with innovation in technology choices.
  • Playbooks capture and spread best practices. Keep them living documents owned by the people who use them.
  • Architecture reviews should be collaborative and lightweight, not adversarial gates.
  • Knowledge sharing is the circulatory system of engineering excellence. Invest in it and protect it.
  • Staff+ engineers are your primary lever for raising the bar. Make cross-cutting impact an explicit part of their role.
  • Balance autonomy and consistency based on the cost of inconsistency and the value of local optimization.
  • Engineering excellence compounds over time. The sooner you invest, the larger the returns.