Idea or Execution: What is the Key to Success?

There is a John Doerr quote you may have heard: “Ideas are easy. Execution is everything.” To put it slightly differently, mere ideas are cheap. Good or truly exceptional ideas are uncommon, yet even their value remains limited unless they are combined with the right blend of execution and strategy. It takes a team to win.

Execution done well is expensive and, like many things worth pursuing, difficult. It requires persistence, grit, teamwork, flexibility, and many other attributes, all applied well toward a common goal. However, before the execution phase there needs to be a great idea. That sounds simple enough, but neither a great idea nor flawless execution comes fast or easy.

Taking a step back, the distinction between good, or even exceptional, ideas and mere ideas is crucial. Revolutionary ideas that reshape the world result from a confluence of timing, intellect, expertise, experience, and an individual’s unique perspective. Elon Musk, Jeff Bezos, Steve Jobs, Bill Gates, and Henry Ford exemplify such visionary thinkers; they are listed in no specific order, simply as they came to mind.

There are countless products and services that, once launched, make us ask ourselves: why didn’t I think of that? They only appear straightforward after a company, its founders, and its teams have put in the long hours to bring the idea to life and offer you something you truly need, something that simplifies, eases, or improves your life.

Let’s go back to the first sentence, and revisit these questions: 

Are the ideas easy? 

Is execution everything? 

For me, the challenge often lies in conceiving a genuinely impactful idea, something that addresses a real need or introduces a solution people might not have realized they lacked until it existed. Human perspectives are as distinct as fingerprints, with diverse outlooks, opinions, and beliefs guiding our evaluations. Each individual processes the world in a unique manner, leading to distinctive approaches to identifying problems and conceiving truly exceptional solutions. The ease of generating ideas, I believe, resides in the eye of the beholder who spots an opportunity, connects the dots, and sparks a brilliant concept.

I think most ideas create a lot of activity, but not the huge impact or outcome a company is looking for. As mentioned above, an idea by itself is cheap. It’s all in how you execute or implement the idea, which includes challenges around product, marketing, sales, financing, and engineering, for example.

Instagram was not the first photo-sharing app, Facebook was not the first social network, and Amazon was not the first company to sell books online. Each started with a great idea, then moved on from ideation to the execution and implementation phase, which was crucial to get right. As I said above, that phase takes a lot of effort and the right blend of strategy and timing, among other things, to make any new idea a success.

Undoubtedly, execution is pivotal, but from my perspective it isn’t the sole determinant. In fact, I see a juncture where the power of the idea and the efficacy of execution intersect. From that juncture onward, seamless execution is imperative to realize the idea’s potential. However, it would be an oversight to claim that execution holds greater complexity or significance than conceiving a brilliant idea. Rather, it’s a delicate equilibrium. The specific demands of each idea and its execution may vary, but both elements remain inextricably linked on the path to achievement.

In summation, the landscape of technology and business leadership is marked by the interplay between ideas and execution. John Doerr’s timeless wisdom underscores that while ideas might flow effortlessly, their true worth is unlocked when they are skillfully brought to life through execution and strategic insight. This harmonious synergy, orchestrated by determined teams, propels endeavors toward success.

The journey begins with the inception of remarkable ideas, driven by a confluence of timing, intellect, experience, and unique perspectives. Visionaries like Elon Musk, Jeff Bezos, Steve Jobs, Bill Gates, and Henry Ford exemplify the capacity to reshape the world by combining these elements in extraordinary ways.

However, true transformation happens when these ideas transition from the realm of thought to the realm of action. Execution is the crucible where ideas are refined, tested, and transformed into tangible products and services. Instagram, Facebook, Amazon, and countless others stand as testaments to the pivotal role of execution in shaping industries and societies.

While execution commands its due significance, it isn’t the sole protagonist of this narrative. The convergence of a brilliant idea and effective execution sets the stage for success. The notion that execution trumps idea generation fails to recognize the inherent balance between the two. Every idea holds its unique demands, as does its execution, and the ability to navigate this intricate dance dictates the course of accomplishment.

In the grand theater of innovation, ideas provide the script, and execution brings it to life. The world’s most transformative accomplishments emerge when these elements coalesce seamlessly. Thus, as technology and business leaders, our pursuit should be twofold: to conceive ideas that transcend the ordinary and to master the orchestration of execution, guided by the understanding that these elements are not competing forces, but rather essential partners on the path to realizing monumental achievements.

Rethinking DevOps and Engineering Teams?

Knowing where you are headed, and where you have been

Today you may have pilots/POCs (proofs of concept) that skipped critical phases such as MVP (minimum viable product) and MMP (minimum marketable product). In other words, you have pilots/POCs that went straight into a live production environment.

Perhaps you have other scenarios where the pilots/POCs made it to the MVP stage, but the cross-functional teams necessary to support them were not included or kept; whether this was due to time or budget constraints isn’t always clear. At the MVP stage, you should understand the ultimate objective and the problems you are trying to solve for customers. Additional team members will likely need to be brought in to represent areas such as security and the Ops/Infrastructure service. The security and service functions help define the non-functional elements that support a live production environment, beyond simply the pilot/POC environments. The security functions are particularly important for ensuring you are aligned with specific data privacy/compliance requirements and for implementing best practices around how you store and transmit customer or other protected, sensitive data.

You may also have prior scenarios where you took your pilots/POCs to the MMP stage. Customers were actively using a particular product or service but, as with the MVP stage, the security and Ops/Infrastructure service teams were not included, or the staffing wasn’t adequate to support running and growing a live environment beyond the POC.

Team Approach and Alignment 

The above section is important to cover first, as I think it helps communicate where you may want to go next with teams, which includes DevOps-Platform, engineering, and security resources.

As you continue ramping up, and perhaps even prepare for additional pilots/POCs with your current fleet of products and services, you need to plan how your teams will do the work and, specifically, what types of work they will be doing. I will cover this in greater detail below, but essentially you need to be able to break these larger projects and problems down into micro tasks that fit within an agile team. Too many different domains and subdomains exist today within any given product or service, and within the infrastructure and systems the team supports, to approach the work any other way.

  • Teams/Squads with experts for each of the POCs 
  • Active Stakeholders which include Product Owners or the equivalent  
  • The team will have general software developers/engineers and platform (IT Operations/Security) engineering resources partnered together to accomplish work items and tasks as deliverables in each sprint. Sprint reviews give the teams, stakeholders, and leadership/management visibility, accountability, and communication.
  • Documented, evolving but clear service workflows, escalations, and interactions with the other teams. 
  • Current and ongoing documentation    

Near-Term Decisions

  • Backfilling/hiring for your principal/chief engineer/architecture role(s)
  • Consider moving to one centralized platform to track your work and to store and update documentation – maybe it’s Notion, Basecamp, JIRA, etc.
  • Moving to one ticketing system for Development, Security, and Service-related issues – maybe JIRA Service Management or something similar.  
  • Communication platforms (Slack, Teams, etc) 
  • RD time – Friday afternoon or another time when team members can work on CIs, other low-hanging fruit work items in the backlog 
  • Do we want teams to use Tribes, Squads, Teams, and Departments? The goal is ultimately to have smaller teams following agile.  
  • Platform Engineering/DevOps Teams and Topologies

Long-Term Issues and Decisions

  • Solutions to help teams manage the unplanned work coming in 
  • Leveraging some RD time, enablements  
  • Reducing the amount of context switching issues between projects, tasks, products, customers 
  • Reducing longer lead times
  • Teams handling platform/system support and break-fix work need a better way to identify, log, track, and prioritize it with the best available resources.
  • Teams/resources handling implementations, migrations, onboarding, and ongoing support likewise need a better way to identify, log, track, and prioritize that work with the best available resources.

Other Potential Issues 

  • The engineering teams can become too large 
  • Systems can become monolithic 
  • Work can become blocked across software engineering and service/operations resources 
  • Teams/members become too specialized which can create dependencies 
  • Software/platform/systems can become too large and complicated
  • Documentation becomes nonexistent or not kept up to date 
  • Original/key team members leave the organization, taking critical domain knowledge and specialized skills with them
  • DevOps resources may face a large number of requests from the various engineering squads/internal customers. As the software engineers in those product squads advance and level up, the incoming requests can become more domain and subdomain specific.

Proposed Team Structure – Includes Software, Platform, and Engineering Resources

  • Using more of a micro team/squad approach, project work items can be organized into two-week agile sprints, and the required team members and resources can be grouped to accomplish a sprint goal. An available DevOps FTE, for example, could be assigned along with another engineering resource.
  • The smaller/micro teams help with frequent, focused communication. 
  • Instead of a centralized DevOps person/resource, embed platform engineering/ops into the dev teams/squads. It will take some time to find the flow that works, and this really will depend on what works for the various teams.
  • More focus on the domain and subdomain skills and knowledge required. 
  • For each potential solution, product, or service, define the platform required and the services needed to run, scale, and secure the solution.
  • Responsibilities for building, deploying, supporting, and retiring a specific service within a business. More business and customer focused. 
  • The teams below will exist in some part to reduce the number of disruptions, distractions, and overload that’s placed on the stream-aligned team.  
  • External/temporary help-support/services to help the teams quickly upskill or implement other best practices to support the stream-aligned team. 
  • Provides specialized services such as machine learning, and analytics/reporting components.  
  • Working with the team leads, and principals to discuss a new team topology and approach.   
  • Based on the adoption and effectiveness of this new approach, we would then consider whether to roll out a similar team structure for other teams within an organization.

Not seeing the Forest for the Trees

Our daughter had been working hard on her latest school art project, and it got me thinking about an old saying that I am sure many of you have heard. If not, the saying goes like this: “cannot see the forest for the trees.” It means a person, or perhaps a team, has trouble seeing the big picture because they remain too focused on the specific details.

This is a pretty short post, but one reason for sharing it is to stress the importance of keeping the right perspective and focusing on what matters, and why, for your team and organization. One quick example that comes to mind is when organizations have production or other critical deliverables due to their customers, with key milestones, yet teams are off spending too much time deciding whether to use Kubernetes vs. Docker for containerization. It’s critical that we understand and see the big picture and don’t get stuck in the minutiae, as those small details may change along the way to our longer-term goals and objectives.

Visualizing and Reporting on Amazon Cloud Costs & Usage

From the very beginning of cloud, I have to admit, this has been a constant struggle. Over the past several years, Amazon has offered us some great tools, including Cost Explorer, to help understand costs, usage, and potential opportunities to save and reduce operating costs.

Fast forward to the end of 2022: the new AWS Cloud Intelligence, CUDOS, KPI, and Trends Dashboards are absolutely fantastic; see the link below. To take the reports a step further, you must have your tagging set up and configured, as tags can be used to further filter the costs. The results are some really powerful business and operating insights. You will be able to look at past usage and forecast for the future. The data refreshes within 24 hours, so it is not real-time, but pretty close.

https://aws.amazon.com/blogs/mt/visualize-and-gain-insights-into-your-aws-cost-and-usage-with-cloud-intelligence-dashboards-using-amazon-quicksight/
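To make the tagging point concrete, here is a minimal sketch of how tags let you slice costs. The line items, tag keys, and the `cost_by_tag` helper are all hypothetical and invented for illustration; real Cost and Usage Report data has a different and much larger schema, and the dashboards do this filtering for you.

```python
from collections import defaultdict

# Hypothetical, simplified cost line items (not the real CUR schema).
line_items = [
    {"service": "AmazonEC2", "cost": 120.0, "tags": {"environment": "prod", "project": "checkout"}},
    {"service": "AmazonS3", "cost": 30.0, "tags": {"environment": "prod", "project": "checkout"}},
    {"service": "AmazonEC2", "cost": 45.5, "tags": {"environment": "dev", "project": "search"}},
    {"service": "AmazonRDS", "cost": 80.0, "tags": {}},  # an untagged resource
]


def cost_by_tag(items, tag_key):
    """Sum cost per value of one tag key; items without the tag land under 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get(tag_key, "untagged")] += item["cost"]
    return dict(totals)


print(cost_by_tag(line_items, "environment"))
print(cost_by_tag(line_items, "project"))
```

The "untagged" bucket is the useful part: it shows how much spend cannot be attributed to anyone until tagging is in place.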

What I have learned over the last 40 years…

  • My hope is that you will read these and either relate to them or learn from and apply them in the future. They aren’t listed in any particular order.
  • 1) Trust and follow your own hunches and personal intuition, they are normally right
  • 2) You can’t please or say yes to everyone, so stop trying
  • 3) Remove or distance yourself from situations or people who don’t add positive value to you, your life, family, career, etc
  • 4) Don’t measure your success by comparing yourself to others. We are all at different places in our lives
  • 5) Each of us has a special talent or ability that’s unique to us
  • 6) Always take the time to pay it forward to someone; at some point, someone took time for you
  • 7) Having grit can be the difference between your success or failure
  • 8) One of your greatest skills to develop and master is ownership and responsibility
  • 9) Some of the most rewarding situations and experiences will require some level of risk and personal sacrifice
  • 10) Time, not money is the most valuable thing
  • 11) Whether people acknowledge it or not, there is a little truth in every joke
  • 12) Whether it’s said or not, people do things because they have something personally to gain from it
  • 13) When faced with a challenge or other dilemma, start by asking why
  • 14) When faced with an ethical dilemma, always try and choose a solution or an alternative that offers the greater good or the lesser evil
  • 15) Everyone is busy with life, always find time for your family, especially your spouse and kids
  • 16) If you made a wrong along the way, make it right
  • 17) As an individual person, be real, be consistent
  • 18) Start with the end state in mind, before you start something new
  • 19) Be self-taught, learn how to be a continuous learner
  • 20) It may not seem fair, but no matter what you do or how hard you try, there will always be people who will simply not like you
  • 21) Listen to good music
  • 22) Whether hard copies or digital, read books, new releases, and the classics
  • 23) Continuous, incremental improvement, say 5% daily, may seem like going nowhere fast until you realize you spent 5% x 30 days x 12 months on improvement, far greater than 0
  • 24) Show up early and always be on time
  • 25) You will not always be the smartest person in a room, situation, or conversation

What is your AWS tagging strategy?

If this concept is new, no worries, just head over and read this.

I would also recommend that you spend some time thinking about how you want to approach this before just jumping in. It’s not impossible to change course once you start down a path, but it is much easier to take some time and think through how best to go forward. If you really aren’t sure, try starting with technical and business tags. A technical tag could be the environment or an application ID. A business tag could be a project name, or a product or product type; the actual names are entered as the tag values.

Some of the benefits, or more practical reasons, for using tags include resource organization, cost allocation (P&L, project, customer, business unit/team), automation and access control (IAM conditions), and security or risk identification.
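As a starting point, a tagging plan can be as small as a list of required keys split into technical and business groups, plus a check that flags resources that don’t follow it. The sketch below is only an illustration: the key names (`environment`, `application_id`, `project`, `business_unit`) and the allowed environment values are assumptions, not an AWS standard.

```python
# Hypothetical tagging plan: required keys, grouped the way the text suggests.
REQUIRED_TAGS = {
    "technical": ["environment", "application_id"],
    "business": ["project", "business_unit"],
}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}  # assumed values


def find_tag_problems(tags):
    """Return a list of human-readable problems for one resource's tags."""
    problems = []
    for group, keys in REQUIRED_TAGS.items():
        for key in keys:
            if not tags.get(key):
                problems.append(f"missing {group} tag: {key}")
    env = tags.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected environment value: {env}")
    return problems


good = {"environment": "prod", "application_id": "app-123",
        "project": "checkout", "business_unit": "retail"}
bad = {"environment": "production", "project": "checkout"}

print(find_tag_problems(good))  # prints []
print(find_tag_problems(bad))
```

Checking values as well as keys matters because "prod" and "production" would otherwise show up as two separate groups in any cost or inventory report.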

As your environment(s) mature, grow, and become more complex, it will become extremely helpful to have a solid, well-thought-out tagging plan.

Remember, it doesn’t need to be super complicated or complex from the start. For me, this is one of those situations where something is better than nothing, and the sooner the better!

Both Dev and DevOps teams often struggle with tagging consistency, both in how often resources get tagged and in the tag content itself. The eventual answer may be automation, and a great option to consider is Yor.

Yor is an open-source tool that automatically adds tags to infrastructure-as-code configurations. Yor currently supports Terraform, AWS CloudFormation, and Serverless.

By default, Yor will add a number of tags to each resource block, including the name of the git organization, the repository, and the file that contains the template that created the resource. Another really powerful feature of Yor is that it adds a unique identifier, which allows a quick search in GitHub, for example, to locate the code.
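To show why those tags are useful, here is a small sketch that turns a resource’s tags back into a pointer to its source file. The tag keys used (`git_org`, `git_repo`, `git_file`, `yor_trace`) are, to my understanding, among the ones Yor adds by default, but please verify them against the Yor documentation for your version. The sample values and the `source_location` helper are hypothetical.

```python
# Sample tags as they might appear on a deployed resource (values are made up).
resource_tags = {
    "git_org": "example-org",
    "git_repo": "infra",
    "git_file": "terraform/s3.tf",
    "yor_trace": "00000000-0000-0000-0000-000000000000",
}


def source_location(tags):
    """Return 'org/repo:file' from Yor-style git tags, or None if any tag is missing."""
    try:
        return f"{tags['git_org']}/{tags['git_repo']}:{tags['git_file']}"
    except KeyError:
        return None


print(source_location(resource_tags))  # prints example-org/infra:terraform/s3.tf
print(source_location({"environment": "prod"}))  # prints None
```

In practice the trace identifier is the value you would paste into a code search to jump from a running resource to the block that defines it.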

Your Best Engineers Can Also Be Great Business People

It can take some people longer than others to arrive at this conclusion, which is simply that a team’s approach matters more than the technical details themselves. If you don’t agree with that statement, that’s ok, but perhaps after reading this article you will feel differently.

Over the last 20+ years of my career, which spans technology, leadership, security, and business, I have done, and on occasion still do, some hands-on technical work. Over the years I have come to disagree with the view that technical leaders or heads of engineering know more, or know better, than those outside the technology or engineering fields. Those others have brought us a slightly different mindset or angle, which I think helps us as leaders develop teams from good to great.

To go a little deeper, when I say good to great, I mean those individuals who ask me, our team, or others smarter questions, who see what’s ahead and not just what’s in front of them, who prioritize, and who often, figuratively, separate themselves from others around them. As leaders, this is how we spot potential talent or the beginnings of a great leader. The timing can be critical, as this is when a leader needs to make themselves available to guide and support this person, so they develop rather than burn out, give up, or feel frustrated or abandoned.

Let’s dive a bit deeper into the concepts I mentioned above: asking smarter questions, and evaluating not only what’s in front of you but also what’s further down the road that we should prepare for, avoid, or delay until we get there.

Business focused Engineers/Technical people are thinking about the following:

  • Staying focused, aligned, and prioritized on the work that will pay off sooner rather than later. With any type of work, there is a cost for doing that work; the sooner we see a return or reward, the better.
  • Before jumping into a new project, upgrade or migration, take the time to estimate or calculate whether the effort is worth an individual or team’s time.
  • No organization, team, or individual has an infinite budget, so we often need to look at work or projects in terms of opportunity costs. In other words, the same person or resources can’t be used for multiple deliverables at the exact same time; to DO one task will always mean we are NOT doing another task. This is perfectly fine, but this is where it’s critical that the task we decide to do is aligned with business outcomes and has the expected financial return or reward.

Looking at the first bullet item, business-minded engineers and Tech staff should be constantly asking when will this work pay off, and when and where is the return.

Engineering and technology work has a time value: projects that pay off sooner are worth much more than those that pay off later. So we try to avoid the work items that pay off too far into the future.

Engineering and tech projects such as upgrades and migrations carry a huge burden of guaranteed upfront costs, and honestly, the rewards of these efforts are usually unknown or unclear and often a long way out into the future. The other reality is that the returns often take longer than a business or its leaders may want or even realize. In fact, a typical one-year upgrade or migration may not provide business owners or stakeholders with any reward or return until the second year or later! This may seem obvious, but the return from the work must also exceed the upfront cost of doing it. No business wants to spend one year on a project only to save one year; the return or reward needs to be more compelling than that!
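A rough way to put numbers on this is a simple payback calculation. The sketch below is a simplification under stated assumptions: the benefit is a flat monthly amount that only starts after the project finishes, there is no discounting, and all figures are hypothetical.

```python
import math


def payback_month(upfront_cost, monthly_benefit, delay_months=0):
    """Return the month (counted from project start) in which cumulative benefit
    first covers the upfront cost, or None if there is no positive benefit.

    delay_months is the time spent doing the work before any benefit arrives.
    """
    if monthly_benefit <= 0:
        return None
    return delay_months + math.ceil(upfront_cost / monthly_benefit)


# Hypothetical migration: 12 months of work costing 240,000,
# then saving 20,000 per month once it is live.
print(payback_month(240_000, 20_000, delay_months=12))  # prints 24
```

In this made-up case the project only breaks even at the end of the second year, which is exactly the kind of timeline worth surfacing before the work starts.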

Looking at the next bullet item, the business-minded engineer or Technical leader evaluates whether the work, project, or opportunity is worth the time.

This is often difficult, but it is where time and attention belong: deciding what work should be done and what provides the most value to an organization. Every project or potential piece of work has value to someone, which is why there is an almost nonstop inflow of requests for some output or deliverable.

As another rule, leaders and staff should be continuously asking themselves whether a project is worth the time required to complete it. There will always be exceptions or special situations, such as information security or end-of-life/support scenarios; I am referring to everything else outside of that.

Let me share a quote from Warren Buffett that says, “A good management record is far more a function of what business boat you get into than it is how effectively you row. There is no extra credit for the degree of difficulty, lower your degree of difficulty.”

The same thing applies to engineering and technology. Having our teams work on the right projects is more important than the minutiae, such as the tech stack we select or the lines of code we write.

We need to be able to decide when it makes sense for us to be builders and when we should simply purchase something off the shelf that is ready to go or that requires only some light integration with existing business processes. If off-the-shelf doesn’t fit, and the amount of customization needed outweighs building, then maybe building makes sense. This is also an area where a business may take the time to evaluate its requirements or processes and decide whether those can change and be more flexible, rather than building a system or solution around them.

Switching gears a little, and moving more into infrastructure: whether we decide to host apps/systems in the cloud or keep them on-premises, we need hard data to look at, and we need to calculate the costs and benefits before deciding either way. Some questions we may want to ask are below; there are probably many more:

  • If we purchased a solution off the shelf, can the team we have today onboard, integrate, and maintain it?
  • Here it is important we accurately estimate the costs of any building project to help ensure the expected return is greater than the building effort and time.
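One simple way to frame the estimate is to compare total cost over a fixed horizon. This is a minimal sketch, not a full financial model: it assumes a one-time upfront cost plus a flat annual cost for each option, ignores discounting and risk, and uses invented numbers.

```python
def total_cost(upfront, annual, years):
    """Upfront cost plus a flat annual cost (maintenance, licenses, staff) over the horizon."""
    return upfront + annual * years


def cheaper_option(build, buy, years):
    """Return 'build', 'buy', or 'tie' for the given planning horizon in years."""
    build_total = total_cost(build["upfront"], build["annual"], years)
    buy_total = total_cost(buy["upfront"], buy["annual"], years)
    if build_total == buy_total:
        return "tie"
    return "build" if build_total < buy_total else "buy"


# Hypothetical figures: building costs more up front, buying costs more per year.
build = {"upfront": 300_000, "annual": 50_000}
buy = {"upfront": 40_000, "annual": 120_000}

print(cheaper_option(build, buy, years=3))  # prints buy
print(cheaper_option(build, buy, years=5))  # prints build
```

The answer flips with the horizon, so it is worth agreeing on how many years the business expects to run the system before comparing the options.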

The last bullet or concept is whether the project or work will move the organization forward.

This is related of course to the other concepts discussed above, but it’s important the teams and resources are spent on those things which move the business forward and are aligned with other business priorities and objectives. As already mentioned, we all have a continuous flow of requests, but often we don’t have the proper justification or the proper financials to show us the expected return on the work done and from what upfront cost.

With many engineering and technology projects, we often have or observe a level of technical debt that needs to be considered in terms of our opportunity cost. In other words, we will need to decide whether we are willing or able to give up having or doing something else. We may find that cleaning up one system, app, or database means we simply will not be able to clean up another. This is not unique or special by any means; all businesses and tech teams face these challenges at some point. It does become important to focus on specific cleanup efforts and, even more critically, on the ones that move the business forward by providing the most value or the greatest impact. Updating or cleaning up a critical business system that most of an organization uses is the kind of tech debt to focus resources and time on, versus a smaller internal app or system that is rarely used or is used only by a small internal group.

As a rule, teams should always consider the opportunity cost of their project work. Remember that by doing one thing, we are always choosing not to do another. Our time is very limited and we can’t go back and reclaim time lost, so we need to be aware of what, where, and how we are spending it.

Business/Enterprise Spending on Cybersecurity

Organizations have continued to invest in cybersecurity. Aside from the budget or the actual amounts, the focus needs to be on whether the funds were properly allocated for a particular year. The security investment made in 2021 or 2022 may look much different from what businesses have planned and budgeted for in 2023. Organizations of all sizes will either maintain or increase their security budgets for 2023.

Business verticals, industries, and sectors are concerned about cybersecurity breaches, but compliance, risk management, and other mandates are additional areas where focus, priority, and budget are increasing.

With the start of the global pandemic in 2020, organizations rethought their overall cybersecurity and technology investment priorities, with some projects and innovations being pushed out for months and even years. Organizations have a finite pool of resources, whether that’s people, software, cloud, or cybersecurity. The pandemic’s effects and other ongoing economic and political issues have forced many organizations to prioritize operating and supporting ongoing remote work, keeping customer deliverables on track, and protecting the company brand and overall reputation. Cybersecurity spending for some may still take a backseat, even if only temporarily.

Cybersecurity attack strategies and vectors continue to evolve, and threat actors have access to the same cloud technologies that many businesses have leveraged or will leverage, which allows them to evolve and expand their capabilities as well. Even with increased cybersecurity budgets, some organizations continue to use the same tools, techniques, and software to defend their systems and data. There have been many recent advancements in AI/ML, and security solutions that leverage this technology help position a business to keep pace with today’s threats.

Technology leaders with a business background who head cybersecurity or IT understand that these are both cost centers and not revenue generators for companies. The goal for some organizations is to effectively manage risk and satisfy all security compliance requirements and mandates, while being thoughtful and prescriptive about what and where they spend those security budget dollars. As I have said in other articles, our IT and cybersecurity budgets are not infinite. It’s critical to allocate the budget thoughtfully, to specific roadmap work that is aligned with business objectives and that protects and moves the organization forward. If not managed well, the budget, especially the cybersecurity budget, could easily be overrun by objectives and initiatives that don’t reduce exposure or risk, when others could have had a much greater impact.

Organizations, both large and small, typically think about cybersecurity as software, tools, services, etc. Don’t forget the human elements, such as security awareness training and continuing education for employees. Even with all the latest security and technology in place, there will still be very low-tech entry points that opportunistic threat actors will take advantage of.

The pillars of “good” that eventually lead us to great software and products

I have a pretty long list of books to read this year, along with a never-ending stack of books on my nightstand to get through.

One of the books I started reading is Designing Data-Intensive Applications by Dr. Martin Kleppmann. Amazon Web Services, for its part, focuses on the various pillars of its Well-Architected Framework. With some overlap between the two, I want to cover pillars such as reliability, scalability, maintainability, and security, and why it’s critical to get them right in an organization’s software and products. To be clear, there are many important pillars; this is by no means an exhaustive list, just what I am choosing to focus on in this specific writing. Let’s go through each pillar, starting with reliability, and so on.

Reliability

We as businesses, and our customers, need our systems to be available, responsive, and working correctly all the time, even when there is some unplanned or unexpected situation with infrastructure, software, process, people, security, or a specific region of a public cloud provider. The bottom line is that it’s critical to get ahead of these inevitable events and to be thoughtful in designing systems that need to be fault-tolerant or otherwise resilient.

Team reviews, or more random methods such as Chaos Monkey, may help teams find opportunities to make a system as resilient as it can be, provided budget and other potential requirements and constraints allow.

Infrastructure/Hardware

One of the benefits realized with virtualization, cloud, and containers is the ability for a system to remain online, even in a possibly degraded state. As the abstracted hardware or physical layers are remedied behind the scenes, apps and services can be restored to a normal operational state and ultimately stay resilient and online. Extending this idea further, the concept of redundancy takes the stage. With redundancy, we are talking about building and deploying an app, service, or capability on multiple target machines or nodes. The idea, again, is that if one node or machine has an issue, the app or service doesn’t go offline or dark. For me, examples of critical services where this matters are orders and authentication.
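To make the redundancy idea a little more concrete, here is a minimal Python sketch of a caller trying redundant nodes in turn. The node functions, the `AllNodesFailed` exception, and the `order-123` request are all hypothetical names I made up for illustration; a real deployment would more likely rely on a load balancer or service mesh for this.

```python
from typing import Callable, Sequence


class AllNodesFailed(Exception):
    """Raised when none of the redundant nodes could serve the request."""


def call_with_failover(nodes: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant node in order and return the first successful reply."""
    failures = 0
    for node in nodes:
        try:
            return node(request)
        except ConnectionError:
            # This node is down or unreachable; move on to the next one.
            failures += 1
    raise AllNodesFailed(f"{failures} node(s) failed for request {request!r}")


# Hypothetical nodes: one that is offline and one that is healthy.
def broken_node(request: str) -> str:
    raise ConnectionError("node unreachable")


def healthy_node(request: str) -> str:
    return f"handled {request}"


result = call_with_failover([broken_node, healthy_node], "order-123")
print(result)
```

The request is still served even though the first node is down, which is the whole point of deploying to more than one node.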

Software/Coding

Having clear visibility into logs and events is critical. It gives engineers and other stakeholders insight into commits and helps them identify, correlate, and hopefully resolve errors or other conditions that may be causing failures or poor performance.

As systems become more distributed and are designed with scale in mind, it can become much more difficult to find issues and correlate them with the other services and technology in a stack that may be leading to errors or other app and service failures.
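One common way to make correlation easier, and this is my own assumption rather than something specific to any one tool, is to have every service log a shared request ID. The sketch below groups hypothetical log events by that ID so one request can be followed across services; the field names (`ts`, `service`, `request_id`, `message`) are invented for the example.

```python
from collections import defaultdict


def group_by_request(events):
    """Group log events by request ID, ordered by timestamp within each request."""
    traces = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        traces[event["request_id"]].append(f'{event["service"]}: {event["message"]}')
    return dict(traces)


# Hypothetical events collected from three different services.
events = [
    {"ts": 3, "service": "payments", "request_id": "req-1", "message": "card declined"},
    {"ts": 1, "service": "gateway", "request_id": "req-1", "message": "received"},
    {"ts": 2, "service": "orders", "request_id": "req-1", "message": "order created"},
    {"ts": 2, "service": "gateway", "request_id": "req-2", "message": "received"},
]

traces = group_by_request(events)
for line in traces["req-1"]:
    print(line)
```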

If an app or service does go down or offline, in-flight requests need to be stored somewhere off the server so that they can be handled when the system comes back online. This prevents requests from being lost or having to be reentered manually later. One possible solution is to use a message queue.
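As a rough illustration of the idea, the sketch below stores requests in a small SQLite table and only deletes each one after it has been handled. The `DurableQueue` class is a hypothetical stand-in I wrote for this article; in practice a team would more likely reach for a managed or dedicated message queue, and would pass a file path instead of the in-memory database used here.

```python
import sqlite3


class DurableQueue:
    """Tiny store-and-forward queue backed by SQLite (illustrative only)."""

    def __init__(self, path=":memory:"):
        # Use a file path in real use so requests survive a restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS requests ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, payload):
        """Save a request while the downstream service is unavailable."""
        self.db.execute("INSERT INTO requests (payload) VALUES (?)", (payload,))
        self.db.commit()

    def drain(self, handler):
        """Replay saved requests in order; a request is removed only after handling."""
        rows = self.db.execute("SELECT id, payload FROM requests ORDER BY id").fetchall()
        processed = 0
        for row_id, payload in rows:
            handler(payload)  # if this raises, the row stays queued for a retry
            self.db.execute("DELETE FROM requests WHERE id = ?", (row_id,))
            self.db.commit()
            processed += 1
        return processed


queue = DurableQueue()
queue.enqueue("order-1001")
queue.enqueue("order-1002")

handled = []
count = queue.drain(handled.append)  # the service is back online
print(count, handled)
```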

The human element

There have been plenty of cases where a person has pushed a change and later found that it contributed to an app or service going offline or running in a degraded state. People make mistakes; we are human after all, not machines. Even after an entire team has reviewed a change, we still manage at times to miss something.

One practice is for a Dev, Ops, or separate platform team (insert your team name here) to build and deploy systems using Infrastructure as Code rather than building and deploying systems manually. One other quick note here: whether a team follows Agile, Scrum, or Kanban, if the chosen method is too restrictive, doesn’t work well, or there is simply a lack of training for an individual or group, team members may resort to manual changes to the infrastructure. The running infrastructure will then be out of sync with the code that has been checked into the code repository.
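The out-of-sync problem is often called drift. Here is a minimal sketch of a drift check, assuming both the checked-in definition and the running state can be read as simple key/value settings. The setting names and values are made up for the example, and real Infrastructure as Code tools have their own, much richer, ways of reporting this.

```python
MISSING = "<missing>"  # marker for a setting that exists on only one side


def find_drift(declared, running):
    """Return {setting: (declared_value, running_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | running.keys()):
        want = declared.get(key, MISSING)
        have = running.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings: what is in the repository vs. what is actually running.
declared = {"instance_type": "t3.small", "min_nodes": 2, "port": 443}
running = {"instance_type": "t3.large", "min_nodes": 2, "port": 443, "debug": True}

drift = find_drift(declared, running)
for setting, (want, have) in drift.items():
    print(f"{setting}: declared={want} running={have}")
```

Here someone resized the instance and switched on a debug flag by hand, and neither change made it back into the repository.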

Scalability

When we think about scalability, we may think about how a system responds, positively or negatively, when we increase the load, such as the number of requests or the number of users accessing an app or service. During load testing, a team may set the total number of users to simulate, the spawn rate, etc. The team can then analyze the data from a load test to identify potential failures and make updates as needed to accommodate scaling up.

Some systems have batch and queue processing for jobs that need to complete quickly or interactively and are tightly aligned with a business process, while other jobs may run longer or even be scheduled to run overnight. With either job type, we are interested in how long it takes to process the job(s), how many jobs can be completed per minute or per hour, and whether that number decreases as the number of concurrent jobs increases. Maybe you have business requirements where specific financial jobs must finish within 24 hours for month-end processing. No matter what other ad-hoc jobs run, those scheduled or overnight jobs must complete within 24 hours, so perhaps they need their own dedicated queue or system resources allocated at specific times.
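The arithmetic behind that 24-hour question is simple enough to sketch. The numbers below (450 jobs finished in 30 minutes, 18,000 jobs pending) are invented for illustration, and the sketch assumes throughput stays constant, which, as noted above, may not hold as concurrency increases.

```python
def jobs_per_hour(completed_jobs, elapsed_seconds):
    """Observed throughput, scaled to jobs per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed_jobs / elapsed_seconds * 3600


def hours_to_finish(pending_jobs, rate_per_hour):
    """Estimated hours to clear the backlog at a constant rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return pending_jobs / rate_per_hour


# Hypothetical measurement: 450 jobs completed in 30 minutes.
rate = jobs_per_hour(completed_jobs=450, elapsed_seconds=1800)

# Hypothetical month-end backlog of 18,000 jobs with a 24-hour deadline.
hours = hours_to_finish(pending_jobs=18000, rate_per_hour=rate)
meets_deadline = hours <= 24

print(rate, hours, meets_deadline)
```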

Locust (locust.io) is an open-source Python tool that allows a website, API, or app to be tested for performance. Locust provides statistics per request type and name, including the number of failures and the median, average, minimum, and maximum response times in milliseconds (ms).
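To show what those summary numbers mean, here is a small sketch that computes the same kind of statistics from a list of response times using only the Python standard library. This is not Locust’s own code or API; the `summarize` function and the sample timings are my own illustration.

```python
from statistics import mean, median


def summarize(response_times_ms, failures):
    """Summarize response times (in ms) the way a load-test report might."""
    return {
        "requests": len(response_times_ms),
        "fails": failures,
        "median_ms": median(response_times_ms),
        "average_ms": mean(response_times_ms),
        "min_ms": min(response_times_ms),
        "max_ms": max(response_times_ms),
    }


# Hypothetical response times for one endpoint, in milliseconds.
samples = [120, 95, 310, 150, 101]
report = summarize(samples, failures=1)
print(report)
```

Notice how the single slow request (310 ms) pulls the average well above the median, which is one reason to look at both.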

Maintainability

The maintainability pillar is about following best practices, which include keeping documentation in a central location and identifying current SMEs and assigning them to the various portions of a system. This is especially true of critical and legacy systems, where many if not all of the original team may no longer be there or have been reassigned to other projects and work. The Dev or DevOps teams still need to keep the system up and running and keep performance at the desired level. As I mentioned when I covered logs, monitoring the health of the system is really important; sometimes you can predict, and be alerted to, a situation that is developing and could cause an unplanned outage or downtime. We also want to be proactive, which includes keeping a system up to date with regular security patches and updates.

Security

The Security pillar is a very important one; in fact, it’s part of the Amazon Web Services Well-Architected Framework. This pillar focuses heavily on protecting information, systems, and assets through various mitigation strategies while still delivering business value.

Taking a deeper dive into the AWS Security Pillar, there are five main areas: Identity & Access Management, Data Protection, Detective Controls, Infrastructure Protection, and Incident Response. I’m focusing heavily on Amazon Web Services in this section, but some of the fundamentals apply outside of AWS.

Identity & Access Management - Covers AWS services such as AWS IAM, AWS Directory Service, and AWS Organizations.

Data Protection - Covers AWS KMS and AWS CloudHSM.

Detective Controls - Covers AWS CloudTrail, AWS Config, AWS Security Hub, and Amazon GuardDuty.

Infrastructure Protection - Covers Amazon VPC, AWS WAF, and AWS Systems Manager.

Incident Response - Covers AWS CloudTrail, Amazon SNS, and Amazon CloudWatch.

Expanding further on the five areas of the AWS Security pillar, each has some important but simple best practices.

Identity & Access Management

  • When an AWS account is established, a root account is created. The AWS root account should not be used for day-to-day work; this reduces the overall attack surface.
  • Enable MFA on the root account and on all IAM user accounts.
  • Use IAM permission boundaries. Regardless of the permissions assigned to roles, the permission boundaries restrict the effective permissions.
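The permission boundary point is easier to see with a small model. The sketch below treats a role’s permissions and its boundary as plain sets of allowed actions and takes their intersection. This is a deliberate simplification on my part: it ignores explicit denies, resource-based policies, and organization-level policies, and it is not the AWS API, just an illustration of the idea.

```python
def effective_permissions(role_permissions, boundary):
    """Simplified model: an action is allowed only if both sets allow it."""
    return role_permissions & boundary


# Hypothetical role that was granted more than the boundary permits.
role_permissions = {"s3:GetObject", "s3:PutObject", "iam:CreateUser"}
boundary = {"s3:GetObject", "s3:PutObject", "ec2:RunInstances"}

allowed = effective_permissions(role_permissions, boundary)
print(sorted(allowed))
```

Even though the role was granted `iam:CreateUser`, the boundary does not include it, so in this model it is not effective. The boundary also does not grant `ec2:RunInstances` on its own; it only limits what the role’s own permissions can do.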

Detective Controls

  • Helps identify security misconfigurations
  • Identify threats, threat actors, or other unexpected behavior
  • Alerting, metrics, and event notifications

Enabling AWS Security Hub lets you collect security data from across AWS accounts, services, and supported third-party partner products, and helps analyze the findings to identify trends. The AWS services include the following:

  • Amazon GuardDuty
  • Amazon Macie
  • AWS Firewall Manager
  • AWS Config
  • Amazon Inspector
  • IAM Access Analyzer

On-call, achieving the best outcomes for Customers and Teams

When we think of on-call, many software and infrastructure engineers think of late-night calls and disruptions to family and other life events when things go down unexpectedly or are no longer responding as expected.

The goal for the on-call team or person responding to the issue is to get the site or app back online and working again as quickly as possible. This doesn’t mean a permanent fix is always put in place to get the site or app back up. Often, once things are stable and recovered, engineering teams will work with the on-call team or person, and engineering will ultimately come up with a more permanent fix once the reason(s) for the issue or event are better understood. The root cause does need to be found and communicated to various stakeholders, but that’s more of an eventual outcome that’s part of the blameless post-mortem process.

On-call for products, services, or infrastructure is often handled by a team that rotates weekly, or by a person, responsible for responding to and resolving operational issues according to an agreed SLA or other support agreement that’s in place. Operational or support issues are single or multiple events that impact a system negatively. As I said above, it could be that a site or app has suddenly become slow, is no longer responding, is returning errors, or is simply unavailable to end users or customers. Sometimes there are more obscure situations where a portion of the system isn’t working correctly, maybe the checkout or payments process, while the rest of the site or app is up and working just fine.
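A weekly rotation is simple to express in code. The sketch below works out who is on call for a given date, assuming the rotation changes hands every seven days from a known start date. The engineer names and the start date are made up, and real on-call tools also handle overrides, time zones, and escalation, which this does not.

```python
from datetime import date


def on_call_for(day, rotation_start, engineers):
    """Return who is on call on `day` for a rotation that changes every 7 days."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Hypothetical rotation starting on Monday, January 1, 2024.
engineers = ["alice", "bob", "carol"]
start = date(2024, 1, 1)

print(on_call_for(date(2024, 1, 3), start, engineers))   # first week
print(on_call_for(date(2024, 1, 8), start, engineers))   # second week
print(on_call_for(date(2024, 1, 22), start, engineers))  # rotation wraps around
```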

Typically we see both engineering and support teams working together during an on-call response to situations or events they are responsible for; this arrangement is called DevOps. To be clear, the DevOps label is often used to cover many different aspects of technology, such as development and operations. In this case, DevOps refers to the same team of engineers writing, operating, maintaining, and supporting their code. Software engineers who write good software must understand how the software runs in production, particularly at scale.

To quote Werner Vogels here, “You build it, you run it.” The engineers who wrote an application or service will be the best placed not only to address the issue but also to formulate a fix or patch. We see IT operations teams put on-call to support or recover applications or sites they didn’t write, develop, or update, without the proper documentation or context. They are often ill-equipped to resolve the issue or event when it goes beyond a restart or rollback, for example.

For the IT Ops on-call team or person, it can be like navigating a building without a map of the layout and without a flashlight, and the building may also be on fire, so time isn’t on your side. This is where DevOps, as I have described above, becomes the better way and the model for how on-call needs to work, particularly with software products and services. In the more traditional enterprise technology model, IT Ops and software development teams sit on opposite sides of the wall. Over the years, teams have painfully found this doesn’t work well, but there is hope.

Going back to Werner Vogels and “You build it, you run it” for a moment, there is another powerful and beneficial reason for the same DevOps team of engineers to support and run the code they wrote. Those engineers will be motivated, or maybe even annoyed, into fixing whatever issue(s) are keeping them up at night or disrupting their weekends. That is very different from having the IT Ops on-call team or person alerted to situations or events they didn’t necessarily cause and may not be able to fix.

One other note on DevOps teams: no two are the same. Engineers come from across the spectrum in terms of on-the-job experience, education, and the organizations and teams they have worked with. Having a more senior or principal engineer working with someone with less experience is invaluable to all. Principal engineers are there for other team members to learn from, respect, and even disagree with, so there isn’t a missed opportunity or idea. Principal engineers have the experience; they have been around and have seen their share, and they can look at code, for example, and give us important insights into the what, the where, the how, and the when. A junior engineer, given the same scenario or review, may only give us the what, and maybe the when if we are lucky. Do we see the difference? It’s simple: it boils down to experience, being right often, and having developed, tuned knowledge and intuition.

One final thought: having a better understanding of what the customer requirements are now, in terms of an SLA or other support agreement, helps the DevOps team to build, iterate, and improve upon what’s already been built, deployed, and secured.