Azure Well-Architected Framework - An Architect's Arsenal

Mar 13, 2025 | 5 min read

For the past month I have been on a field trip through the Azure Well-Architected Framework. What I have realized is that it is one of the best arsenals for any architect out there on the open software battlefield. In this article, we will explore the potential of the Azure Well-Architected Framework and how it pushes developers and architects to think critically about their workloads.

The Azure Well-Architected Framework mainly addresses the question "Am I doing the thing right?". Doing things right is easier said than done, because reality is not black and white; it is a color splash. A well-designed architecture is not very common, and when an architecture is designed well it gets called a wonder (seven wonders reference here :P). If you are an architect or a senior dev who designs complex system architectures, you can relate to what we are talking about here. So, if the problem is chaotic in nature, the solution has to be based on some order.

The Azure Well-Architected Framework is a collection of actionable and thought-provoking items: the pillars of well-architected design, tools for assessing flaws and gaps in your current design, approach or system, and training guides that help solution architects adopt the framework's ideology.

The framework pushes you toward building resilient and fault-tolerant workloads or features, so that reliability, performance, security or cost-related issues do not give us chills when they arrive. It also answers the random shower thoughts developers tend to have about their system: will it survive, and how resilient is it when exceptions or anomalies happen?

In this article, we will discover some key aspects of the Azure Well-Architected Framework which make it a divine arsenal for an architect. We will touch on these areas of the framework:

  • Understanding the Pillars
  • Identification of Critical Flows
  • Identifying Az Resources Involved
  • Performing Failure Mode Analysis
  • Defining Target Matrix
  • Calculating Costs
  • Monitoring, Troubleshooting and Alerts
  • Data Classification
  • References

Understanding the Pillars

The pillars cover the entire spectrum of architecture for any workload. Each pillar covers design principles, tradeoffs and recommendations. The checklist of recommendations helps you align your workload with the intentions of the pillar. Some recommendations apply per workload or feature, while others address the overall system or the practices followed within your dev team. The pillars are listed below:

  1. Reliability
  2. Security
  3. Cost Optimization
  4. Operational Excellence
  5. Performance Efficiency

Each of these pillars has tradeoffs against the others. We have to choose which to prefer based on our business needs and requirements. Some recommendations concern the cultural adoption of processes that strengthen the pillars' objectives.

The assessment tool lets you choose which pillars to assess. The result shows a score for each pillar along with recommendations to improve it.

It is important to go through what each pillar offers so that you can align with the ideology of well-architected design.

Read more from here.

Identification of Critical Flows

The framework emphasizes identifying your critical flows and dividing them into 2 broad categories:

  • User Flow: A flow in which the user interacts with the system (e.g. filling a cart, placing an order).
  • System Flow: A flow in which the user doesn't interact with the system; instead, it runs in the background.

Once you have categorized your flows, you can record the following details for each flow:

  • Flow Identifier: Each flow can be associated with a unique ID (e.g. FOO-SF-01, FOO-UF-01).
  • Flow Name: A flow name must be provided (e.g. Order Placement).
  • Flow Objective: The defined objective of the flow.
  • Flow Details: High-level and (optionally) low-level details of the flow.
  • Criticality Rating: An indication of criticality: High, Medium or Low.
  • Business Process Owners and Stakeholders: Owners of the flow.
  • Business Process: How many processes are linked with the flow.
  • Business Impact: The positive and negative impacts of the flow.
  • Escalation Paths: A lean escalation path for reaching a resolution as early as possible.

This builds confidence in the system's workloads, since each flow is documented with its criticality, its impact and a quick route to resolution in case of escalation. The activity leaves your team with many documented flows, each with appropriate owners and details.
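To make the flow details concrete, here is a minimal sketch of how such a flow catalog could be kept in code. The `Flow` and `Criticality` names, the chosen subset of fields and the "Nightly Invoice Export" flow are illustrative assumptions, not something the framework prescribes; only the IDs and the Order Placement name come from the examples above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Criticality(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class Flow:
    """One documented flow (hypothetical structure, subset of the fields above)."""
    flow_id: str                      # e.g. "FOO-UF-01"
    name: str                         # e.g. "Order Placement"
    kind: str                         # "user" or "system"
    objective: str
    criticality: Criticality
    owners: list[str] = field(default_factory=list)
    escalation_path: list[str] = field(default_factory=list)


def by_criticality(flows):
    """Return the flows ordered from most to least critical."""
    return sorted(flows, key=lambda f: f.criticality.value)


flows = [
    Flow("FOO-SF-01", "Nightly Invoice Export", "system",
         "Export invoices to the finance system", Criticality.MEDIUM,
         owners=["Finance Ops"], escalation_path=["On-call dev", "Team lead"]),
    Flow("FOO-UF-01", "Order Placement", "user",
         "Let a customer place an order", Criticality.HIGH,
         owners=["Sales"], escalation_path=["On-call dev", "Engineering manager"]),
]

ordered_ids = [f.flow_id for f in by_criticality(flows)]
print(ordered_ids)  # most critical flow first
```

Keeping the catalog as data rather than free text makes it easy to sort, filter or export the flows when reviewing them with their owners.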

Read more from here.

Identifying Az Resources Involved

Once we have identified the flows and their criticality, we can identify the Azure resources involved, either for each flow individually or for the entire workload as a whole.

Linking Azure resources to each flow or workload makes the dependency on Azure services clear and further helps in performing FMA and cost calculations.

For example, the user flow Order Placement uses an Order API hosted on Azure App Service and a front-end React app hosted on Azure Static Web Apps, along with Azure SQL Database.
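A small sketch of this linking, assuming a plain dictionary from flow ID to resources is enough. The Order Placement entry mirrors the example; the second flow and its use of Azure Functions are hypothetical. Inverting the mapping answers the question "which flows are affected if this resource fails?".

```python
from collections import defaultdict

# Flow ID -> Azure resources it depends on.
# FOO-UF-01 follows the Order Placement example; FOO-SF-01 is hypothetical.
flow_resources = {
    "FOO-UF-01": ["Azure App Service", "Azure Static Web Apps", "Azure SQL Database"],
    "FOO-SF-01": ["Azure Functions", "Azure SQL Database"],
}


def flows_by_resource(mapping):
    """Invert the mapping: resource -> list of flow IDs that depend on it."""
    inverted = defaultdict(list)
    for flow_id, resources in mapping.items():
        for resource in resources:
            inverted[resource].append(flow_id)
    return dict(inverted)


impact = flows_by_resource(flow_resources)
print(impact["Azure SQL Database"])  # every flow that touches the database
```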

Performing Failure Mode Analysis

Failure Mode Analysis (FMA) is the process of finding the risks associated with each Azure component or service, along with each risk's likelihood and effect, and then deciding how to mitigate it. We can perform this for the entire workload or on a per-flow basis. Check out the example - FMA Example.
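As a rough illustration, an FMA worksheet can be kept as a list of records and ranked so the riskiest items are discussed first. The 1-3 scales, the likelihood × impact score and the sample rows are assumptions made for this sketch; the framework does not mandate a particular scoring formula.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    risk: str
    likelihood: int   # 1 (rare) .. 3 (frequent), hypothetical scale
    impact: int       # 1 (minor) .. 3 (severe), hypothetical scale
    mitigation: str

    @property
    def score(self) -> int:
        # Simple ranking heuristic: likelihood multiplied by impact.
        return self.likelihood * self.impact


# Sample rows for illustration only.
failure_modes = [
    FailureMode("Azure App Service", "Instance becomes unavailable", 1, 3,
                "Run more than one instance"),
    FailureMode("Azure Static Web Apps", "Stale cached assets", 1, 1,
                "Version asset file names"),
    FailureMode("Azure SQL Database", "Transient connection failure", 3, 2,
                "Retry with backoff"),
]

ranked = sorted(failure_modes, key=lambda m: m.score, reverse=True)
for m in ranked:
    print(m.score, m.component, "-", m.risk, "->", m.mitigation)
```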

Read more from here.

Defining Target Matrix

This covers the part we mentioned earlier: how resilient your workload is in case of exceptions or anomalies. A target matrix needs to be prepared which covers SLO (service level objective), SLI (service level indicator), SLA (service level agreement), MTTR (mean time to recover), MTBF (mean time between failures), RTO (recovery time objective) and RPO (recovery point objective). In short, some weirdo terminologies. Jokes aside, these metrics cover very important aspects of the system and of individual workloads.

Each system has a defined SLA (the percentage the business has committed to as its achievement target) and an SLO (the percentage your team sets as its own target). We always keep the SLO stricter than the SLA, so that the team notices a problem before the business commitment is breached. The SLI, in turn, is the actual measured figure.
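The relationship between the three can be sketched with a few lines of arithmetic. The 99.9% SLA, the 99.95% SLO and the request counts are made-up sample values, and measuring availability as the share of successful requests is just one possible SLI.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Measured availability as the percentage of successful requests."""
    if total_requests == 0:
        return 100.0
    return 100.0 * successful_requests / total_requests


def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_days * 24 * 60 * (1 - target_percent / 100)


SLA = 99.9    # sample value: what the business committed to
SLO = 99.95   # sample value: the stricter internal target

sli = availability_sli(999_700, 1_000_000)
print(f"SLI {sli:.2f}% | meets SLO: {sli >= SLO} | meets SLA: {sli >= SLA}")
print(f"SLA downtime budget: {allowed_downtime_minutes(SLA):.1f} min per 30 days")
print(f"SLO downtime budget: {allowed_downtime_minutes(SLO):.1f} min per 30 days")
```

Because the SLO is stricter, its downtime budget is smaller, which gives the team room to react before the SLA itself is at risk.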

For recovery we have RTO and RPO, both expressed as time durations: RTO is how long recovery is allowed to take, and RPO is how much data loss is acceptable, measured as the span of time before the failure whose data may be lost. These numbers help us understand how much time we have to recover the system or workload.
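Here is a hedged sketch of how these targets can be checked. The target values are sample numbers, the RPO check assumes data is protected only by periodic backups (so the worst-case loss is one backup interval), and the availability estimate uses the common MTBF / (MTBF + MTTR) approximation.

```python
from datetime import timedelta


def meets_rto(recovery_time: timedelta, rto: timedelta) -> bool:
    """Recovery (measured or rehearsed) must finish within the RTO."""
    return recovery_time <= rto


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With periodic backups, worst-case data loss is one full interval."""
    return backup_interval <= rpo


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


rto = timedelta(hours=1)       # sample target
rpo = timedelta(minutes=15)    # sample target

print(meets_rto(timedelta(minutes=40), rto))   # a 40-minute recovery drill
print(meets_rpo(timedelta(hours=1), rpo))      # hourly backups against a 15-minute RPO
print(f"{availability(720, 2):.4%}")           # one failure a month, 2 hours to recover
```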

Read more from here.

Calculating Costs

This can be done easily once we have identified the Azure resources involved in the system or workload. We need to note the pricing tier of each Azure service being used and then calculate an approximate Monthly Computed Cost. This cost is not the actual usage-based amount billed each month; it is an approximation based on the plans we have bought. For complex services like AKS, calculate the cost based on consumption.
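The calculation itself is a simple sum over the resource list. In the sketch below the tiers and especially the prices are placeholders invented for illustration, not real Azure pricing; actual figures should come from the Azure pricing calculator for your region and currency.

```python
from decimal import Decimal

# (resource, tier, assumed monthly price) - PLACEHOLDER numbers, not real pricing.
plan = [
    ("Azure App Service", "P1v3", Decimal("100.00")),
    ("Azure SQL Database", "S0", Decimal("15.00")),
    ("Azure Static Web Apps", "Standard", Decimal("9.00")),
]


def monthly_computed_cost(items) -> Decimal:
    """Sum the fixed plan prices; usage-based charges are out of scope here."""
    return sum((price for _, _, price in items), Decimal("0"))


total = monthly_computed_cost(plan)
for resource, tier, price in plan:
    print(f"{resource:<24}{tier:<10}{price:>8}")
print(f"{'Monthly Computed Cost':<34}{total:>8}")
```

`Decimal` is used instead of `float` so that currency amounts add up without rounding surprises.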

Monitoring, Troubleshooting and Alerts

Consider keeping a detailed monitoring and troubleshooting guide handy for the team. The guide must cover the different scenarios and cases, along with steps to troubleshoot each issue. Also, configure alerts that fire when there is a surge in consumption or when an anomaly is detected.

For monitoring and alerting, a customized Azure Workbook can also be created.
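To show what a "surge in consumption" rule boils down to, here is a minimal sketch of the logic only. It does not call any Azure API; in practice the rule would be configured in your monitoring tooling. The `is_surge` helper, the factor of 2 and the sample readings are assumptions.

```python
from statistics import mean


def is_surge(history, current, factor=2.0):
    """Flag a surge when the current reading exceeds factor x the recent average."""
    if not history:
        return False          # no baseline yet, so nothing to compare against
    return current > factor * mean(history)


recent_requests_per_minute = [100, 110, 90]   # sample baseline readings

print(is_surge(recent_requests_per_minute, 250))  # well above twice the baseline
print(is_surge(recent_requests_per_minute, 150))  # elevated, but under the threshold
```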

Read more from here.

Data Classification

Consider classifying data as confidential or public and applying labels so that the two can be easily distinguished. Confidential data needs to be handled securely. PII (Personally Identifiable Information) is a typical example of confidential data.
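A minimal sketch of labeling at the field level, assuming a fixed set of PII field names. The field names, the two labels and the masking rule are hypothetical; a real workload would derive them from its own data inventory and policies.

```python
# Hypothetical set of field names treated as PII.
PII_FIELDS = {"full_name", "email", "phone"}


def classify(record: dict) -> dict:
    """Label every field of a record as 'confidential' or 'public'."""
    return {
        key: "confidential" if key in PII_FIELDS else "public"
        for key in record
    }


def mask_confidential(record: dict) -> dict:
    """Return a copy that is safer to log: confidential values are replaced."""
    labels = classify(record)
    return {
        key: "***" if labels[key] == "confidential" else value
        for key, value in record.items()
    }


order = {"order_id": 42, "email": "jane@example.com", "total": 19.99}
print(classify(order))
print(mask_confidential(order))
```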

Read more from here.

References

Quick-link references are listed below:


Zaki Mohammed
Learner, developer, coder and an exceptional omelet lover. Knows how to flip arrays or omelet or arrays of omelet.