To provide a broader picture of what quality and built-in quality mean, I decided to interview selected experts from related disciplines about their views on these topics.
Hi Rodrigo and welcome to the show. Thanks for taking the time to talk with me about built-in quality. Before we get started, please introduce yourself to our readers.
I'm happy to be here, thanks for having me.
I'm Rodrigo, and I am originally from Brazil, which I left in 2015 to pursue the American dream as a software engineer.
I got my first job back in 2006, and there and then I unexpectedly started writing code professionally. At the time, I was an Industrial Engineering undergraduate looking for a job in finance. Upon getting it and realizing the job required a lot of operational tasks to be completed on a daily basis, my manager asked me if I could automate some of them through code. I said I’d give it a try; I guess you could say that I enjoyed it, since I have been doing it ever since!
After living in the US for five years, my visa expired and a job opportunity presented itself in Portugal, which ended up being my next destination. There I had the pleasure of meeting you, as we both worked for Zuhlke Group.
Now I am in Canada because I finally landed my dream job with Amazon. I also love the topic of quality, an interest you certainly helped to fuel.
Thank you, I’m happy to hear that.
As you mentioned, we got to know each other at Zuhlke Group, and we connected pretty easily due to our passion for quality, which is also what brought us here today.
The topic of quality and the expression “built-in quality” are very present in numerous companies these days. There is lots of theory, blogs, and other material available in this regard. But as I am not a huge fan of theory, today I am keen to hear your thoughts as a software engineer on this. What’s your personal interpretation of quality? What does it mean for you? What’s your experience with built-in quality? What do you think about buzzwords like “built-in quality”? Would you mind sharing some thoughts?
I'm going to tell you what I think it is, because I never really read about a framework named built-in quality. So I'm going to base my answer on the name itself.
Personally, whenever I was part of a software development team, it always felt weird to have developers and then testers who were actually developers.
I'm not talking about manual testers or QA people who look at it like a consultant, telling you how to embrace quality and those things. I’m talking more about when a developer writes code and then the tester writes the test for that code. That never sounded right to me.
For me personally, as a software engineer, I always thought that testing is my responsibility. It's part of my job. I would even argue it's more important than writing code, because code is not really done if you don't test it. How else can you prove it works?
I always try to write my tests as early as possible. Because of that, I tend to believe that there should be no distinction between software engineers and testers. It should be the same. So, every software engineer should be a tester.
At Amazon right now that's how it works, and I'm very happy that it works this way. The hands-on work of testing belongs to everyone. So, you see product managers writing test cases using things like Gherkin. Then the developers implement them. That's how it works, and from my limited experience it seems pretty nice. This is my idea of built-in quality: when you don't have to think about quality because it's already ingrained in your process. It's just part of it. When you create a story, the story already has a test defined in it. Everyone is ready to write tests and you don't have to remind them: “Hey, did you test this thing?” “Of course I tested it…it's what I do”.
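A Gherkin scenario of the kind Rodrigo describes, together with the developer test that mirrors it, might look roughly like this. This is a hypothetical illustration; the scenario, the `apply_discount` helper, and all numbers are invented, not taken from Amazon.

```python
# Hypothetical Gherkin scenario a product manager might write:
#
#   Scenario: Discount code reduces the cart total
#     Given a cart with one item priced at 10.00
#     When the customer applies a 20% discount code
#     Then the cart total is 8.00
#
# A developer then implements the behaviour plus a test that
# follows the Given/When/Then steps one-to-one.

def apply_discount(total, percent):
    """Apply a percentage discount to a cart total, rounded to cents."""
    return round(total * (1 - percent / 100), 2)

def test_discount_code_reduces_cart_total():
    total = 10.00                           # Given a cart totalling 10.00
    discounted = apply_discount(total, 20)  # When a 20% code is applied
    assert discounted == 8.00               # Then the total is 8.00

test_discount_code_reduces_cart_total()
```

The point is the workflow, not the arithmetic: the scenario is defined before the story is picked up, so the test exists as soon as the code does.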
I'm very happy for you. That sounds great!
Could you also share a bit from your past? Maybe how it should not work? What were your touch points with testing and quality assurance in the past?
Absolutely. One previous experience I had that did not work well at all was when we were trying to find the sweet spot of team composition.
We had an engineering manager, two to three software engineers, and one to two software engineers in test. That was the formula the company was using. The product manager would just throw out an idea. The engineering manager would prioritize against a roadmap to make sure we didn't build up technical debt, but at this point nobody was thinking about tests. Then we'd throw the feature at the team. The tester had to think about how to test it while the engineers were working on it.
At some point the QA people, the software engineers in test, were asking for more visibility on their work. So it was like: “Okay, let's create a new lane on the kanban board for testing.” And then: “Oh no, a lane is insufficient…let's create a ticket for tests.” So we had one ticket for implementing the user story and another ticket for implementing the test. And I was like: “Don't do it!”. What happened was a natural bottleneck. Tests were a second-class citizen, and when you need to deliver something, you always sacrifice the tests. People felt like: “The feature is done…it's working on my machine…let's release it.” And then a lot of problems arose from that practice.
That sounds really bad. It must have been very frustrating. Were you, as a developer, not allowed to write any tests because that task was reserved for the engineers in test? Or how did you handle it?
At this specific job I was not a developer but an engineering manager, even though I really wanted to be a developer. What happened is that the developers were not really interested in writing tests. I don't know why, because I cannot think of any reason why you wouldn't want to write tests. It's programming, right? That's your job. You write software. It's the same thing but with a different purpose.
I never understood why they didn't like it. My guess is that when you try to test something after you’ve developed it, and you didn’t build it well, it's hard to test.
You try to test it and see it's not working. There are a bunch of dependencies and it's all a big ball of mud. There is no dependency injection or anything like that. Then they're like: “Yeah, I screwed it up…I'm not testing this…I'll leave it for the testers, and they can test in an integration or end-to-end environment, because I cannot write unit tests.” That was my impression. And since the code was hard to test, that only compounded the problem: the testers were overwhelmed because everything was so difficult to test. Testing was never considered from the start.
Wow…was that the main reason why you left that company?
No, to be completely honest, whenever I left a company, it was never because of the company. It was because I wanted to pursue a better place to work and live, which worked out pretty well.
Yes, it sounds really good where you are right now.
From this first month of experience at your new company, can you tell us a bit more about how it works at this huge organisation? How do they manage not to have to think about quality because it is something that comes naturally, built into the software?
The first thing I've seen is that there is no distinction between testers and engineers. They are one and the same. In the onboarding training material, not specifically on quality, there was a meme showing Morpheus from The Matrix saying, “What if I told you that engineers and testers are the same?”. This is the culture they try to build from the start, so that we don't build walls between these two practices. They work best when they are equal, together in a team.
When I looked at the code base, everything, at least in my team, is tested. Of course, some tests are better than others. You can't expect perfection; that is not economically viable. But everything is tested, and the builds don't take too long. Considering the size of the company, the build process takes only some minutes. I have worked at smaller companies where it took hours to build something.
In my new job I've already seen product managers being very involved in testing. They write down ideas of what they think would be good tests. They send a spreadsheet with a bunch of test cases, including expected results and where to find the test data. The engineers look at that, review it, and may add or alter tests, and when everyone agrees that those tests are meaningful for what we're trying to achieve, the engineers start writing the tests and the corresponding code.
And this works. I haven't seen anyone developing something and not testing it. This doesn't happen. It's always tested.
One thing I noticed is that the top priority, and this is a law across the organization, is operational excellence. They are a DevOps organization. There is no distinction between development and operations. Developers are on-call. You must take care of your systems, you have to monitor DORA metrics. You're responsible for everything. This, I think, is the magic sauce that allows Amazon to move incredibly fast despite its humongous size. That's my impression so far.
Sounds amazing, and it shows perfectly that DevOps, if you implement it right, works, and obviously works pretty well at such a big company.
Can you share more details on how your team in particular implements DevOps? How can you take on this big responsibility for the operational part too?
At Amazon we have the concept of a two-pizza team: a team you can feed with two pizzas. That's the sweet spot for the size of a team. Generally, that’s fewer than 10 people; it mostly hovers around seven to nine. In my team we have seven engineers and one manager. The way we embrace DevOps is simply to follow “you build it, you run it”.
This also works very nicely when you're trying to control your technical debt. We have this concept of on-call rotation. There's always at least one engineer on-call. This person is responsible for responding to incidents. If there is an outage or a service is misbehaving, like showing increased latency or something like that, we have alarms for it. There are dashboards with lots of metrics. We have set up alarms to alert us automatically when something is wrong before our customers find out.
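The automatic alarms Rodrigo mentions can be thought of as threshold checks over the metrics on those dashboards. A minimal sketch, with metric names and limits invented purely for illustration (real monitoring systems add time windows, percentiles, and paging integrations):

```python
# Hypothetical alarm check: each metric has a single upper-bound
# threshold; anything above it should page the on-call engineer
# before customers notice.

def breached_alarms(metrics, thresholds):
    """Return, sorted, the names of metrics exceeding their threshold."""
    return sorted(
        name for name, value in metrics.items()
        if value > thresholds.get(name, float("inf"))
    )

# Example (invented numbers): latency is fine, error rate is not.
current = {"p99_latency_ms": 180, "error_rate": 0.04}
limits = {"p99_latency_ms": 500, "error_rate": 0.01}
firing = breached_alarms(current, limits)  # -> ["error_rate"]
```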
If problems are detected by your customers, you hurt your relationships with them. Amazon is extremely customer centric. We do everything we can to avoid impacting our customers.
This on-call responsibility is not only watching those dashboards and being there to respond to those alarms; there is also a separate backlog for trouble tickets.
Trouble tickets are reported by internal or external customers to point out a problem. When there's no active incident, the on-call engineer works on those trouble tickets.
This engineer improves things in the system, pays down technical debt, and so on. If there are no urgent issues, we always have someone focusing on improving the services and making sure they are running smoothly while the rest of the team works on features. This role rotates every week, so every sprint, which is two weeks, we have two on-call engineers: first week one, second week another. I think this works pretty well.
Cool, sounds great.
What I am also curious about: do you have different environments? Where do you deploy, or do you deploy directly to production? How does the deployment process work?
This is very interesting, and it surprised me quite a bit.
You have to take care when you have such a giant customer base; you don't want to have an outage. Therefore our pipeline has phases. There are lots of environments, but there is no such thing as “now you need someone to validate the build on the staging environment” before it moves to the next stage. Everything is automated.
To reduce the blast radius, deployment is split by region, and each region is split into smaller parts. Let's say you have a North America region. Before you deploy to the whole fleet in North America, you deploy to a small portion of it. Your new code is deployed and released only there. You observe your metrics; there are alarms set up. If the fault rate increases on that small subset, which is part of the production fleet, the pipeline stops. It will not publish to the full fleet of that region. It may even roll back automatically to avoid impacting customers.
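The phased rollout described here can be sketched roughly as follows. This is a simplified illustration, not Amazon's actual pipeline: the function names and the 5% canary / 1% fault-rate numbers are assumptions made for the example.

```python
# Rough sketch of a phased (canary) deployment within one region.
# `fleet` maps host name -> currently deployed version.
# `fault_rate_fn(hosts)` returns the observed fault rate on those
# hosts; in reality this would read live production metrics.

def deploy_region(fleet, new_version, fault_rate_fn,
                  canary_fraction=0.05, fault_threshold=0.01):
    """Deploy new_version in two waves: canary subset first, then the rest."""
    hosts = list(fleet)
    canary_size = max(1, int(len(hosts) * canary_fraction))
    canary, rest = hosts[:canary_size], hosts[canary_size:]

    previous = {h: fleet[h] for h in canary}
    for h in canary:
        fleet[h] = new_version            # release to the small subset only

    if fault_rate_fn(canary) > fault_threshold:
        fleet.update(previous)            # automatic rollback of the canary
        return False                      # pipeline stops; full fleet untouched

    for h in rest:
        fleet[h] = new_version            # canary healthy: roll out to the rest
    return True
```

The key property is that a bad release can only ever hurt the canary subset: the check between the two waves is what lets the pipeline stop and roll back before the full regional fleet is touched.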
Before it goes to the production environments there are other test environments that come first, but those are also automated. Code goes in there, builds, and that’s where you run the integration tests. Every service has instances running in those environments, and when you execute integration tests you talk to all the services you need in those environments.
So, all the teams have their instances running in all the environments? There is no environment where you have only simulators?
The test environments have the real services running but with test data. It’s just the environment that changes, there are no mocked services.
Thinking about the size of the company and the number of services, I was expecting to see something different, something that I believe is more scalable. But I may be wrong.
I mean, they know what they’re doing. I used to think that contract testing was way more scalable than integration tests. I still think it is, but maybe it’s not worth changing right now.
And how long does a deployment to production take?
It can take a while because it has to go through all those phases until it reaches all the regions. It may take up to a day depending on the service.
OK, but taking the size of the company into consideration, I think that’s still pretty good. I’ve seen much smaller companies that still have a lot of manual steps in between.
Nothing is manual. If you want to do something manually, you have to justify it; there’s a whole process for manual interventions in the pipeline, because they are risky and may introduce human error. You really have to justify it: something is wrong, the customer is impacted, and that’s why we need to act with a manual intervention.
If there is a showstopper, how fast can you deploy? If, for example, a service is down in North America?
You always have the possibility to do things manually in case of emergency. This also depends on the service, but you can deploy as quickly as needed. You can, if necessary, skip everything; but you have to justify it and you need approval from a senior engineer.
Even with code reviews: if you send a pull request, you can say, “I don't want this to be reviewed. I want to merge it now.” As an engineer you don't need to call someone and say, “Please approve, I am having an emergency.” You can just approve, merge, and then justify why you did it.
Of course, no one is going to abuse that functionality, but it is important for us to be able to move fast when things are on fire.
Thank you very much for these insights. We got a very good impression of how you feel about the whole quality topic, what your experiences are, and where you are today.
From your point of view, what are the most important success factors to ensure high-quality software and to build in quality from the very start? What would you like our readers to remember?
I would like to start with something that has no technical basis.
It's like a philosophy: never sacrifice the long term for the short term. I think this is the first thing. You have to think like an owner. That thing is yours. If you sacrifice the long term, you're the one paying for the problems down the road. This should give you a pretty good basis for thinking about quality as early as possible.
While it is true that thinking about quality early looks like you are putting less time into pushing out features, that initial investment pays off within a matter of a few months. If you do that consistently, you've soon broken even, and it's much faster to work this way than the other way around.
What I see happening is that some projects do not prioritize quality and then get into a vicious cycle: you're late because you don't have quality, and then you don't have time to put in quality because you need to push features. This problem only grows until you're pretty much stuck. And then you can't stop the machines for a year to fix everything. It's not possible; no business can afford that. That is the problem I see.
From the technical side, I believe that engineers and testers are the same. I've always believed that, and I think whenever we try to draw a line between those two roles we're only hurting our profession and the project. There's no distinction. To build software you need quality; that's part of building software. There's no “I build the software; you ensure there is quality in it”. That doesn't make sense. Can you imagine other industries doing the same?
“Okay, I'm only building the car. If it's safe…not my concern…I have a tester for that. That's a crash test dummy job”. Then you would be like “Whaaaat???”
So why do we do that for software? Just because it's intangible? It doesn't feel right.
Yes, absolutely. I really love the example with the car. That's a great picture everybody understands.
Thank you so much for taking the time to share your thoughts. Any last words?
Stop drawing lines in the sand. Engineers and testers are the same. That’s it.