Killing the monsters (Part III) – Engineering tips

PART II: Killing the monster

Automate installation and provisioning of the environments needed for development, quality assurance, staging, and production. Keep the environments as close to each other as you can: same versions of the OS, databases, and libraries. You should be able to access them fast: either keep them provisioned and pay for that, or make provisioning fast. Manual instructions should die, and manual changes should never be applied. Remove humans from the deployment process; the maximum involvement should be clicking the Deploy Now button. Setting up a development environment for a new teammate should take no more than a day or so.
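One way to keep an eye on environment parity is to compare the versions each environment reports. Here is a minimal Python sketch, assuming you can already collect a component-to-version mapping per environment. The `find_drift` helper and the sample version numbers are hypothetical and only illustrate the idea.

```python
def find_drift(environments):
    """Return components whose versions differ between environments.

    `environments` maps an environment name to a {component: version} dict.
    A component missing from an environment is reported with version None.
    """
    components = set().union(*(spec.keys() for spec in environments.values()))
    drift = {}
    for component in sorted(components):
        versions = {env: spec.get(component) for env, spec in environments.items()}
        if len(set(versions.values())) > 1:
            drift[component] = versions
    return drift


if __name__ == "__main__":
    # Sample data is made up for illustration.
    report = find_drift({
        "staging": {"os": "22.04", "database": "15.4"},
        "production": {"os": "22.04", "database": "14.9"},
    })
    for component, versions in report.items():
        print(f"{component} differs: {versions}")
```

Running a check like this on a schedule turns "the environments should match" from a wish into something you can alert on.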
Set up a delivery pipeline and measure its throughput. Classically, the pipeline covers writing code and deploying it to production to deliver value to end customers. In my opinion, it also includes identifying a need (a business or an engineering one) and prioritizing and scheduling the work.
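To make "measure the throughput" concrete, here is a small Python sketch. It assumes you can export two timestamps per work item, when the need was identified and when the change reached production. The `WorkItem` record and both function names are hypothetical, not part of any particular tracker's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical record of one change moving through the pipeline."""
    key: str
    requested_at: datetime  # the need was identified
    deployed_at: datetime   # the change is running in production


def median_lead_time_days(items):
    """Median time from identifying a need to production, in days."""
    one_day = timedelta(days=1)
    return median((item.deployed_at - item.requested_at) / one_day for item in items)


def throughput_per_week(items, window_start, window_end):
    """Average number of items delivered per week within the window."""
    delivered = [i for i in items if window_start <= i.deployed_at < window_end]
    weeks = (window_end - window_start) / timedelta(weeks=1)
    return len(delivered) / weeks
```

Starting the clock at `requested_at` rather than at the first commit matches the wider definition of the pipeline given above: waiting in the backlog counts too.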
Document (even better, automate!) the “magic fixes” for all incidents. You need to be able to replicate them if the issue occurs again. Keep them in your project’s knowledge base. You cannot rely on the hope that the engineer who solved the problem last time will always be available to assist. In other words, changes in the systems that you own should be transparent and repeatable.
Proactively find all fragile parts of your software. If you work on a system that was developed before you joined the team, be ready for surprises: things can break where you do not expect them to. Besides the codebase and project documentation (if your team has good enough documentation), your sources for learning about weak spots can be the results of load testing, metrics and logs from production, the registry of closed bugs, and customer support tickets.
Stabilize the infrastructure so that you can focus on development, not firefighting. It’s hard to make reasonable estimates and avoid overtime when you need not only to develop new features but also to keep existing buggy software up and running. I will post a separate blog on the topic. Stay tuned.
Include slack time in your business commitments. If the engineers on your team are 100% loaded according to your plan, any unplanned work either has to wait in a queue (this is bad) or the commitments won’t be met. Having some idle time for your engineers is fine, since you can predict neither the actual time needed to accomplish the work nor changes in requirements.
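A rough way to see why a fully loaded team is a problem is the textbook single-server queue (M/M/1), where the average wait before work starts, measured in units of the average task duration, is utilization / (1 - utilization). A real team is not an M/M/1 queue, so treat the Python sketch below as an illustration of the trend under that assumption, not as a forecast. The function name is hypothetical.

```python
def relative_queue_wait(utilization: float) -> float:
    """Average wait in queue, in multiples of the average task duration.

    Uses the M/M/1 result Wq = rho / (1 - rho) with service time normalized to 1.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue grows without bound")
    return utilization / (1 - utilization)


if __name__ == "__main__":
    for load in (0.5, 0.8, 0.9, 0.95):
        print(f"{load:.0%} loaded: unplanned work waits about "
              f"{relative_queue_wait(load):.1f}x the average task duration")
```

Under this model, going from 80% to 95% utilization makes the wait several times longer, which is the quantitative version of "100% loaded means the work waits in a queue".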
Avoid handoffs of tasks between engineers and across teams. Context switching kills productivity. Having more than one responsible person encourages corporate ping-pong and makes it harder to get things done.
Measure how often your code CAN be deployed to production. Do you know how many deployments per day your business needs? How many of them can you do without affecting your routine? You want to know the answers at least for the case when an incident occurs and you need to push a hotfix to prevent damage to the company’s reputation.
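A back-of-the-envelope version of this measurement in Python: add up how long each pipeline stage takes, and you get both the fastest possible hotfix and the number of deployments that fit in a working day. The stage names and durations below are made-up placeholders; substitute the numbers measured from your own pipeline.

```python
from datetime import timedelta

# Hypothetical stage durations; replace with measurements from your pipeline.
PIPELINE_STAGES = {
    "build": timedelta(minutes=6),
    "automated tests": timedelta(minutes=14),
    "deploy to staging": timedelta(minutes=5),
    "smoke tests": timedelta(minutes=5),
    "deploy to production": timedelta(minutes=10),
}


def hotfix_lead_time(stages):
    """Time from a merged fix to production if every stage runs back to back."""
    return sum(stages.values(), timedelta())


def max_sequential_deployments(stages, window):
    """How many full pipeline runs fit into the window, one after another."""
    return window // hotfix_lead_time(stages)


if __name__ == "__main__":
    print("Fastest hotfix:", hotfix_lead_time(PIPELINE_STAGES))
    print("Deployments per 8-hour day:",
          max_sequential_deployments(PIPELINE_STAGES, timedelta(hours=8)))
```

The sketch assumes deployments run strictly one after another and ignores waiting for reviews or approvals, so the real capacity is likely lower.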
Make all code changes accountable and authorized. Like infrastructure changes, they should go through a version control system and a peer review process, and sometimes be approved by business or budget owners or by external teams.
Move your working code to production ASAP. Until the code is in production and enabled for customers, no value is generated from product research, Jira tickets, design meetings, writing code, or reviewing pull requests.
Release faster and in small batches. For me, the days when we gave our customers a new version of backend software that runs on our infrastructure once per month are over. Every merged pull request should be deployed individually (and rolled back individually if needed). That way, you can observe how each change affects your system and find failures fast.
Prepare a rollback strategy for the deployment of all large or risky changes. Examples: altering database tables with dozens of records, extreme refactoring, data migrations, switching vendors. If you think that your testing is not enough (or that it’s expensive to cover all needed cases), I would invest in it.
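As one example of a rollback strategy for a schema change, the Python sketch below runs a migration inside an explicit transaction and undoes it if a post-migration check fails. It uses SQLite from the standard library because SQLite rolls back schema changes made inside a transaction; not every database engine does (MySQL, for instance, commits DDL implicitly), so check yours first. The `migrate_with_rollback` helper and the `verify` callback are hypothetical names.

```python
import sqlite3


def migrate_with_rollback(conn, statements, verify):
    """Apply the statements atomically; undo everything if a step or the check fails.

    `conn` must be opened with isolation_level=None so that the explicit
    BEGIN/COMMIT/ROLLBACK below are the only transaction control in play.
    `verify` receives the connection and returns True if the result looks right.
    """
    conn.execute("BEGIN")
    try:
        for statement in statements:
            conn.execute(statement)
        if not verify(conn):
            raise RuntimeError("post-migration check failed")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('ann')")
    applied = migrate_with_rollback(
        conn,
        ["ALTER TABLE users ADD COLUMN email TEXT"],
        # The check fails on purpose: the new column was never filled in.
        lambda c: c.execute(
            "SELECT COUNT(*) FROM users WHERE email IS NULL"
        ).fetchone()[0] == 0,
    )
    print("migration applied:", applied)
```

The point is less the specific mechanism and more that the undo path is written and exercised before the risky change goes out.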
Know about your incidents before your customers or the business find out. First of all, it gives you more time to investigate and fix the issue. Second, a status page that is updated on time is the face of your team. It is simply about caring for the feelings of your customers.
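A bare-bones way to learn about an outage first is to probe your own service and raise an alert after several failures in a row. The Python sketch below uses only the standard library; the health URL, the threshold of three failures, and the `FailureDetector` class are assumptions for illustration, and a real setup would more likely rely on a dedicated monitoring tool.

```python
import http.client
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        return False


class FailureDetector:
    """Signals an alert once, at the moment failures reach the threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one probe result; return True if an alert should fire now."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold


if __name__ == "__main__":
    import time

    detector = FailureDetector(threshold=3)
    while True:
        # Hypothetical endpoint; point this at your own health check.
        if detector.record(probe("https://example.com/health")):
            print("ALERT: health check failed 3 times in a row")
        time.sleep(30)
```

Requiring several consecutive failures keeps a single dropped request from paging anyone, at the cost of a slightly later alert.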
Build a passionate team that is OK with working late hours and weekends to rescue the business when it’s really needed. That work should eventually be compensated, including with additional days off to recover and spend time with family or friends. You can also set up an on-call rotation so that somebody is on duty 24/7 and ready to fix any problems.

Credits: DevOps Book Club: Does Your Engineering Team Help Your Business To Win?
