How to Build AI-Powered Workflows and Systems
In 2025, I’ve read countless LinkedIn posts about how some CTOs have fully transitioned their tech departments to be AI-based, with agents working all night so you can push to production in the morning. I’ve also had numerous conversations with engineering leaders on the subject, from startups with a handful of employees to large international corporations. While I haven’t seen such a fully AI-powered tech team outside of the social media tales, it’s clear that AI automation is no longer a nice-to-have. It’s becoming a competitive advantage, and if you don’t have it, your competitors will. The learning curve is real, but once teams move beyond basic usage like IDE suggestions or chat assistants, the productivity gains are striking. From what I see and hear, AI automation can boost productivity by about 1.5x for teams that embrace it thoughtfully.
Industry data backs this up. GitHub’s 2024 report shows that developers using Copilot complete tasks 55% faster, while McKinsey estimates AI adoption can increase productivity by 20–45%. Automation has always been part of engineering culture, but it’s a bit different now: the barriers to entry have collapsed. What used to take weeks can now be done in hours, and not necessarily by skilled developers, as this blog post from Alan demonstrates. That changes everything.
I started my own journey into AI-based automation in 2025 as well, and I’ll share how I approached it through a concrete example: AI-driven end-to-end testing for WordPress websites, a talk I gave at WordCamp Netherlands. Along the way, I’ll highlight principles that apply far beyond QA.
TL;DR – How to approach AI automation
- Context is key: In most of the failure cases I’ve faced or heard about, the issue isn’t that the LLM is not powerful enough but rather that it’s missing context. Waiting for newer models won’t change the outcome much; spending time providing and refining the context will.
- Iterate: You won’t get it right on the first try. Even with LLMs, iterations and feedback loops are key, just like when you write code or train a new team member.
- Invest continuously in instruction files: As you iterate, always capture (or ask the AI to capture) learnings, refinements, mistakes, and feedback to build guideline files that fine-tune the LLM’s behavior to your expectations and ways of working.
- Dive in: Tech has never moved as fast as it has in the past couple of years. Using AI and LLMs efficiently is not that easy; you need to learn, and there’s no better way to learn than by trying.
AI automation in the CI for end-to-end testing for WordPress
The BackstopJS Honeymoon
Picture this: you discover BackstopJS, a fantastic tool for visual regression testing. It’s simple. You define your scenarios, take a snapshot, make changes, and compare the new version against the reference. You think: “This is great! Let’s add tests for every page so nothing breaks without us knowing.”
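To make this concrete, here is a minimal sketch of what such a backstop.json could look like. The site URL, scenario labels, and thresholds are illustrative placeholders, not the exact setup from the talk:

```json
{
  "id": "wordpress_site",
  "viewports": [
    { "label": "desktop", "width": 1280, "height": 800 },
    { "label": "phone", "width": 375, "height": 667 }
  ],
  "scenarios": [
    { "label": "Home", "url": "https://example.test/", "misMatchThreshold": 0.1 },
    { "label": "Blog", "url": "https://example.test/blog/", "misMatchThreshold": 0.1 }
  ],
  "paths": {
    "bitmaps_reference": "backstop_data/bitmaps_reference",
    "bitmaps_test": "backstop_data/bitmaps_test",
    "html_report": "backstop_data/html_report"
  },
  "report": ["browser"],
  "engine": "puppeteer"
}
```

You capture the baseline once with backstop reference, then run backstop test after each change to compare against it.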
And then reality hits. Every frontend change triggers hundreds of BackstopJS reports. You spend hours reviewing them, asking: Is this change expected? Is it a bug? Did the task impact the UI? Your initial enthusiasm fades. Visual regression testing becomes a bottleneck instead of a safety net: for each expected frontend change you make, many tests fail. Some failures are true positives, but they are buried under all the tests that are expected to fail because of your change. This is alert fatigue: at some point, you get used to your CI failing all the time and stop looking at it. All your fancy tests end up running for nothing.
Where AI Steps In: Triage at Scale
This is where automation can come in handy. But before LLMs, how would you approach this? An automated triage system would need to identify which PRs are expected to introduce a frontend change, and which pages would be impacted. I have seen some teams handle this through labels: by putting labels on each PR, engineers tell the system what the PR is supposed to do. It works, but the overhead is significant and mistakes happen often.
LLMs offer a new approach to this: the triage system can now read your ticket and your PR description to identify what the intent of the change is. It can also understand the intent of each test. So ultimately, it can identify which tests are expected to break, and which ones aren’t. You can build an AI-based system to filter expected versus unexpected changes.
Here’s the workflow I demonstrated in my talk (see scenarios 1, 2 & 3 on GitHub): Run your visual regression tests as usual. Then, feed the BackstopJS reports to an AI model with some context, such as the GitHub issue or the JIRA ticket. This context allows the AI to classify failures: expected changes are marked OK, unexpected ones are flagged for review.
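As a rough illustration, that triage step can be a small script in the CI job: it bundles the BackstopJS output with the ticket or PR description and asks an LLM for a verdict. The sketch below assumes the OpenAI Node SDK, an example model name, and illustrative file paths; it is not the exact code from the talk:

```typescript
// triage.ts - hypothetical sketch of an AI triage step for BackstopJS failures
import { readFile } from "node:fs/promises";
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set in the CI environment

async function triage() {
  // Adjust this path to wherever your BackstopJS setup writes its machine-readable report.
  const report = await readFile("backstop_data/json_report/jsonReport.json", "utf8");
  // Context for the classification: the GitHub issue / JIRA ticket describing the change.
  const ticket = await readFile("ticket.md", "utf8");

  const response = await client.chat.completions.create({
    model: "gpt-4o", // example model; use whatever your team has access to
    messages: [
      {
        role: "system",
        content:
          "You triage visual regression failures. Given the intent of the change and the " +
          "failing scenarios, label each failure EXPECTED or UNEXPECTED and explain why in one sentence.",
      },
      { role: "user", content: `Change intent:\n${ticket}\n\nBackstopJS report:\n${report}` },
    ],
  });

  const verdict = response.choices[0].message.content ?? "";
  console.log(verdict);

  // Fail the CI job only when at least one failure looks unexpected.
  if (verdict.includes("UNEXPECTED")) process.exit(1);
}

triage().catch((err) => {
  console.error(err);
  process.exit(1);
});
```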
This simple triage saves hours and keeps teams focused on real issues. It’s a perfect example of AI reducing repetitive work and improving quality assurance.
Everyone can build
From Natural Language to Automated Tests
Now that our CI was back on track and test results were no longer ignored, we wanted to double down on automated end-to-end testing. We have QA teams with great knowledge of our products: how they should work, but also how they can fail. They have the functional knowledge needed to design end-to-end tests. But until recently, the best they could do was write test scenarios in natural language in our test knowledge base. While tools like Playwright and BackstopJS are accessible without advanced software engineering skills, complex test scenarios can become cumbersome for non-developers to implement and maintain. AI can bridge this gap by converting natural language test plans into Playwright code.
Here’s an example from my slides:
Go to Appearance → Menus. In the Primary menu, add a new page to be linked: URL: /promo – Link text: Promo. If the Promo page doesn’t exist, create it first. Save the menu. Go to the home page and verify the link. Click Promo and verify the page loads. Undo: remove the Promo item and trash the page.
Not super precise, right? It’s good enough for QA engineers who are used to working with WordPress. It turns out it’s good enough for AI, too: it can generate a Playwright script from this. And it did.
Except… it failed. The first run threw errors: the AI didn’t wait for navigation properly, and some selectors were off. It broke just like my own code does; I rarely get it right on the first try either. So I ran the test manually, collected the error report, and sent it back to the AI, asking it to fix the test.
The AI iterated. Sometimes it fixed the issue, sometimes it introduced new ones. I kept iterating a few times until the test passed.
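For illustration, here is a simplified sketch of roughly what the passing test ended up looking like. The selectors and URL patterns are assumptions (real WordPress markup varies by version and theme), and the page-creation, menu-editing, and cleanup steps are omitted for brevity:

```typescript
import { test, expect } from "@playwright/test";

// Assumes a baseURL is configured in playwright.config.ts.
test("Promo link is in the primary menu and the page loads", async ({ page }) => {
  // Verify the link shows up on the home page.
  await page.goto("/");
  const promoLink = page.getByRole("link", { name: "Promo" });
  await expect(promoLink).toBeVisible();

  // Click it and explicitly wait for navigation before asserting,
  // the kind of wait the early AI-generated versions tended to skip.
  await promoLink.click();
  await page.waitForURL("**/promo");
  await expect(page.getByRole("heading", { name: "Promo" })).toBeVisible();
});
```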
Why Manual Steps Matter Before Full Automation
At first, I handled the iteration manually. I could have automated the feedback loop from day one, but doing it myself helped me understand where the AI struggles, what context it needs, and how to structure prompts and guidelines. It can be tempting to rush to the automation part early in the process, but this usually results in flaky and poorly designed automations. To automate something, one must first master it manually, understand the ins and outs, and optimize the process and cycle time; this is a famous recommendation from Elon Musk.
Once I was confident I had reached an efficient manual loop, I automated it. The AI now runs tests, collects errors, updates code, and even proposes guideline updates via pull requests. Peer review is a well-anchored practice in tech and engineering, and there is no reason for AI to be exempt from it: those pull requests are reviewed and eventually merged by a human, so we keep an eye on how the system behaves and evolves and maintain a good understanding of it. This is why we often hear “keep humans in the loop”: just like with a junior team member, it lets us provide feedback, maintain a shared understanding of the new code, and ensure the whole team (AI included) works in the same direction.
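The loop itself can stay simple. Here is a hedged sketch of the idea: run the test, and if it fails, hand the error output and the current guidelines back to the model and ask for a revised test file. The file paths, model name, and the final pull-request step are hypothetical placeholders; in practice the PR is opened with your usual tooling and reviewed by a human:

```typescript
// feedback-loop.ts - hypothetical sketch of the run/fix loop, not production code
import { execSync } from "node:child_process";
import { readFile, writeFile } from "node:fs/promises";
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set
const TEST_FILE = "tests/promo-menu.spec.ts"; // illustrative path
const MAX_ITERATIONS = 5;

async function run() {
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    try {
      execSync(`npx playwright test ${TEST_FILE}`, { stdio: "pipe", encoding: "utf8" });
      console.log("Test passed; open a pull request for human review with your usual tooling.");
      return;
    } catch (err: any) {
      // Playwright's failure output becomes the feedback for the next attempt.
      const errorReport = `${err.stdout ?? ""}\n${err.stderr ?? ""}`;
      const currentCode = await readFile(TEST_FILE, "utf8");
      const guidelines = await readFile("ai-guidelines.md", "utf8"); // the instruction file

      const response = await client.chat.completions.create({
        model: "gpt-4o", // example model
        messages: [
          { role: "system", content: `Fix this Playwright test. Follow these guidelines:\n${guidelines}` },
          { role: "user", content: `Current test:\n${currentCode}\n\nError report:\n${errorReport}` },
        ],
      });

      // Naive: assumes the model returns only the updated file content.
      await writeFile(TEST_FILE, response.choices[0].message.content ?? currentCode);
    }
  }
  console.error("Still failing after max iterations; needs a human.");
  process.exit(1);
}

run();
```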
The Power of Instruction Files
While doing those manual iterations, I realized I was often repeating myself. The AI tended to make the same mistakes, often because I had never told it clearly what I expected, or because its default approach differed from mine. So I started building an instruction file: every time I spotted a defect or something to improve in an iteration (for example, “Always wait for navigation after clicking”), I added it to the file. After each loop, I asked the AI to update the file with the core learning of that iteration.
Over time, this file grew into a set of instructions and guidelines capturing my expectations and best practices, making future generations more accurate. I worked through each problem one by one and made sure it would not occur again by documenting the feedback for the AI, just as you would for a junior developer. It allows for learning and, ultimately, consistency.
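To give a sense of what this looks like in practice, here is an illustrative excerpt; the entries below are examples in the spirit of the real file, not the file itself:

```markdown
# Playwright test generation guidelines

- Always wait for navigation (page.waitForURL) after clicking a link or submitting a form.
- Prefer role-based selectors (getByRole) over CSS classes, which change with themes.
- Every scenario must clean up after itself: undo content changes in a final "Undo" step.
- Use the baseURL from the Playwright config; never hardcode environment hostnames.
```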
Since LLMs became popular with ChatGPT, I have heard many people complain that AI is not yet powerful enough for their use case, that it misses the point of the task, or that it simply can’t do the work, and that the solution is to wait for more powerful models. I disagree: I am convinced, and have experienced firsthand, that the models available today (GPT-4/5, Claude Sonnet, etc.) are powerful enough to cover most of the use cases people can think of. What matters now is the context: the LLM does not know it all. Just like with a junior developer, it’s the senior or team lead’s responsibility to give it the right context and guidelines and to collaborate with it on the complex parts (functional design, feedback, etc.) for it to succeed.
flowpilot.one – Different project, same learnings
End-to-end testing for WordPress was not the only challenge AI helped me tackle in 2025. We also worked on a bigger one: building websites from prompts. We launched https://flowpilot.one, a web app that creates WordPress sites from user prompts. Initially, it was fully LLM-driven: every decision, from layout design to writing content, was made by an agentic system fully powered by LLMs.
This allowed us to ship fast: LLMs were good enough at handling most of those tasks. But as we iterated, we realized some decisions were better handled by dedicated algorithms: faster, more accurate, and cheaper. For instance, some design tasks were handled very well by a traditional decision tree, while others were well addressed by small, custom neural networks. We integrated these algorithms into the agentic system so the LLM could delegate some decisions to them when needed.
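As a rough illustration of that delegation, the orchestrator can check for a specialized handler before falling back to the LLM. The names and the toy rule below are hypothetical, not flowpilot.one’s actual code:

```typescript
// decision-router.ts - hypothetical sketch of delegating decisions away from the LLM
type Decision = { kind: string; input: Record<string, unknown> };

// Deterministic handlers for decisions we understand well enough to hardcode.
const specializedHandlers: Record<string, (input: Record<string, unknown>) => unknown> = {
  // A tiny rule standing in for the real decision tree that picks a layout.
  layout: (input) => ((input.sections as string[]).length > 4 ? "multi-column" : "single-column"),
};

async function decide(decision: Decision, askLLM: (d: Decision) => Promise<unknown>) {
  const handler = specializedHandlers[decision.kind];
  // Fast, cheap, and predictable when a dedicated algorithm exists...
  if (handler) return handler(decision.input);
  // ...and the LLM handles everything we have not specialized yet.
  return askLLM(decision);
}
```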
AI & LLMs were great to have a first automation with low development cost to begin with. But then, we iterated to refine the system and the approaches, and tactically find parts where LLM was not the best approach. We were able to do this because we knew well how to build a WordPress website to begin with. We understood the ins and outs of the process we automated, so we were able to deep-dive into specific parts when needed. When it’s time for you to automate, start broad with AI for speed, then refine with specialized tools as you learn. AI accelerates discovery and lowers the barrier to entry, but it’s not always the best solution ultimately.
I am Mathieu Lamiot, a France-based tech-enthusiast engineer passionate about system design and bringing brilliant individuals together as a team. I dedicate myself to Tech & Engineering management, leading teams to make an impact, deliver value, and accomplish great things.