Overview
A two-part project. First, a comprehensive design audit of a large B2B platform with 12+ apps - inventorying every component, identifying recurring patterns, and surfacing variant gaps that were blocking design system adoption. Second, codifying the entire audit methodology as a Claude Code slash command, so the same audit can be re-run on new screens by any teammate without me in the loop.
The key insight: the methodology was as portable as the findings. By turning the audit into a tool, the deliverable stopped being a static document and became a living capability the team could reach for whenever a new screen, app, or component family needed review.
The challenge
The design system had grown organically across 12+ apps. Teams built local components when the system did not cover their case, but no one had a clear picture of which patterns were proliferating, where, or whether the system needed new variants or new components.
Without a baseline inventory, every design system decision was guesswork. The component roadmap was reactive: build whatever the loudest team asked for next. Adoption stalled. Engineering was blocked on components that "looked complete" in Figma but lacked the variants production actually used.
The objective: produce a single source of truth covering every component, every pattern, every gap - severity-tagged so leadership could prioritise. And do it in a way that would not require me to redo the entire exercise the next time the question came up.
Part 1 - The audit
129 screenshots, 46 unique screens, 12+ applications. The audit had to handle wildly different domains - dashboards, planners, file managers, form-heavy editors, map interfaces, calendar grids - without becoming a soup of inconsistent observations.
Methodology
I structured the audit as seven sequential phases, each producing an artifact the next phase depends on:
- Inventory - every visible UI element, categorised into eight canonical buckets (Form Inputs, Buttons and Actions, Data Display, Navigation, Feedback and Messaging, Layout and Structure, Date and Time, Surfaces).
- Pattern identification - for each high-usage component, identifying distinct patterns. Two instances count as the same pattern only if they share layout, content slots, and interaction behaviour. Pure content differences do not qualify.
- Cross-pattern matrices - three matrices per component (Visual Properties, Interactivity, Content Slots) that make divergence visible at a glance.
- Variant gap analysis - observed variants minus documented variants equals gaps. Documented minus observed equals dead weight worth flagging.
- Component recommendation - a clear API decision per multi-pattern component. Compound or single? Which variants earn their place? What should NOT be built in?
- Severity tagging - every finding gets a P0 to P3 tag using fixed thresholds (instance count, app spread, workaround availability).
- Report assembly - everything lands in a structured Markdown deliverable with a stable section order, so downstream readers know exactly where to look.
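The variant gap arithmetic in the fourth phase is plain set difference. A minimal sketch in Python, assuming variants are tracked as string labels; the function name and the Button variants below are hypothetical, not taken from the audit:

```python
def variant_gaps(observed: set[str], documented: set[str]) -> dict[str, set[str]]:
    """Compare variants seen in production with variants the system documents."""
    return {
        # Used in production but missing from the design system.
        "gaps": observed - documented,
        # Documented but never seen in the audited screens.
        "dead_weight": documented - observed,
    }


# Hypothetical example data for a Button component.
observed = {"primary", "secondary", "ghost", "danger"}
documented = {"primary", "secondary", "link"}

result = variant_gaps(observed, documented)
print(sorted(result["gaps"]))         # ['danger', 'ghost']
print(sorted(result["dead_weight"]))  # ['link']
```

Keeping both directions in one result makes it harder to report gaps while quietly ignoring dead weight.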
Each pattern is documented with an ASCII anatomy diagram. These diagrams turned out to be the most-used artifact of the whole audit: designers downstream read them to understand a pattern without opening the screenshot again.
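The same-pattern rule from the pattern identification phase can be read as a grouping key: layout, content slots, and interaction behaviour go into the key, and content stays out. A rough sketch, assuming each observed instance has been recorded with these fields; the `Instance` type and all sample values are hypothetical:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    screen: str
    layout: str            # e.g. "stacked", "horizontal"
    slots: frozenset[str]  # content slots present, e.g. {"title", "body"}
    interaction: str       # e.g. "static", "clickable"
    content: str           # actual text; ignored when comparing patterns

    def pattern_key(self) -> tuple[str, frozenset[str], str]:
        # Content is deliberately left out: pure content differences
        # do not make a new pattern.
        return (self.layout, self.slots, self.interaction)


def group_by_pattern(instances):
    """Group instances that share layout, slots, and interaction."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst.pattern_key()].append(inst)
    return dict(groups)


instances = [
    Instance("orders", "stacked", frozenset({"title", "body"}), "static", "Order 1"),
    Instance("invoices", "stacked", frozenset({"title", "body"}), "static", "Invoice 9"),
    Instance("planner", "stacked", frozenset({"title", "body"}), "clickable", "Shift A"),
]

patterns = group_by_pattern(instances)
print(len(patterns))  # 2
```

The first two instances collapse into one pattern despite different text; the third splits off because its interaction differs.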
Findings framework
The audit surfaced the gap between perceived completeness and actual coverage. Headline numbers (anonymised):
- 14 of 37 components built (38%)
- 24 of 37 components designed in Figma (65%)
- 13 components missing both designs and code
- One critical missing component blocking an estimated 800+ nested instances across 40%+ of screens
- One component with 12 distinct production patterns - far more diverse than the design system assumed
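The coverage figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes every built component also has a Figma design, so that the components missing both are simply the ones never designed; that assumption is mine and is not stated in the audit.

```python
TOTAL = 37
built = 14
designed = 24

print(f"built: {built / TOTAL:.0%}")        # built: 38%
print(f"designed: {designed / TOTAL:.0%}")  # designed: 65%

# Assumption: every built component also has a Figma design, so the
# components missing both designs and code are the ones never designed.
missing_both = TOTAL - designed
print(missing_both)  # 13
```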
The most interesting finding was not a missing component. It was the realisation that the heavily-used Card had 12 distinct patterns, and most of them did not share a fixed layout contract. The recommendation: do not build a compound component with sub-slots. Make Card a surface primitive with three variants, a selected state, and free-form children. Let the apps compose.
That recommendation could not have come from looking at one screen at a time. It only became visible by stacking 12 anatomies side by side in a Content Slots matrix and noticing that no two patterns had the same shape.
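To illustrate the kind of comparison a Content Slots matrix enables, here is a small sketch with three hypothetical Card patterns (the audit compared 12; the pattern and slot names here are invented for illustration). It prints the matrix, then checks which slots are shared by every pattern and whether any two patterns have the same shape.

```python
# Hypothetical slot inventory for three Card patterns.
card_patterns = {
    "summary":  {"title", "metric", "trend"},
    "file":     {"thumbnail", "title", "actions"},
    "calendar": {"time", "title", "avatar"},
}


def slot_matrix(patterns):
    """Return (slots, rows): one row per pattern, one column per slot."""
    slots = sorted(set().union(*patterns.values()))
    rows = {name: [slot in used for slot in slots] for name, used in patterns.items()}
    return slots, rows


def shared_slots(patterns):
    """Slots present in every pattern: the only safe basis for fixed sub-slots."""
    return set.intersection(*patterns.values())


slots, rows = slot_matrix(card_patterns)
print("pattern".ljust(10) + " ".join(s.ljust(9) for s in slots))
for name, row in rows.items():
    print(name.ljust(10) + " ".join(("x" if used else ".").ljust(9) for used in row))

print(shared_slots(card_patterns))  # {'title'}

# True when no two patterns share the same set of slots.
distinct_shapes = {frozenset(p) for p in card_patterns.values()}
print(len(distinct_shapes) == len(card_patterns))  # True
```

When the shared set is this small and every shape is distinct, a compound component with fixed sub-slots has little to standardise, which is the argument for a surface primitive with free-form children.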
Part 2 - Codifying it as a skill
Halfway through the audit I noticed I was doing the same thing over and over - same questions, same matrices, same severity calls. The methodology was disciplined enough that another designer could in principle follow it. So why not let an AI follow it?
I rebuilt the audit methodology as a Claude Code slash command - a single Markdown file teammates install in their project's .claude/commands/ folder. They drop screenshots into Claude Code, type /design-audit, and the same seven-phase audit runs against the new screens with the same severity thresholds, the same report structure, the same recommendation logic.
From document to tool
First attempt: a multi-file plugin with separate reference files for each rubric, a templates folder, an examples folder. Technically cleaner but installation friction was high - teammates needed to register a marketplace, then install the plugin, then learn a separate convention.
Second attempt: a single Markdown file matching the team's existing slash command conventions. No frontmatter, ## Input with $ARGUMENTS, ## Task, ## Rules, ## Section Order, ## Output Structure, ## Success Criteria. British English. No em dashes. Drops into .claude/commands/ next to the existing commands. Zero new conventions for the team to learn.
The single-file version embeds the full rubric - severity thresholds, the eight-category taxonomy, the seven phases, the report template, the success criteria - all in one place a designer can read and adjust without navigating a folder tree.
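The conventions described above are mechanical enough to lint. A minimal sketch of such a check, assuming the six headings named earlier are the complete, ordered set of level-two headings in a command file; the function name and the sample file contents are hypothetical:

```python
REQUIRED_SECTIONS = [
    "## Input",
    "## Task",
    "## Rules",
    "## Section Order",
    "## Output Structure",
    "## Success Criteria",
]


def check_command_file(text: str) -> list[str]:
    """Return a list of convention violations; an empty list means the file passes."""
    problems = []
    lines = text.splitlines()

    if lines and lines[0].strip() == "---":
        problems.append("file starts with frontmatter")
    if "\u2014" in text:
        problems.append("file contains an em dash")
    if "$ARGUMENTS" not in text:
        problems.append("$ARGUMENTS placeholder is missing")

    headings = [line.strip() for line in lines if line.startswith("## ")]
    if headings != REQUIRED_SECTIONS:
        problems.append(f"unexpected section order: {headings}")
    return problems


# Hypothetical skeleton of a command file, with placeholder body text.
parts = []
for heading in REQUIRED_SECTIONS:
    body = "$ARGUMENTS" if heading == "## Input" else "(placeholder)"
    parts.append(f"{heading}\n{body}\n")
sample = "\n".join(parts)

print(check_command_file(sample))  # []
```

A check like this only guards the file's shape; the rubric inside each section still needs a human read.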
Iterating with real use
First production run caught most of the patterns it was supposed to catch. It also missed one I cared about: a page header pattern - back button, divider, icon, title - that appeared on dozens of screens but was never flagged.
Root cause: the command counted "1 Icon Button, 1 Divider, 1 Icon, 1 text" and moved on. It never noticed those four elements formed a recurring composition because the trigger threshold (5+ instances or 20%+ of screens) was built for components, not compositions.
Fix: a new Phase 0 - Composition Pattern Scan - that runs before component inventory and explicitly looks for 16 common page-level compositions (page header, section header, filter bar, action bar, toolbar, form footer, empty state, loading state, error state, modal header, modal footer, list item, sidebar item, card header, tab bar with actions, breadcrumb header). Every one found gets the same anatomy treatment as a component pattern.
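The idea behind Phase 0 can be sketched as a search for recurring element sequences rather than individual elements. The example below is an illustration only: it assumes each screen has been reduced to an ordered list of element types, spells out just one of the 16 compositions, and reuses the 5+ instances or 20%+ of screens trigger for compositions, which is my assumption. The screen data and function names are hypothetical.

```python
# Element sequences per screen, in reading order. Hypothetical data.
screens = {
    "orders":   ["icon_button", "divider", "icon", "text", "table"],
    "invoices": ["icon_button", "divider", "icon", "text", "form"],
    "planner":  ["tabs", "calendar"],
    "files":    ["icon_button", "divider", "icon", "text", "list"],
}

# Only one composition is spelled out; its element signature is an
# assumption based on the page header described above.
COMPOSITIONS = {
    "page_header": ["icon_button", "divider", "icon", "text"],
}


def contains_run(elements, signature):
    """True if `signature` appears as a contiguous run inside `elements`."""
    n = len(signature)
    return any(elements[i:i + n] == signature for i in range(len(elements) - n + 1))


def scan_compositions(screens, compositions, min_instances=5, min_share=0.20):
    """Flag compositions that recur across enough screens."""
    findings = {}
    for name, signature in compositions.items():
        hits = [s for s, elements in screens.items() if contains_run(elements, signature)]
        share = len(hits) / len(screens)
        if len(hits) >= min_instances or share >= min_share:
            findings[name] = {"screens": hits, "share": share}
    return findings


print(scan_compositions(screens, COMPOSITIONS))
# {'page_header': {'screens': ['orders', 'invoices', 'files'], 'share': 0.75}}
```

A per-component count on the same data would report three icon buttons, three dividers, three icons, and three text elements, with nothing to say they belong together.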
That iteration captures the design of the tool itself: the command is a living document. Every miss becomes one more line in the checklist. Each audit makes the next one sharper.
Key results
- Audit informed the design system roadmap for 4 P0 components and 7 P1 components, replacing reactive prioritisation with severity-tagged evidence
- /design-audit command lets any designer on the team run the same audit on new screens without me
- Time-to-audit reduced from days to minutes per screen set
- Methodology now reproducible across teams, projects, and future hires
- Composition patterns surfaced design system gaps that were invisible to per-component audits
Learnings
Methodology outlasts findings
The audit document will be out of date in six months. The methodology, codified as a tool, runs forever - and gets sharper every time someone uses it. Investing in the how compounds; investing only in the what depreciates.
AI is a force multiplier for design system work
Component inventory, pattern matching across dozens of screens, severity-tagging hundreds of findings - these are exactly the tasks AI is good at, and exactly the tasks that bottleneck design system teams. Codifying the methodology turns the AI into a junior designer who has already read the playbook.
Codification reveals gaps in your own process
Writing the methodology down forced decisions I had been making implicitly. When does a pattern variation count as a new pattern? What is the threshold between P1 and P2? When should a recommendation push toward an existing primitive instead of a new component? The tool would not work without explicit answers - so the tool forced me to settle them.
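As an illustration of what an explicit answer looks like, here is a sketch of a severity rule built on the three inputs the audit uses (instance count, app spread, workaround availability). Every cut-off below is an illustrative placeholder, not one of the audit's actual thresholds, and the `Finding` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    instances: int        # how many times the gap shows up
    apps: int             # how many apps are affected
    has_workaround: bool  # can teams ship without the fix?


def severity(f: Finding) -> str:
    """Map a finding to P0..P3. All cut-offs are illustrative placeholders."""
    if f.instances >= 100 and f.apps >= 5 and not f.has_workaround:
        return "P0"
    if f.instances >= 50 or (f.apps >= 3 and not f.has_workaround):
        return "P1"
    if f.instances >= 10 or f.apps >= 2:
        return "P2"
    return "P3"


print(severity(Finding("missing primitive", 800, 6, False)))  # P0
print(severity(Finding("one-off badge", 3, 1, True)))         # P3
```

Once the rule is written as code or as an unambiguous checklist, two people tagging the same finding should arrive at the same severity.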
Match team conventions, not your own
The first plugin version was elegant but ignored how the team already used Claude Code. The single-file version mirrored existing slash commands exactly - same file structure, same prose style, same British English, same no-em-dashes rule. Adoption was immediate. Tools that respect existing conventions get used; tools that introduce new ones get tolerated.