What We Look for in a Technical Audit

We get brought in to look at engineering systems for a variety of reasons. A startup preparing for a Series A and needing investors to trust the technology. A fintech that just had an incident and wants an external perspective. A company that’s grown and suspects the codebase hasn’t kept pace.

In every case, the brief is roughly the same: tell us what we’re not seeing from the inside.

After doing this across payments, credit, logistics, and operations systems, we’ve found that most codebases share the same set of underlying risks. The surface looks different. The root causes are often identical.

Here’s our framework.

The Things That Actually Put Businesses at Risk

We split a technical audit into four areas: reliability, security, observability, and scalability. Not because these are the only things that matter, but because degradation in these four areas is where businesses actually get hurt.

Reliability: What Happens When Something Goes Wrong

The first thing we look for isn’t code quality. It’s failure handling.

A payment system that works perfectly under normal conditions but loses transactions when a third-party API times out is not reliable. It’s fragile in exactly the scenario that will eventually happen.

We look at:

Error propagation. When a downstream dependency fails, does the error surface correctly? Or does the system silently swallow it and return a 200? Silent failures in payment systems are how businesses discover, months later, that they’ve been dropping a percentage of transactions.

Retry logic. Does the system retry failed operations? If so, are those operations idempotent? Retrying a payment charge without an idempotency key can double-charge customers. We’ve seen this in production systems.

Circuit breakers. When a dependency is consistently failing, does the system isolate it quickly or continue sending requests into the void, blocking threads and degrading everything else?

Data consistency. In distributed systems with multiple writes, what happens when one write succeeds and the next fails? Are there cleanup mechanisms? Or does the database accumulate orphaned records?
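To make the retry point concrete, here is a minimal Python sketch. It assumes a payment provider that deduplicates charges by idempotency key; FakeProvider, ProviderTimeout, and charge_with_retry are hypothetical names for illustration, not any real provider’s API. The key is generated once per logical payment and reused on every attempt, so a timeout followed by a retry cannot create a second charge.

```python
import uuid


class ProviderTimeout(Exception):
    """Raised when the (hypothetical) provider does not answer in time."""


class FakeProvider:
    """Stand-in for a payment provider that deduplicates by idempotency key."""

    def __init__(self, timeouts_before_success):
        self.timeouts_left = timeouts_before_success
        self.charges = {}  # idempotency key -> amount actually charged

    def charge(self, idempotency_key, amount_cents):
        # The charge is recorded at most once per key, even if the caller
        # never sees the response because of a timeout.
        self.charges.setdefault(idempotency_key, amount_cents)
        if self.timeouts_left > 0:
            self.timeouts_left -= 1
            raise ProviderTimeout("no response from provider")
        return {"status": "succeeded", "key": idempotency_key}


def charge_with_retry(provider, amount_cents, max_attempts=3):
    # One key per logical payment, generated once and reused on every attempt.
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            return provider.charge(key, amount_cents)
        except ProviderTimeout as exc:
            last_error = exc  # a real client would back off before retrying
    # Surface the failure to the caller instead of swallowing it.
    raise last_error


provider = FakeProvider(timeouts_before_success=2)
result = charge_with_retry(provider, amount_cents=5000)
```

In this run the first two attempts time out and the third succeeds, yet the provider holds a single charge. If the key were generated inside the loop instead, each retry would look like a new payment.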

Observability: Can You See What’s Happening?

The second area is often the most alarming to clients when we discuss it, because it feels abstract until something goes wrong.

We ask: if your payment processing rate dropped by 20% at 11pm tonight, how long would it take your team to know? How long would it take to identify which service, which endpoint, and which specific error?

The answers typically fall into one of three categories:

They’d get a customer complaint first (hours)
They’d see it in a dashboard (minutes to hours, depending on alert configuration)
They’d have a trace showing the exact failure within seconds of it starting (rare, and what good looks like)

We look at whether logs are structured and searchable, whether metrics cover the critical user journeys, and whether distributed tracing is in place across service boundaries. We also look at alert configuration: do alerts fire on symptoms (elevated error rate) or only on causes (high CPU)?
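As an illustration of a symptom-based alert, the sketch below tracks the error rate over a sliding time window and fires when it crosses a threshold. The class name ErrorRateAlert, the 60-second window, the 5% threshold, and the minimum request count are all assumptions chosen for the example; in practice this logic usually lives in a monitoring system rather than in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window_seconds` exceeds a threshold."""

    def __init__(self, window_seconds=60, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # avoid alerting on a handful of requests
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        self.events.append((timestamp, succeeded))
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def error_rate(self):
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def should_fire(self):
        enough_traffic = len(self.events) >= self.min_requests
        return enough_traffic and self.error_rate() > self.threshold


alert = ErrorRateAlert()
# 100 requests, one every 0.5 s; every fifth request fails (20% error rate).
for i in range(100):
    alert.record(timestamp=i * 0.5, succeeded=(i % 5 != 0))
```

The alert is defined on what users experience (failed requests), so it fires regardless of whether the underlying cause is CPU, a bad deploy, or a failing dependency.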

Security: The Boring Stuff That Gets Skipped

Security in application systems is less about exotic vulnerabilities and more about whether the basics have been done consistently.

We look at:

Secret management. Are credentials in environment variables, or in the codebase? We find hardcoded secrets lingering in Git history more often than anyone would expect.

Authentication boundaries. Are internal service-to-service calls authenticated? Or does the internal network act as an implicit trust boundary? Lateral movement in a breach is much harder when every service validates its callers.

Input validation. Is user input validated at the boundary, or does it get sanitised somewhere in the middle of a function chain? The further from the entry point validation happens, the more likely something slips through.

Dependency hygiene. When were dependencies last audited? Are there CVEs in packages that haven’t been updated in 18 months?
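To show what validating at the boundary can look like, here is a small Python sketch. The TransferRequest shape, the field names, and the currency allow-list are hypothetical; the point is that raw input is checked once at the entry point and everything downstream receives a typed, already-validated object.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative allow-list


class ValidationError(ValueError):
    """Raised when a request payload fails validation at the boundary."""


@dataclass(frozen=True)
class TransferRequest:
    account_id: str
    amount: Decimal
    currency: str


def parse_transfer_request(payload: dict) -> TransferRequest:
    """Validate raw input once, at the entry point, and return a typed object."""
    account_id = payload.get("account_id")
    if not isinstance(account_id, str) or not account_id.isalnum():
        raise ValidationError("account_id must be an alphanumeric string")

    try:
        amount = Decimal(str(payload.get("amount")))
    except InvalidOperation:
        raise ValidationError("amount must be a decimal number") from None
    # Reject NaN and Infinity as well as zero and negative amounts.
    if not amount.is_finite() or amount <= 0:
        raise ValidationError("amount must be a positive, finite number")

    currency = payload.get("currency")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValidationError("unsupported currency")

    return TransferRequest(account_id, amount, currency)


request = parse_transfer_request(
    {"account_id": "acc123", "amount": "12.50", "currency": "USD"}
)
```

Functions further down the chain can then accept a TransferRequest rather than a raw dictionary, which makes it harder for unvalidated input to reach them.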

Scalability: What Breaks First Under Load

We’re not looking for whether the system can handle 10× current load. We’re looking for whether anyone has thought about what breaks first — and whether they’re right.

The common failure modes we see:

Database as bottleneck. Application servers can scale horizontally. Databases typically can’t, or can only do so with significant complexity. Systems with no connection pooling, no read replicas, and no query optimization are one traffic spike away from a database that becomes a wall.

Missing indexes. This is the issue we find most often across audits. A query that runs in 40ms on the current dataset by scanning the whole table will take roughly 4 seconds on data that’s 100× larger, because scan time grows with row count. The query is in production, the index is missing, and it’s a time bomb.

Synchronous processing of things that should be async. Sending an email in the HTTP request path. Generating a PDF synchronously when a user downloads it. Calling a third-party API inside the transaction boundary. These are performance cliffs waiting to happen.
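The missing-index problem is easy to check for. The sketch below uses Python’s built-in sqlite3 module and an invented payments table to compare the query plan before and after adding an index. The table, column, and index names are assumptions for illustration, and other databases expose the same information through their own EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO payments (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i) for i in range(10_000)],
)

QUERY = "SELECT amount FROM payments WHERE customer_id = ?"


def query_plan(connection):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan every row to find one customer's payments.
plan_without_index = query_plan(conn)

conn.execute("CREATE INDEX idx_payments_customer ON payments (customer_id)")

# With the index, the plan should switch to a search using idx_payments_customer.
plan_with_index = query_plan(conn)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan varies between SQLite versions, but the first plan describes a scan of the table and the second names the index. Running this kind of check against the slowest production queries is usually one of the cheapest wins in an audit.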

What a Healthy Codebase Looks Like

The best systems we’ve reviewed share a few characteristics that have nothing to do with which language or framework they use:

Failures are handled explicitly and logged clearly. The developer who wrote the code thought about what happens when it doesn’t work.
External dependencies are treated as unreliable by default. Timeouts are configured. Retries are idempotent. Fallbacks exist.
The critical path is instrumented. Someone on the team can answer “what is our payment success rate right now?” in under 30 seconds.
The database schema reflects the query patterns. Indexes exist. Queries are reviewed before deployment.

The worst systems have the inverse of all of these. And the interesting thing is that the worst systems aren’t always the oldest or the most complex. We’ve seen three-year-old codebases with excellent engineering habits and six-month-old codebases with every risk factor present.

The difference is usually whether engineering quality was treated as a discipline from the beginning, or whether it was deferred indefinitely in the name of speed.

If You’re Wondering Whether You Need an Audit

If any of the following are true, the answer is probably yes:

You’ve had a production incident in the last 12 months that took more than 2 hours to diagnose
Your team regularly discovers bugs that have been silently affecting users for weeks
You’re preparing for growth — fundraising, a major product launch, or onboarding enterprise customers
Engineers on your team regularly say “I’m afraid to touch that part of the codebase”

An audit isn’t a judgment. It’s a map. Most teams know something isn’t right. The value is in being specific about what and why — and in giving you a prioritised list of what to fix, not just a list of what’s wrong.

Ellomas Technologies conducts technical audits for engineering teams in fintech, credit, and operations. If you’d like to understand what’s actually at risk in your system, reach out.