Building Trading Infrastructure: Lessons from Middleware Design
Trading software occupies an unusual corner of engineering: a domain where a bug does not crash a page or corrupt a report, but buys things — irreversibly, with real money, at machine speed. Four years of operating as a quantitative trader, followed by designing and building GIDEON's middleware layer — 73,000+ lines of production code, twelve months of continuous validation against live CME infrastructure — taught me that the discipline this domain demands is less about algorithms than about a handful of unglamorous architectural commitments. This article is those commitments, written down.
Lesson 1: The system's real job is knowing the truth
The naive mental model of a trading system is decide → send orders. The production reality is that the hardest, most safety-critical function is neither: it is state reconciliation — maintaining, at all times, a correct answer to "what are my positions and working orders, actually?" Every downstream function (risk checks, new decisions, kill switches) consumes that answer; if it is wrong, everything built on it is wrong at full speed.
The engineering consequence is that internal state must never be trusted on faith. The exchange's execution-report stream is the ground truth; your system's beliefs are a cache of it, continuously reconciled: on every fill, after every reconnect, and on a periodic schedule regardless. The FIX session layer described elsewhere exists precisely to make that reconciliation possible after failures, because its sequence numbers and resend requests let a reconnecting client detect and recover the messages it missed. Infrastructure that skips reconciliation is choosing to guess about money.
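The cache-versus-ground-truth idea can be sketched in a few lines. This is a minimal illustration, not GIDEON's implementation: the `reconcile` function, the `Discrepancy` record, and the symbols `ES` and `NQ` are hypothetical names chosen for the example, and positions are reduced to a signed integer per symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrepancy:
    symbol: str
    cached_qty: int   # what the system believed
    venue_qty: int    # what the venue reported


def reconcile(cache: dict[str, int], venue: dict[str, int]) -> list[Discrepancy]:
    """Compare the local position cache with venue-reported positions.

    Every mismatch is returned for alerting, and the cache is then
    overwritten so that it agrees with the venue (the ground truth).
    """
    diffs = []
    for symbol in sorted(set(cache) | set(venue)):
        cached_qty = cache.get(symbol, 0)
        venue_qty = venue.get(symbol, 0)
        if cached_qty != venue_qty:
            diffs.append(Discrepancy(symbol, cached_qty, venue_qty))
    cache.clear()
    cache.update({s: q for s, q in venue.items() if q != 0})
    return diffs


cache = {"ES": 2, "NQ": 1}   # hypothetical internal beliefs
venue = {"ES": 3}            # hypothetical venue report
for d in reconcile(cache, venue):
    print(d)
```

The design choice worth noting is the direction of the overwrite: the venue's report always wins, and the mismatch list exists so that a human or an alerting system learns that the cache had drifted.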
Lesson 2: Design the failures before the features
Every component of a trading path fails eventually: the data feed gaps, the broker session drops, a process restarts mid-order. The difference between an incident and a catastrophe is whether the failure mode was chosen in advance. For each failure class, the design must answer, in writing, before it happens: what does the system know, what does it assume, and what does it do? Does a disconnect trigger cancel-on-disconnect at the venue? Does a feed gap freeze new decisions? Does a restart begin with reconciliation before any order can be emitted?
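One way to make those written answers executable is a gate that every new order must pass, with one flag per failure class. The sketch below is an assumption-laden illustration: `OrderGate` and its method names are hypothetical, and venue-side cancel-on-disconnect is assumed to be configured separately at the exchange rather than modeled here.

```python
class OrderGate:
    """Tracks the preconditions that must all hold before a new order is emitted."""

    def __init__(self) -> None:
        # A fresh process (including a restart mid-order) knows nothing yet,
        # so every precondition starts out false.
        self.session_up = False
        self.reconciled = False
        self.feed_ok = False

    def on_session_up(self) -> None:
        self.session_up = True
        self.reconciled = False   # a new session always requires reconciliation

    def on_session_drop(self) -> None:
        self.session_up = False
        self.reconciled = False   # state may have changed while disconnected

    def on_reconciled(self) -> None:
        if self.session_up:       # reconciliation only counts on a live session
            self.reconciled = True

    def on_feed_gap(self) -> None:
        self.feed_ok = False      # freeze new decisions on stale market data

    def on_feed_recovered(self) -> None:
        self.feed_ok = True

    def may_send_new_order(self) -> bool:
        return self.session_up and self.reconciled and self.feed_ok


gate = OrderGate()
gate.on_session_up()
gate.on_feed_recovered()
print(gate.may_send_new_order())  # False: not yet reconciled
gate.on_reconciled()
print(gate.may_send_new_order())  # True
```

Independent flags are used instead of a single state enum so that overlapping failures (for example, a feed gap that occurs during reconciliation) cannot be lost when one of them clears.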
This is also the correct frame for the risk layer. Pre-trade checks, position caps, loss floors, and the four-tier kill-switch ladder (pause, reduce, flatten, emergency stop) are not features bolted onto a trading system; they are the failure-mode design for the most important failure of all: the strategy itself being wrong, whether by bug, bad parameter, or bad day. Knight Capital's loss of more than $460 million in roughly 45 minutes on August 1, 2012 stands as the permanent case study for what "we'll add controls later" costs.
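A fixed-cost pre-trade check combining a position cap, a loss floor, and the kill-switch ladder might look like the sketch below. The tier names come from the text; their exact semantics are an assumption made for this example (any tier from pause upward rejects orders that do not reduce exposure, and emergency stop rejects everything), and `RiskLimits`, `KillLevel`, and `pre_trade_check` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class KillLevel(IntEnum):
    NORMAL = 0
    PAUSE = 1
    REDUCE = 2
    FLATTEN = 3
    EMERGENCY_STOP = 4


@dataclass(frozen=True)
class RiskLimits:
    max_position: int   # cap on absolute net position
    loss_floor: float   # most negative P&L tolerated, e.g. -5000.0


def pre_trade_check(order_qty: int, position: int, pnl: float,
                    limits: RiskLimits, level: KillLevel) -> tuple[bool, str]:
    """Return (accepted, reason) for a signed order quantity.

    Assumed semantics: an order "reduces" risk when it shrinks the
    absolute net position. Reducing orders pass every check except
    the emergency stop.
    """
    new_position = position + order_qty
    reduces = abs(new_position) < abs(position)
    if level >= KillLevel.EMERGENCY_STOP:
        return False, "emergency stop"
    if reduces:
        return True, "ok (risk-reducing)"
    if level >= KillLevel.PAUSE:
        return False, "kill switch active"
    if pnl <= limits.loss_floor:
        return False, "loss floor breached"
    if abs(new_position) > limits.max_position:
        return False, "position cap"
    return True, "ok"


limits = RiskLimits(max_position=10, loss_floor=-5000.0)
print(pre_trade_check(3, 9, 0.0, limits, KillLevel.NORMAL))
print(pre_trade_check(-3, 9, -6000.0, limits, KillLevel.NORMAL))
```

Every branch is a constant-time comparison, so the cost of the check does not vary with load or with the number of open orders.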
Lesson 3: Prefer boring determinism to clever speed
For every system outside the microsecond arms race — which is to say, nearly every system — the property worth engineering for is not minimal latency but bounded, predictable behavior: fixed-cost risk checks, bounded queues, pre-allocated resources, latency measured at p99.9 rather than on average. A pipeline that is reliably fast-enough beats one that is usually faster but occasionally pauses at the worst moment, because in markets the worst moments are correlated: your load spikes exactly when everyone's does. Determinism has a second dividend — deterministic systems are testable and replayable, and replaying recorded production message streams against new builds is the single highest-value testing practice in this domain.
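To see why the tail matters more than the average, consider a small worked example. The latency figures below are invented for illustration, and the percentile uses the nearest-rank definition (one of several common conventions).

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 9,980 requests at 50 microseconds and
# 20 requests that stalled for 20,000 microseconds.
latencies_us = [50.0] * 9980 + [20000.0] * 20

mean_us = sum(latencies_us) / len(latencies_us)
print(f"mean  = {mean_us:.1f} us")
print(f"p50   = {percentile(latencies_us, 50):.1f} us")
print(f"p99   = {percentile(latencies_us, 99):.1f} us")
print(f"p99.9 = {percentile(latencies_us, 99.9):.1f} us")
```

In this data set the mean is about 90 microseconds and both the median and p99 are 50 microseconds, while p99.9 is 20,000 microseconds. A dashboard that reports only the first three would describe a healthy pipeline; the fourth shows the pause that arrives at the worst moment.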
Lesson 4: The audit trail is a component, not a report
Everything this series argues about recordkeeping culminates in an architectural rule: logging must be in-line and structural — every signal, order event, risk decision, and reject written to an append-only record by the same path that processes it, with synchronized timestamps, rooted at the originating signal. Bolted-on logging drifts from reality; structural logging is a second, immutable copy of reality. It then pays for itself thrice over — regulatory defensibility, incident forensics, and the research dataset that keeps backtests honest.
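A minimal sketch of what "in-line and structural" can mean in code follows. It assumes an in-memory list standing in for durable append-only storage and an injectable clock standing in for a synchronized time source; `AuditLog`, `handle_signal`, and the record field names are hypothetical.

```python
import json
import time
from typing import Callable


class AuditLog:
    """Append-only record; entries are serialized once and never rewritten."""

    def __init__(self, clock: Callable[[], int] = time.time_ns) -> None:
        self._lines: list[str] = []
        self._clock = clock

    def append(self, signal_id: str, kind: str, **fields) -> None:
        record = {"seq": len(self._lines), "ts_ns": self._clock(),
                  "signal_id": signal_id, "kind": kind, **fields}
        self._lines.append(json.dumps(record, sort_keys=True))

    def trace(self, signal_id: str) -> list[dict]:
        """Every event rooted at one originating signal, in order."""
        records = (json.loads(line) for line in self._lines)
        return [r for r in records if r["signal_id"] == signal_id]


def handle_signal(log: AuditLog, signal_id: str, qty: int, max_qty: int) -> bool:
    """The processing path writes the log itself; there is no separate logger
    that could drift from what actually happened."""
    log.append(signal_id, "signal", qty=qty)
    if abs(qty) > max_qty:
        log.append(signal_id, "risk_reject", reason="size cap")
        return False
    log.append(signal_id, "risk_accept")
    log.append(signal_id, "order_sent", qty=qty)
    return True


log = AuditLog()
handle_signal(log, "sig-1", qty=2, max_qty=5)
handle_signal(log, "sig-2", qty=9, max_qty=5)
print([r["kind"] for r in log.trace("sig-2")])
```

Because every record carries the originating `signal_id`, the same log answers a regulator's question ("what happened to this order?"), an incident question ("what did the risk layer decide?"), and a research question ("which signals were rejected?").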
Lesson 5: Know which layer you are building
The full stack — signal generation, aggregation, risk, order management, venue connectivity, records — is too much for almost any single team to build and should not be outsourced whole, because strategy is the part that must stay yours. The sane boundary, arrived at independently by most of the industry, is the middleware line: strategies and signals above it, owned by the trader; execution plumbing, risk enforcement, connectivity, and audit below it, built once, certified against the venue, and shared. Exchange certification (iLink and MDP conformance, in CME's world) is the formal expression of that boundary — the venue itself insists the plumbing be proven, message by message, before it touches production.
This is, transparently, the thesis GIDEON is built on — neutral middleware between signal sources and CME execution, with the risk layer and audit trail living at the chokepoint where every order must pass. But the principle stands independent of any product: discipline that must hold for every strategy belongs in a layer no strategy can bypass.
The summary heuristic
If one sentence must carry all five lessons, it is this: build the system assuming that everything around it (the signal, the network, the operator, the market) will occasionally be wrong, and make the infrastructure the part that is never surprised. Alpha is a hypothesis; infrastructure is a promise. Keep the promise, and the hypotheses get the longevity they need to prove themselves.
References
- SEC (2013). In the Matter of Knight Capital Americas LLC.
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly.
- Nygard, M. (2018). Release It! Design and Deploy Production-Ready Software, 2nd ed. Pragmatic Bookshelf.
- CME Group. iLink / MDP 3.0 certification and conformance documentation (cmegroup.com).
This article is educational material and does not constitute investment advice. Trading derivatives involves substantial risk of loss.