Polymarket Statistical Arbitrage Bot
Building an autonomous system to detect mispricings across 8,000+ prediction markets
Python · TimescaleDB · ChromaDB · Linear Programming · asyncio
What I'm Building
I'm working on a trading system that monitors Polymarket prediction markets for arbitrage opportunities. The core idea comes from academic research on combinatorial arbitrage: prediction markets frequently misprice logical relationships between events, and if you can detect these inconsistencies systematically, you can profit from them.
The system currently runs 24/7 on a Hetzner VPS in paper trading mode. I'm collecting data, testing the detection pipeline, and refining the execution logic before putting real money in.
The Problem
Prediction market arbitrage is weird. Unlike equities where you're comparing the same asset across exchanges, here you're looking for logical inconsistencies between different contracts.
Say there are three markets: "Candidate A wins," "Candidate B wins," and "Neither A nor B wins." These outcomes are mutually exclusive and exhaustive, so their prices should sum to 1. When they don't, by more than fees, that's free money. But finding these relationships across 8,000+ markets with free-text titles is the hard part. You need to understand what each market is actually asking, not just look at prices.
The types of violations I'm scanning for:
Partition violations where a group of mutually exclusive outcomes doesn't sum correctly.
Subset violations where something like P(wins by more than 10) is priced higher than P(wins by more than 5), which is logically impossible.
Complementary violations where P(A) + P(not A) doesn't equal 1 when both sides are traded as separate contracts.
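Given YES prices for each contract, the three checks reduce to simple inequalities. A sketch, where `TOL` is a made-up slack standing in for fees and noise, not the bot's tuned thresholds:

```python
TOL = 0.02  # assumed slack for fees and price noise; illustrative only

def partition_violated(prices, tol=TOL):
    """Mutually exclusive, exhaustive outcomes: YES prices should sum to ~1."""
    return abs(sum(prices) - 1.0) > tol

def subset_violated(p_narrow, p_broad, tol=TOL):
    """The narrower event (e.g. 'wins by >10') can never be dearer than
    the broader one it implies (e.g. 'wins by >5')."""
    return p_narrow > p_broad + tol

def complement_violated(p_a, p_not_a, tol=TOL):
    """P(A) + P(not A) should be ~1 when both sides trade as separate contracts."""
    return abs((p_a + p_not_a) - 1.0) > tol
```

Each predicate firing is only a candidate signal; fees and execution risk still have to be netted out before it counts as an opportunity.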
How It Works
Data Collection
I'm using TimescaleDB (PostgreSQL with time-series extensions) because I need both fast time-series queries and relational joins in the same query. Correlating market metadata with price history requires both.
Five async collectors run continuously: market metadata from the Gamma API, periodic price snapshots, full orderbook depth, a websocket stream of trades, and resolution tracking for markets that close. Everything goes into hypertables with compression and continuous aggregates for hourly rollups.
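A minimal sketch of what the snapshot path could look like; the table, column, and view names here are hypothetical, not the bot's actual schema:

```python
# Hypothetical schema for the price-snapshot collector.
PRICE_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS price_snapshots (
    ts          TIMESTAMPTZ NOT NULL,
    market_id   TEXT        NOT NULL,
    yes_price   DOUBLE PRECISION,
    no_price    DOUBLE PRECISION
);
SELECT create_hypertable('price_snapshots', 'ts', if_not_exists => TRUE);
"""

# Hourly rollup as a TimescaleDB continuous aggregate.
HOURLY_AGG_DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS price_hourly
WITH (timescaledb.continuous) AS
SELECT market_id,
       time_bucket('1 hour', ts) AS bucket,
       avg(yes_price) AS avg_yes
FROM price_snapshots
GROUP BY market_id, bucket;
"""

async def insert_snapshot(pool, market_id, yes_price, no_price):
    """One collector write; `pool` is an asyncpg connection pool."""
    await pool.execute(
        "INSERT INTO price_snapshots (ts, market_id, yes_price, no_price) "
        "VALUES (now(), $1, $2, $3)",
        market_id, yes_price, no_price,
    )
```

The `create_hypertable` call and the `timescaledb.continuous` view are the two Timescale features the text refers to: hypertables for the raw stream, continuous aggregates for the hourly rollups.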
The server is in Frankfurt because Polymarket blocks US IPs. I learned this the hard way during development.
Finding Related Markets
I embed all market titles using sentence transformers (all-mpnet-base-v2) and store them in ChromaDB. When I want to find markets that might be logically related, I do an approximate nearest neighbor search to get candidates, then filter by whether they could plausibly resolve around the same time.
This cuts the search space dramatically. Comparing all pairs of 8,000 markets is O(n²), about 32 million pairs, which is too slow. The embedding search plus temporal filtering gets me to a manageable candidate set.
The similarity threshold (0.65 cosine) took some tuning. Too low and you get tons of false positives. Too high and you miss relationships between differently worded markets about the same event.
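ChromaDB handles the approximate nearest-neighbor part; the score it approximates, plus the temporal filter, looks roughly like this (the 14-day resolution window is a made-up value for illustration, not the bot's actual filter):

```python
import math

SIM_THRESHOLD = 0.65   # the tuned cosine cutoff from the post
MAX_GAP_DAYS = 14      # hypothetical window; the real temporal filter isn't specified

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def is_candidate_pair(emb_a, emb_b, close_a, close_b):
    """Keep a pair only if the title embeddings are similar enough AND the
    two markets plausibly resolve around the same time (close dates given
    as days since some epoch, illustratively)."""
    if cosine(emb_a, emb_b) < SIM_THRESHOLD:
        return False
    return abs(close_a - close_b) <= MAX_GAP_DAYS
```

In practice the ANN index returns the top-k neighbors directly instead of scoring every pair, which is what makes the O(n²) comparison avoidable.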
Classification
Once I have candidate pairs that are semantically similar, I need to figure out what logical relationship they have. The current approach is simple: look at the prices.
If two markets sum to roughly 1, they're probably complementary (same event, opposite sides). If they sum to less than 1, they're probably exclusive (can't both happen). If there's a big price gap in one direction, one might be a subset of the other.
This sounds naive but it works because the embedding search already filtered for semantic similarity. By the time classification runs, the question is just "given these prices, what constraint should hold?" which is arithmetic.
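A sketch of that arithmetic, with an assumed `tol` slack rather than the bot's tuned thresholds:

```python
def classify_pair(p_a, p_b, tol=0.05):
    """Guess the logical relationship between two semantically similar
    markets from prices alone. `tol` is an illustrative slack value."""
    total = p_a + p_b
    if abs(total - 1.0) <= tol:
        return "complementary"   # likely same event, opposite sides
    if total < 1.0 - tol:
        return "exclusive"       # likely can't both happen
    if abs(p_a - p_b) > 2 * tol:
        return "subset"          # one event may imply the other
    return "unknown"
```

The labels then decide which constraint the LP stage should enforce for that group.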
I did build an LLM classifier using DeepSeek and GPT-4o-mini with prompt caching, but it's sitting unused. The rule based approach is 200x faster and doesn't cost anything per query. I might wire the LLM in later as a validation layer for edge cases.
Detecting Arbitrage
This is the core of the system. For each group of related markets, I set up a linear program to check if the prices are consistent with the relationship constraints.
If the LP is infeasible, that means the prices violate the logical constraints and there's an arbitrage. The solver also gives me the optimal trade sizes for each leg. It's about 30 lines of scipy code and runs in under a millisecond per group.
I subtract fees (maker and taker separately) from the edge calculation before flagging anything as an opportunity. A 2 cent arbitrage that costs 3 cents in fees isn't actually an opportunity.
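For a partition group, the feasibility check can be sketched with `scipy.optimize.linprog`: look for a probability vector pinned inside each contract's quoted spread that sums to 1, and treat infeasibility as the arbitrage flag. The bid/ask inputs and the fee-free framing here are simplifications of what the text describes:

```python
import numpy as np
from scipy.optimize import linprog

def partition_arb(bids, asks):
    """Feasibility check for a mutually exclusive, exhaustive group.

    Look for probabilities q_i with bid_i <= q_i <= ask_i and sum(q) = 1.
    If no such q exists, the quotes violate the partition constraint and
    an arbitrage exists. A sketch of the idea, ignoring fees and sizing.
    """
    n = len(bids)
    res = linprog(
        c=np.zeros(n),                     # pure feasibility: nothing to optimize
        A_eq=np.ones((1, n)), b_eq=[1.0],  # probabilities must sum to 1
        bounds=list(zip(bids, asks)),      # each q_i pinned inside its spread
        method="highs",
    )
    return res.status == 2                 # scipy status 2 means infeasible
```

For a flat partition this reduces to checking sum(bids) > 1 or sum(asks) < 1; the LP formulation earns its keep once the constraint sets get richer. Extracting the optimal trade sizes, as the post describes, needs the profit-maximizing variant rather than this bare feasibility check.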
Paper Trading
Polymarket doesn't have atomic multi-leg orders, which is a problem. In real trading, if I need to buy contract A and sell contract B to capture an arbitrage, I have to place those as separate orders. Prices can move between the two, turning a profitable trade into a loss.
My mitigation: after each leg fills, re-check if the remaining legs still have positive edge at current prices. If not, abort and try to unwind what I've already placed. The paper trading simulator models this with real orderbook depth so I can see how often the abort logic triggers.
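The re-check after each fill is just a fresh edge computation at current asks. A sketch using a flat assumed fee per leg (the real system models maker and taker fees separately):

```python
def remaining_edge(filled_cost, remaining_asks, n_filled=1, payout=1.0, fee=0.01):
    """After some legs fill, recompute the edge of completing the bundle at
    current prices. Illustrative: a 'buy every leg of a partition' trade
    that pays `payout` on resolution. `fee` is an assumed flat per-leg cost."""
    total_cost = filled_cost + sum(remaining_asks)
    fees = fee * (n_filled + len(remaining_asks))
    return payout - total_cost - fees

def should_abort(filled_cost, remaining_asks):
    """Abort and unwind if the legs still to be placed no longer clear fees."""
    return remaining_edge(filled_cost, remaining_asks) <= 0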
Position sizes use quarter Kelly because full Kelly has brutal drawdowns when your edge estimates are noisy, which they always are.
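For a binary contract bought at price c with estimated win probability p, the standard Kelly formula f* = (bp - q)/b with net odds b = (1 - c)/c simplifies to (p - c)/(1 - c); quarter Kelly stakes a quarter of that:

```python
def kelly_fraction(p_win, price):
    """Full Kelly stake fraction for buying a binary contract at `price`
    that pays 1 if the event happens, given estimated probability `p_win`.
    Kelly f* = (b*p - q)/b with net odds b = (1 - price)/price, which
    simplifies to (p_win - price) / (1 - price)."""
    if p_win <= price:
        return 0.0  # no edge, no bet
    return (p_win - price) / (1.0 - price)

def position_size(bankroll, p_win, price, kelly_mult=0.25):
    """Quarter Kelly by default: noisy edge estimates make full Kelly brutal."""
    return bankroll * kelly_mult * kelly_fraction(p_win, price)
```

With a 60% estimate against a 0.50 price, full Kelly is 20% of bankroll and quarter Kelly stakes 5%.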
Alerts and Risk
A Telegram bot sends alerts for detected opportunities, executions, daily summaries, and any errors. A circuit breaker halts trading if there are too many consecutive failures or the daily loss exceeds a threshold.
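Both halt conditions fit in a tiny state machine; the thresholds below are illustrative, not the bot's actual settings:

```python
class CircuitBreaker:
    """Halt trading after too many consecutive failures or once the daily
    loss passes a limit. Threshold values here are placeholders."""

    def __init__(self, max_failures=5, max_daily_loss=50.0):
        self.max_failures = max_failures
        self.max_daily_loss = max_daily_loss
        self.failures = 0       # consecutive failure count, reset on success
        self.daily_pnl = 0.0
        self.halted = False

    def record(self, ok, pnl=0.0):
        """Record one trade attempt; returns False once trading should halt."""
        self.failures = 0 if ok else self.failures + 1
        self.daily_pnl += pnl
        if self.failures >= self.max_failures or self.daily_pnl <= -self.max_daily_loss:
            self.halted = True
        return not self.halted
```

Resetting the failure counter on any success is what makes the condition "consecutive failures" rather than a lifetime total.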
Where I'm At
The data pipeline and detection logic work. I've been running the paper trading daemon and watching it go through scan cycles. The infrastructure is solid.
What's not done:
The signal thresholds (how big does a price discrepancy need to be before I act on it) are basically guesses right now. I implemented a mean reversion signal but haven't validated whether prediction market prices actually mean revert. They might just trend toward 0 or 1 as events resolve, which would make mean reversion a losing strategy.
Live trading isn't hooked up yet. Currently paper only.
The LLM classifier exists but isn't integrated.
The detector only handles 2-leg relationships right now, not bigger bundles.
What I've Learned
Building a signal is not the same as validating it. I have code that calculates z-scores and flags deviations, but whether that signal is actually profitable is a completely separate question I haven't answered yet.
LP feasibility checking is underrated. Most people use heuristics or threshold rules to check if prices are consistent. The LP approach gives you an exact mathematical answer and the optimal trade vector comes out for free.
Async Python has some nasty surprises. ChromaDB holds the GIL for about 100 seconds when it initializes, which froze my entire event loop until I figured out what was happening and added a workaround.
Rule based classifiers beat LLMs more often than you'd think. For a problem with a known taxonomy and learnable patterns, just write the rules. The LLM classifier I built is sitting unused because the simple version works fine and costs nothing.
Python, asyncio, TimescaleDB, ChromaDB, sentence-transformers, scipy, asyncpg, Telegram Bot API, Hetzner CPX31