I woke up around 7:30 AM on Monday August 24th, checked my phone while lying in bed, and saw that I had lost thousands of dollars in my sleep. Great way to start the week…
I was not alone – few investors escaped “Black Monday” unscathed. The past several months have been full of sudden surges and declines in stock prices, and extreme volatility is apparently the “new normal” for global financial markets. Frequent and dramatic market swings put a high premium on access to real-time data. For securities traders, data processing speed and analytic latency can be the difference between getting burned and getting rich.
The challenge for trading and financial services companies is not only collecting real-time market data, but also making decisions based on that data in real time. Legacy SQL database technology lacked the speed, scale, and concurrency to ingest market data and execute analytical queries simultaneously. This forced companies to buy or build complex, specialized systems for analytics. However, now that commercially available database technology has caught up with the speed of financial markets, it is possible to use SQL to build sophisticated applications and analyze real-time financial data.
Simple Quote Generator
While we tend to talk about stocks as having a single, definitive price per share, the reality of securities trading is more complex. Buyers and sellers see different prices, known as the bid and ask prices, respectively. The public bid and ask prices are actually the best available prices chosen from among all bids placed by buyers and all asks offered by sellers.
Beyond getting the best available deal, trading firms have other incentives for tracking the best bid and ask prices. For one, the Securities and Exchange Commission requires that brokers always give their customers the best available price, known as the National Best Bid and Offer (NBBO).
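In code, resolving the best bid and ask from a set of quotes is just a max and a min over price. A minimal sketch, using hypothetical quote data (not part of the demo scripts):

```python
# Hypothetical quotes for one ticker: (exchange, price, size).
bids = [("NYS", 4995, 80), ("NASDAQ", 4994, 120), ("FRA", 4993, 60)]
asks = [("NYS", 4997, 50), ("NASDAQ", 4996, 99), ("TYO", 4998, 70)]

# Best bid is the highest price a buyer will pay;
# best ask is the lowest price a seller will accept.
best_bid = max(bids, key=lambda q: q[1])
best_ask = min(asks, key=lambda q: q[1])

print("best bid:", best_bid)  # → ('NYS', 4995, 80)
print("best ask:", best_ask)  # → ('NASDAQ', 4996, 99)
```

The challenge in practice is not this logic but running it continuously over millions of incoming quotes, which is where the database comes in.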
The script gen.py simulates a distribution of bid and ask quotes for fluctuating stock prices. The model for the generator is very simple and easily modified. The script treats each stock as a random walk. Every iteration begins at a base value, uses a probability distribution function to generate a spread of bids and asks, then randomly fluctuates the stock value. The entire generator is around 100 lines of Python.
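The core loop can be sketched roughly as follows. This is a simplified re-implementation of the model described above, not the actual gen.py; the quote counts and distribution parameters are illustrative assumptions:

```python
import random

def quote_stream(base_price=5000, iterations=3):
    """Yield (base, bids, asks) batches while the base price random-walks.

    Simplified sketch of the gen.py model: each iteration spreads
    bids below and asks above the current base value, then randomly
    fluctuates the base value.
    """
    price = base_price
    for _ in range(iterations):
        # Lognormal offsets keep most quotes close to the base price,
        # with an occasional far-out quote in the tail.
        bids = [price - round(random.lognormvariate(0, 1)) for _ in range(5)]
        asks = [price + round(random.lognormvariate(0, 1)) for _ in range(5)]
        yield price, bids, asks
        price += random.choice([-1, 1])  # random walk step
```

In the real script, each batch is inserted into MemSQL rather than yielded to the caller.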
Python interpretation is the bottleneck even when running against a small MemSQL cluster. Even so, the script achieves high enough throughput for a realistic simulation. As performance bottlenecks go, “the database is faster than the data generator” isn’t such a bad problem to have.
Bid and Ask prices for BLAH, generated using gen.py
Scale of the Market
The trading volume of the New York Stock Exchange (NYSE) is on the order of a few billion shares per day (most trading days in July saw between three and four billion shares change hands). Many deals are more complex than a straight cash-for-equity swap, and many shares can change hands in a given trade, so the number of trades is lower than the volume of shares traded. Note also that there can be many bid and ask offers that are never accepted, so the number of bid and ask offers is higher than the number of trades that actually occurred. While this information does not tell us precisely how many bid and ask quotes the NYSE processes per day, it gives us some clues as to the order of magnitude.
Running a modest cluster on EC2 (five nodes: one aggregator and four leaves, all m3.2xlarge instances), I could insert 280,000 to 290,000 records per second per aggregator, or roughly 6.75 billion quotes over the course of a trading day.
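As a back-of-envelope check, the per-day figure follows directly from the insert rate and the length of a NYSE trading session (assumed here to be the standard 6.5 hours):

```python
# Sanity check: sustained inserts over one trading day.
inserts_per_second = 285_000           # midpoint of the observed 280-290K/s
trading_day_seconds = 6.5 * 60 * 60    # 9:30 AM to 4:00 PM

quotes_per_day = inserts_per_second * trading_day_seconds
print(f"{quotes_per_day / 1e9:.2f} billion quotes per trading day")
# → 6.67 billion quotes per trading day
```

That puts a single aggregator in the same order of magnitude as the NYSE quote volume estimated above.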
Resolving Best Bid and Ask with SQL
Data loading is only half the story – with MemSQL, you can run SQL queries while loading huge volumes of data. The task of finding the best bid or ask quote becomes trivial with SQL.
```sql
SELECT ticker, ask_price, ask_size, exchange
FROM ask_quotes
WHERE ticker = 'BLAH'
ORDER BY ask_price ASC, ask_size DESC
LIMIT 5;

+--------+-----------+----------+----------+
| ticker | ask_price | ask_size | exchange |
+--------+-----------+----------+----------+
| BLAH   |      4996 |       99 | NASDAQ   |
| BLAH   |      4996 |       98 | FRA      |
| BLAH   |      4996 |       97 | FRA      |
| BLAH   |      4996 |       95 | TYO      |
| BLAH   |      4996 |       95 | FRA      |
+--------+-----------+----------+----------+
5 rows in set (0.00 sec)
```
This query returns the five best (least expensive) BLAH ask prices. On my EC2 cluster, the query finishes in milliseconds. Running the query repeatedly, you see the best ask price change over time (it’s easiest to see with the UP and DOWN tickers, which monotonically increase and decrease, respectively).
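A simple way to watch the best ask move is to poll the query in a loop. The sketch below assumes a DB-API cursor already connected to the cluster (connection parameters and the `ask_quotes` schema are as in the demo; the helper itself is hypothetical):

```python
import time

BEST_ASK_SQL = """
SELECT ticker, ask_price, ask_size, exchange
FROM ask_quotes
WHERE ticker = %s
ORDER BY ask_price ASC, ask_size DESC
LIMIT 1
"""

def watch_best_ask(cursor, ticker="UP", polls=5, interval=0.5):
    """Poll the best-ask query and print each result.

    `cursor` is any DB-API cursor connected to the MemSQL cluster;
    obtaining one (host, credentials) is left out of this sketch.
    """
    for _ in range(polls):
        cursor.execute(BEST_ASK_SQL, (ticker,))
        row = cursor.fetchone()
        if row:
            print("best ask:", row)
        time.sleep(interval)
```

With the UP ticker, each poll should return a strictly higher price than the last.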
The query above returns the five best ask prices across all exchanges. You can also find the best ask price per exchange in a single query.
```sql
SELECT ticker, ask_price, MAX(ask_size) AS ask_size, exchange
FROM ask_quotes
INNER JOIN (
    SELECT MIN(ask_price) AS ask_price, exchange, ticker
    FROM ask_quotes
    WHERE ticker = 'BLAH'
    GROUP BY exchange
) AS t2 USING (ticker, exchange, ask_price)
WHERE ticker = 'BLAH'
GROUP BY exchange
LIMIT 5;

+--------+-----------+----------+----------+
| ticker | ask_price | ask_size | exchange |
+--------+-----------+----------+----------+
| BLAH   |      4996 |       93 | NYS      |
| BLAH   |      4996 |       98 | FRA      |
| BLAH   |      4996 |       95 | TYO      |
| BLAH   |      4996 |       99 | NASDAQ   |
| BLAH   |      4996 |       94 | LON      |
+--------+-----------+----------+----------+
5 rows in set (0.02 sec)
```
We write the query as a join because multiple records can tie for the minimum ask price, and we want to return the one with the largest ask size. Notice that the results differ from those of the previous query, in which three of the five quotes came from the FRA exchange.
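The same tie-breaking logic is easy to express procedurally over an in-memory list of quotes, which makes the intent of the SQL clearer (hypothetical data; in the demo the database does this work server-side):

```python
# Hypothetical (exchange, ask_price, ask_size) quotes for one ticker.
quotes = [
    ("FRA", 4996, 98), ("FRA", 4996, 97), ("FRA", 4997, 99),
    ("NYS", 4996, 93), ("TYO", 4996, 95), ("TYO", 4998, 90),
]

# Best ask per exchange: lowest price, breaking ties by largest size.
best = {}
for exchange, price, size in quotes:
    current = best.get(exchange)
    if current is None or (price, -size) < (current[0], -current[1]):
        best[exchange] = (price, size)

print(best)
# → {'FRA': (4996, 98), 'NYS': (4996, 93), 'TYO': (4996, 95)}
```

The subquery in the SQL finds the minimum price per exchange, and the outer `MAX(ask_size)` plays the role of the size tie-breaker here.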
In the past, single-server, disk-based database architectures struggled with concurrent reads and writes, forcing developers to build specialized systems for real-time applications. With a modern system like MemSQL, much of the complex application logic can be pushed into the database. In fact, finding the best bid and ask is just the start of what you can do with MemSQL and market data.
Playing (with) the Market
With the ability to ingest real-time data and run queries simultaneously, you can interactively explore market trends as they develop. Moreover, because MemSQL is a database, rather than a narrower stream processing engine, you can access real-time and historical data through a single interface. This simplifies the process of building, for instance, a trading application that models historical data while also detecting and reacting to real-time market changes.
For example, regress.py is a simple real-time analytics application that samples “reasonable” asks from the 10K most recent quotes and performs a linear regression (minimizing mean-squared error) to estimate the way a given stock is trending.
```
$ python regress.py BLAH
Ticker: BLAH
Time: 1444676831.12
Slope: 0.000306657885407
R squared: 0.312113434918
Standard error: 2.28775085195e-05
```
The database is doing most of the work here – ingesting, sorting, and sampling – which allows the script to finish the end-to-end regression computation several times a second. Because the quote data is updating in real time, the slope of the regression line represents the direction that the stock is trending right now. It’s a real-time predictive application in 40 lines of Python!
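The regression step itself is short. The sketch below stands in for that part of regress.py (which is not reproduced here), assuming the sampled quotes have already been fetched from the database as (timestamp, price) pairs:

```python
def trend(samples):
    """Least-squares slope and R^2 for (timestamp, ask_price) samples.

    Stands in for the regression step of regress.py; in the demo,
    the samples come from the 10K most recent "reasonable" asks.
    """
    n = len(samples)
    xs, ys = zip(*samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = sxy * sxy / (sxx * syy) if syy else 1.0
    return slope, r_squared

# On a perfectly linear series, the fit is exact.
slope, r2 = trend([(0, 10), (1, 12), (2, 14), (3, 16)])
print(slope, r2)  # → 2.0 1.0
```

A positive slope with a reasonable R² says the stock is trending up over the sampled window; on random-walk data, R² stays near zero.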
You can also tweak the generator parameters to get different bid/ask spreads and rates of change. For instance, if you decrease the first parameter in the lognormvariate function, it will narrow the bid and ask spread. For a fun, easy extension of the demo, you could write an application that takes advantage of arbitrage opportunities.
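The effect of that first parameter (the lognormal μ) is easy to see in isolation: lowering μ shrinks the typical offset from the base price, which narrows the spread. A standalone illustration, not a change to gen.py itself (the σ value and trial count are arbitrary):

```python
import random

def mean_offset(mu, sigma=0.5, trials=10_000):
    """Average lognormal offset for a given mu -- a proxy for spread width."""
    rng = random.Random(42)  # fixed seed so the comparison is repeatable
    return sum(rng.lognormvariate(mu, sigma) for _ in range(trials)) / trials

wide = mean_offset(mu=1.0)
narrow = mean_offset(mu=0.0)
print(f"mu=1.0 mean offset: {wide:.2f}")
print(f"mu=0.0 mean offset: {narrow:.2f}")
```

Since the mean of a lognormal distribution is exp(μ + σ²/2), decreasing μ shrinks every offset multiplicatively.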
Playing to Your Strengths
If you let the generator and regression script run for a minute, you will see that the model can produce alarmingly poor coefficients of determination and standard errors. This makes sense: a random walk fluctuates the base price at random, so the data is essentially all noise.
If you work in finance, presumably you know more about what drives fluctuation in stock prices than I do, and can create a more sophisticated model. Or, better yet, you can analyze real trade data. The beauty of MemSQL is that, whether you are an expert in distributed systems or not, you can leverage the power of tens or hundreds of servers through a familiar SQL interface. Instead of spending time and resources developing analytics infrastructure, you can focus on actually doing analytics.