At the center of every algorithm is data. No algorithm can perform the way you want it to without clean, live and reliable data, says experienced investor Daniel Calugar.
So much of algorithmic trading is focused on the strategies that are created to identify profitable opportunities in the market. This makes sense, of course, because without a good strategy, it doesn't matter what data you have; you would be hard-pressed to be consistently successful.
At the same time, though, the data you use is just as important as -- if not more important than -- the strategies you devise, the broker you use, and the computer hardware and software that power your system.
Algorithms are effective because they can process an enormous amount of information in a very short period of time -- a feat that simply isn't possible through manual human work. But you need to find and collect relevant, timely data for these algorithms to work correctly. Otherwise, they won't have any information to process.
Below, Dan Calugar will discuss effective strategies for managing market data in algorithmic trading. This will include data collection, storage, cleaning, and pre-processing techniques. Efficient data processing is vital to ensure accurate and timely trading decisions.
Collecting Data for Algorithmic Trading
Before your algorithms can start running and producing results, you need to find and collect the data that you'll input into them. But where do you get all the relevant data you need?
Luckily, there are many sources of relevant market data that you can use, and a lot of these sources are completely free. It's always a good idea to start with free data sources and then branch out from there. If you feel you need either more data or more in-depth data, you can consider purchasing a subscription to a data service.
Collecting data for algorithmic trading is actually a three-step process, which Daniel Calugar explains below.
Step 1: Figure Out the Type of Data You Want
Before signing up for accounts and pulling in data for your algorithms, you have to decide what type of data you want. It's always better to have more data than less data, but you also want it to be focused on the areas in which you are trading.
For instance, if you are focusing your algo trading on the stock market, you wouldn't want to pull in data that's only focused on cryptocurrencies. That may seem obvious, but homing in on the type of data you want will help you find the best sources for that data.
Step 2: Figure Out the Sources for the Data You Want
Once you've figured out the type of data you want, you should start signing up for accounts at the sources that will provide the relevant data.
Below, Daniel Calugar lists some of the top free data sources for three main segments of algorithmic trading -- stock market data, alternative data, and economic data.
Stock Market Data
Alpaca: A newer entrant to the market that caters specifically to algo traders. The user interface is easy to use, yet its application programming interface (API) is powerful, well tested, and well documented.
Yahoo Finance: This is one of the most well-known sources in the industry, and it's one of the most used sites for historical stock market data. You can download the data available here in CSV format. They also offer an API, but be careful: Yahoo doesn't officially maintain it.
Reuters: This source is a leader in financial markets around the world. Many of their datasets aren't free, but they're vast and very reliable.
Bloomberg: This source is similar to Reuters in terms of reputation, reliability, and variety. It offers a wealth of traditional data as well as financial news. Again, much of the data isn't free, but it's a great source to consider.
Polygon.io: Programmers rely heavily on this data vendor for various uses in finance. It has a great API with solid documentation and plenty of data on stocks, options, forex, and even cryptocurrency.
SimFin: This provider has a great API and a very intuitive interface. Its data is also very timely and reliable. If you aren't a programming specialist, you can download all the data here in bulk and then load it into a spreadsheet.
EODData: Users can download end-of-day data for a wide variety of companies on this site. Intraday prices cost extra, but what's free is very reliable.
Nasdaq Data: This site has perhaps the most extensive catalog of datasets. You have to pay for most of them, but you can filter through which ones you want and which you don't quite easily. You can also use the filter to find the free datasets.
Alternative Data
FlightRadar24: This site provides real-time data on flights around the world. Hedge funds use this type of data to predict where big deals might be made based on landings for private jets.
Finnhub: This fintech firm has a great library of alternative data available. Plus, they have a reliable API that's very consistent. They provide all typical industry data and alternative data such as senate lobbying, social sentiment, US Patent and Trademark Office grants and registrations, and more.
Tiingo: This site has a great news API, covering 65,000 equity tickers, 75 currencies, and more. It even has historical data that goes back nearly 30 years. You can get 500 requests each hour for free from this site, which will satisfy most traders.
Reddit: This forum site became popular for its WallStreetBets sub. The site offers a solid API that can allow you to pull in data related to social sentiment, which is becoming an increasingly reliable dataset.
Economic Data
FRED: The Federal Reserve Bank of St. Louis has an extensive database that's widely used in economic research and cited in multiple prestigious journals. Much of this data is even available for free, and it's all maintained quite well. You can get data on commodity prices, interest rates, census data, labor statistics, economic activity, and more.
Bureau of Labor Statistics: If you're after wage and employment data, this is the source for you. Daniel Calugar does warn that the data isn't very well documented and is limited in scope, but they do offer a reliable API.
Bureau of Economic Analysis: This agency provides datasets on savings and income, consumer spending, and investments, to name a few areas. There are even special topics that include data from economic activity related to culture and arts, space economy, and intellectual property. The API is usable but somewhat outdated.
World Bank: Journals all over the world cite the data that the World Bank provides. Not all of the datasets that they offer are regularly updated, but there's so much data that it's a must-have source.
Step 3: Start Collecting the Data You Want
With the type of data and the sources figured out, it's now time to start actually collecting the data. This will involve signing up for accounts at the sources you've chosen (if necessary) and deciding on the methods you'll use for collection.
You can obtain the data by downloading it all at once, connecting to an API that sends ongoing data, or through web scraping. Be sure to read the documentation at each of the sources you've chosen, and then set up your collection methods correctly so that you're getting all the data you want in the way that you want it.
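As a concrete sketch of the bulk-download route, here is how you might parse a CSV export into usable records with nothing but the Python standard library. The column names below are an assumption modeled on a typical Yahoo Finance daily-bar export, and the sample rows are made-up illustration data, not real quotes:

```python
import csv
import io

# Made-up rows in the shape of a typical Yahoo Finance CSV export
# (the column names are assumptions, not a guaranteed schema).
RAW_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2024-01-02,187.15,188.44,183.89,185.64,185.64,82488700
2024-01-03,184.22,185.88,183.43,184.25,184.25,58414500
"""

def load_bars(csv_text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bars = []
    for row in reader:
        bars.append({
            "date": row["Date"],
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
            "adj_close": float(row["Adj Close"]),
            "volume": int(row["Volume"]),
        })
    return bars

bars = load_bars(RAW_CSV)
print(len(bars), bars[0]["close"])
```

For a live API feed, the same conversion step applies; you would simply replace the hard-coded string with the response body from the vendor's endpoint, following that vendor's documentation.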
Storing Data for Algorithmic Trading
You are going to be collecting a vast amount of data for your algorithmic trading strategies, and you're going to need somewhere reliable to store it all. The place that you're going to store it -- and also take action with it -- is called your database.
If you're new to algorithmic trading, a database, in its simplest form, can be thought of as a hard drive. It's the physical place where the data will go once it's been collected via the various methods that you decided on.
Of course, a database is much more complex than just a hard drive. The database you use will not only store and retrieve the data but also let you query and analyze it.
There are many different options that you'll have for databases to use to store your data for your algorithms. There are some technical and non-technical considerations you should keep in mind when making this decision.
According to Daniel Calugar, the technical aspects you should consider are the ingest speed, the aggregation speed, the operations, the type of query language used, and how the data layout is optimized. There may also be other technical aspects that you'll want to consider, depending on how you or your programmers create the algorithms.
There are other requirements to consider that aren't related to technical aspects at all. These include the overall cost of setting up and running the database, whether you want ongoing support or just need solid documentation, and which operational tools the database offers.
Here are some of the most commonly used databases for storing and analyzing data for algo trading.
MySQL: One of the great parts about MySQL is that it is completely free and used by a large and varied community. This database has a huge number of use cases in multiple industries. It may fall short of your technical requirements in terms of speed, but it really depends on your particular situation and setup.
Flat Files: You can gain solid flexibility by using flat files rather than a traditional database. Many tools can operate efficiently on flat files, including Apache Spark, Python, and Amazon Redshift Spectrum.
TimescaleDB: This is an extension for PostgreSQL that partitions data into many underlying tables ("chunks") exposed through a single virtual table called a hypertable. This partitioning improves insert rates and gives predictable query times.
ClickHouse: This is a relatively new database on the market that has a lot of impressive features. The company is very transparent about its development process, maintaining a very active GitHub community. New releases arrive every few weeks with fixes, improvements, and new features.
MemSQL (now SingleStore): This database isn't free, but it's certainly solid. It uses a hybrid transactional/analytical processing architecture. The analytics are extremely fast, it provides great support, and it's SQL compliant. It's even MySQL compatible, meaning you can use many of the drivers and tools that MySQL provides.
This is just a small sampling of databases that are available for use in algorithmic trading. Ultimately, you need to decide what you want out of a database first so that you can narrow your search more efficiently.
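Whichever database you pick, the basic workflow is the same: define a schema for your bars, insert the collected rows, and run aggregations against them. A minimal sketch using Python's built-in SQLite module as a stand-in (the table layout and sample prices here are hypothetical, chosen only for illustration):

```python
import sqlite3

# In-memory SQLite database standing in for whichever database you choose.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_bars (
        symbol     TEXT    NOT NULL,
        trade_date TEXT    NOT NULL,
        close      REAL    NOT NULL,
        volume     INTEGER NOT NULL,
        PRIMARY KEY (symbol, trade_date)  -- one bar per symbol per day
    )
""")

# Made-up rows representing data collected from your chosen sources.
rows = [
    ("AAPL", "2024-01-02", 185.64, 82488700),
    ("AAPL", "2024-01-03", 184.25, 58414500),
    ("MSFT", "2024-01-02", 370.87, 25258600),
]
conn.executemany("INSERT INTO daily_bars VALUES (?, ?, ?, ?)", rows)

# A simple aggregation an algorithm might run: average close per symbol.
avg_close = dict(conn.execute(
    "SELECT symbol, AVG(close) FROM daily_bars GROUP BY symbol"
))
print(avg_close)
```

The composite primary key is one way to enforce the "one bar per symbol per day" invariant at the storage layer, so a re-run of your collection job can't silently duplicate rows.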
Cleaning Data for Algorithmic Trading
You have collected and stored your data. Now, it's time to clean it.
Data analysis in any form would be considered incomplete if it weren't "cleaned." When you pull in data from various sources, it's very possible that you will get some erroneous data. If you don't go through and clean it, then your computer models will assume that it's legitimate and, ultimately, produce outcomes that aren't favorable.
Dan Calugar points out that when you collect data from various sources, it is referred to as raw data. This type of data can be considered "unprocessed," meaning it's not yet information that you -- or a computer -- can actually use.
Raw data is extremely valuable because it can be processed and analyzed in various ways. But, raw data is only the starting point for algorithmic trading.
A good way to think of raw data is how the human brain processes simple forms of it. A standard traffic light has red, yellow, and green lights from top to bottom. These colors and the traffic light itself can be considered raw data.
Without your brain processing the raw data, you are just staring at an object hanging down over the road with three different colors on it. In other words, it has no meaning.
Your brain has been trained to "clean" this data and process it into useful information. You know that red means stop, yellow means proceed with caution, and green means go.
Computers can't intuitively process raw data the way your brain can. That's why raw data, all by itself, generally isn't considered very useful. It must be processed or "cleaned" in order to become useful.
There are many different ways that data can be cleaned. It can include parsing the data so that computers can ingest it more easily, removing potential outliers and erroneous values, and sometimes translating or re-formatting records so they all share the same format.
The process of cleaning your data will ensure that it's organized and accurate. This will be essential for your algorithms to perform as well as you want them to perform. It'll help you save time, increase productivity, streamline functions, and improve decision-making.
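One common robust technique for the outlier-removal part of cleaning is a median-absolute-deviation (MAD) filter, which is less easily skewed by the very outliers it is trying to catch than a mean-based filter. A sketch on made-up closing prices (the cutoff of 3.5 is a conventional choice, not a rule):

```python
import statistics

# Hypothetical raw closing prices: one missing value (None) and one
# fat-finger outlier (1856.4) have slipped into the feed.
raw_closes = [185.6, 184.2, None, 186.1, 1856.4, 185.0, 183.9]

def clean_closes(values, cutoff=3.5):
    """Drop missing values, then drop points whose modified z-score
    (based on the median absolute deviation) exceeds the cutoff."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    abs_dev = [abs(v - med) for v in present]
    mad = statistics.median(abs_dev)
    if mad == 0:            # all points identical: nothing to filter
        return present
    return [v for v, d in zip(present, abs_dev)
            if 0.6745 * d / mad <= cutoff]

cleaned = clean_closes(raw_closes)
print(cleaned)
```

Here the missing value and the fat-finger entry are dropped while the ordinary prices survive; a simple mean/standard-deviation filter can miss such an extreme point because the outlier inflates the standard deviation it is measured against.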
Pre-Processing Data for Algorithmic Trading
The final thing to consider for managing market data for algorithmic trading, according to Daniel Calugar, is pre-processing the data. This step goes hand in hand with data cleaning: both involve taking your raw data and converting it so that it's ready for your algorithms to actually use.
Data cleaning, in essence, is the first step in this -- removing all the "junk" and "dirt" from the raw data. Then, pre-processing involves transforming the data to whatever format you're going to use so that your model can easily understand it.
Data pre-processing is essential for producing accurate, trusted, and understandable data. The algorithm requires accurate input to function properly, you need confidence in the legitimacy of the results it produces, and the data must be correctly interpreted to achieve successful outcomes.
Many algorithmic traders will use Python to pre-process their data. During this process, you will be looking for any missing values, any data outliers, any data that doesn't have a numerical value, and any data formats that are different from each other.
Underneath each of these pre-processing categories are techniques you can use to ensure that your datasets are presented in the best way for your algorithm.
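Two of those techniques can be sketched in a few lines of Python: forward-filling missing values and min-max scaling so every feature shares a common range. The input list below is hypothetical and deliberately messy, mixing strings, floats, and a missing value; the sketch assumes the series does not start with a missing value:

```python
# Hypothetical pre-processing pass: coerce strings to floats,
# forward-fill missing values, then min-max scale to [0, 1].
raw = ["185.6", 184.2, None, "186.1", 185.0]

def to_float(value, previous):
    """Coerce to float; fall back to the previous value when missing."""
    if value is None:
        return previous
    return float(value)

filled = []
previous = None
for value in raw:
    previous = to_float(value, previous)
    filled.append(previous)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(filled), max(filled)
scaled = [(v - lo) / (hi - lo) for v in filled]
print(scaled)
```

Forward-filling is only one imputation choice; interpolation or dropping the row entirely may suit your model better, which is exactly the kind of decision these pre-processing categories ask you to make deliberately.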
Daniel Calugar says that it's important to keep in mind that all four of these major steps for managing market data -- collection, storage, cleaning, and pre-processing -- must be carried out consistently. Every time you pull in new data, you need to make sure it is stored, cleaned, and pre-processed correctly.
While each of these steps is highly customizable and will vary from one algo trading platform to another, they're all integral parts of efficient data processing for algorithmic trading.
About Daniel Calugar
Daniel Calugar is a versatile and experienced investor with a background in computer science, business and law. While working as a pension lawyer, he developed a passion for investing and leveraged his technical capabilities to write computer programs that helped him identify more profitable investment strategies. When Dan Calugar is not working, he enjoys working out, being with friends and family, and volunteering with Angel Flight.