EECS 349 Final Project: Stock Market Price Prediction
Project Title: Stock Market Trend Prediction
Group Members: Richard Gates Porter
Contact email: RichardPorter2016@u.northwestern.edu
Northwestern University
EECS 349: Machine Learning
The goal of my final project was to apply machine learning techniques to predicting both next-day and long-term price trends in several stocks. I focused on predicting the long-term prices of Apple (AAPL) stock and The Coca-Cola Co (KO) stock, while trying to predict next-day prices for Charter (CHTR) stock and Delta Air Lines (DAL) stock. This task is interesting for an obvious reason: everyone wants to predict the stock market so they can make money. But it is also interesting for a less obvious reason: I wanted to see how the ML algorithms we have been studying for the last quarter (and some new ones) perform differently in predicting the prices of a variety of stocks, over both short and long horizons, because different factors affect short-term and long-term stock prices.
I focused on optimizing three learners for my final project, implemented in the Python library Scikit-learn and in Weka. The three learners whose performance I optimized on financial data were logistic regression, support vector machines (SVM), and quadratic discriminant analysis (QDA). I experimented with fifteen different features, including but not limited to the PE ratio, PX ebitda, PX volume, and the current enterprise value (detailed further in the final report).
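As a rough illustration of this setup, the sketch below trains the three classifier families named above using Scikit-learn. It is not the project's actual pipeline: the feature matrix and labels here are synthetic placeholders standing in for the fifteen real financial features.

```python
# Minimal sketch (not the project's actual code): fitting the three
# learner families -- logistic regression, SVM, and QDA -- with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))                         # 15 features, as in the project
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # synthetic +/- labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("SVM", SVC()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(X_train, y_train)
    results[name] = clf.score(X_test, y_test)          # test-set accuracy
    print(name, round(results[name], 3))
```

Each learner exposes the same `fit`/`score` interface, which is what made it easy to compare them on identical train/test splits.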
I collected the data in CSV form using Bloomberg’s API, Quandl’s database, and (while I was still settling on the feature set) Yahoo Finance. From these sources, the input to my learners was data on the stocks mentioned above, taken from the past six years (2009-2015). My output is either “+” or “-”: if the stock price rose over the relevant time period (the next day, or anywhere from 10 to 80 days, respectively), the label was a plus sign (“+”); otherwise, it was a minus sign (“-”). I measured the success of the learners by their accuracy on the test data. For reasons detailed in the final report, I considered a model somewhat successful if it reported over 60% accuracy.
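The +/- labeling rule described above can be sketched as a small helper. The function name and inputs are my own; it compares each closing price to the price `horizon` trading days later (horizon 1 for next-day, 10-80 for long-term).

```python
# Hedged sketch of the +/- labeling rule: a label is '+' when the price
# `horizon` days later is higher than today's price, '-' otherwise.
def trend_labels(closes, horizon):
    """Return '+'/'-' trend labels over the given horizon (my own naming)."""
    return ['+' if closes[i + horizon] > closes[i] else '-'
            for i in range(len(closes) - horizon)]

print(trend_labels([100, 101, 99, 103], 1))  # → ['+', '-', '+']
```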
The results of my project were, by this baseline, relatively successful, given that this was a challenging task. Most of my success was found in long-term prediction (AAPL stock and KO stock) rather than in next-day prediction. My best next-day result, given by SVM on CHTR stock with 56.2% accuracy, was barely better than ZeroR’s accuracy of 54.8% on the same data. SVM was also the source of the greatest success in long-term prediction: for example, the learner returned an accuracy of 69.4% on AAPL data over a 35-day time span, about 10 percentage points greater than ZeroR’s accuracy of 59.6%. The best performance was achieved when all 15 features were used, and “current enterprise value” was found to be the most helpful feature.
Number of features vs. accuracy (SVM, long-term, AAPL):

Features:  1      2      3      4      5      6      7      8
Accuracy:  57.2%  56.9%  57.9%  59.1%  60.0%  63.4%  62.2%  64.3%

Features:  9      10     11     12     13     14     15
Accuracy:  65.9%  66.8%  68.9%  68.3%  67.2%  67.3%  69.4%

The table above shows that accuracy is maximized at 15 features.
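The feature-count sweep summarized in that table could be reproduced with a loop like the one below. This is my own reconstruction, not the project's code, and it uses synthetic data in place of the real AAPL features; with the real data, each printed accuracy would correspond to one table cell.

```python
# Sketch (my own reconstruction) of sweeping the number of features from
# 1 to 15 and recording SVM test accuracy for each subset size.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 15))               # placeholder for 15 real features
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # synthetic +/- labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

accs = []
for k in range(1, 16):                       # train on the first k features only
    acc = SVC().fit(X_tr[:, :k], y_tr).score(X_te[:, :k], y_te)
    accs.append(acc)
    print(k, round(acc, 3))
```

A real version would also need to fix the order in which features are added, since accuracy at each k depends on which k features are chosen.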
(My extended report is available on the next page of my WordPress site.)