Project Title: Stock Market Trend Prediction

Group Members: Richard Gates Porter

Contact email: RichardPorter2016@u.northwestern.edu

Northwestern University

EECS 349: Machine Learning

The goal of my final project was to apply machine learning techniques to predicting both next-day and long-term price trends for several stocks. I focused on predicting the long-term prices of Apple (AAPL) stock and The Coca-Cola Co (KO) stock, and on predicting next-day prices for Charter (CHTR) stock and Delta Air Lines (DAL) stock. This task is interesting for an obvious reason: everyone wants to predict the stock market in order to make money. It is also interesting for a less obvious reason: I wanted to see how the machine learning algorithms we studied this quarter (along with some new ones) perform on a variety of stocks, in both the short term and the long term, because different factors affect short-term and long-term stock prices in different ways.

I focused on optimizing three learners for my final project, using implementations from the Python library scikit-learn and the Weka toolkit. The three learners I tuned for predicting financial data were logistic regression, support vector machines (SVM), and quadratic discriminant analysis (QDA). I experimented with fifteen different features, including but not limited to the P/E ratio, PX EBITDA, PX volume, and the current enterprise value (described in more detail in the final report).
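For reference, the sketch below shows roughly how the scikit-learn versions of these three learners can be instantiated. The hyperparameter values are illustrative defaults rather than my tuned settings, and the Weka runs are not reproduced here.

    # A minimal sketch of the three scikit-learn classifiers compared in this project.
    # Hyperparameters shown are illustrative defaults, not the tuned values from the report.
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    learners = {
        "logistic_regression": LogisticRegression(C=1.0),
        "svm": SVC(kernel="rbf", C=1.0),
        "qda": QuadraticDiscriminantAnalysis(),
    }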

I collected the data in CSV form using Bloomberg's API, Quandl's database, and (while I was still settling on the feature set) Yahoo Finance. From these sources, the input to my learners was data on the stocks mentioned above from the past six years (2009-2015). The output is either "+" or "-": if the stock price rose over the relevant prediction period (the next day for the next-day task, or anywhere from 10 to 80 days for the long-term task), the label was a plus sign ("+"); otherwise it was a minus sign ("-"). I measured the success of each learner by its accuracy on the test data. For reasons detailed in the final report, I considered a model somewhat successful if it achieved over 60% accuracy.
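The sketch below shows one way this labeling and evaluation setup can be put together with pandas and scikit-learn. The file name, the column names (e.g. PX_LAST, PE_RATIO), and the chronological 70/30 split are assumptions for illustration, not the exact pipeline described in the report.

    # Sketch of the labeling and accuracy measurement, assuming a daily CSV export
    # with a closing-price column "PX_LAST" and a few of the 15 feature columns.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("aapl_2009_2015.csv")       # daily data exported from Bloomberg/Quandl

    horizon = 35                                 # next-day prediction would use horizon = 1
    future_price = df["PX_LAST"].shift(-horizon)
    df["label"] = (future_price > df["PX_LAST"]).map({True: "+", False: "-"})
    df = df.iloc[:-horizon]                      # drop rows with no future price to compare against

    feature_cols = ["PE_RATIO", "PX_EBITDA", "PX_VOLUME", "CUR_ENTERPRISE_VALUE"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df["label"], test_size=0.3, shuffle=False  # chronological split
    )

    model = SVC(kernel="rbf").fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))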

The results of my project were, by this baseline, relatively successful, given that this is a challenging task. Most of the success came in long-term prediction (AAPL and KO stock) rather than next-day prediction. My best next-day result, an SVM on CHTR data with 56.2% accuracy, was barely better than ZeroR's 54.8% accuracy on the same data. SVM was also the source of the greatest success in long-term prediction: for example, it reached 69.4% accuracy on AAPL data with a 35-day time span, about 10 percentage points above ZeroR's 59.6%. The best performance was achieved when all 15 features were used, and current enterprise value was found to be the most helpful single feature.

Number of features vs. accuracy (SVM, long-term, AAPL):

Features:   1      2      3      4      5      6      7      8
Accuracy:   57.2%  56.9%  57.9%  59.1%  60.0%  63.4%  62.2%  64.3%

Features:   9      10     11     12     13     14     15
Accuracy:   65.9%  66.8%  68.9%  68.3%  67.2%  67.3%  69.4%

The table above shows that accuracy is maximized when all 15 features are used.
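For illustration, a sweep like the one behind this table can be produced by training on progressively larger feature subsets. The feature ordering and the use of 5-fold cross-validation here are assumptions, not the exact evaluation procedure from the report.

    # Sketch of a feature-count sweep: train an SVM on the first k features for
    # each k and record mean accuracy. X is a DataFrame with the 15 feature
    # columns, y the "+"/"-" labels, ordered_features a list of column names.
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def accuracy_by_feature_count(X, y, ordered_features):
        results = {}
        for k in range(1, len(ordered_features) + 1):
            scores = cross_val_score(SVC(kernel="rbf"), X[ordered_features[:k]], y, cv=5)
            results[k] = scores.mean()
        return results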

(My extended report is available on the next page of my WordPress site.)