Machine Learning Course – Deep Learning Foundations
Part of my New Year’s Resolution for 2017 was to commit to learning machine learning – through some combination of courses from EdX, Coursera, and Udacity.
I was looking into Udacity and considering the Machine Learning nanodegree program – but my hesitation was that, on average, it takes 12 months to complete, and I didn’t want to commit to something that long when I was mostly afraid I wouldn’t be able to get the assistance I needed whenever I got stuck with the code.
I had already started looking into Artificial Intelligence and Machine Learning courses on the various online learning platforms. I actually applied to the A.I. track at Udacity and was about to commit to that, but I realized that for my interest in the stock market, a machine learning course is more relevant to me than voice recognition and other aspects of general A.I.
So I started going through some of the Machine Learning courses at Udacity / Georgia Tech – covering various topics.
Udacity Deep Learning Foundations Nanodegree Program
Siraj is basically the new Bill Nye the Science Guy – except applied to machine learning. I’m a big fan of his enthusiasm for this subject.
As you may know, through this site, I am primarily interested in applying machine learning to pattern recognition in the stock market. Hence, my interest in taking this course.
I’m currently in week 2-3 of the course – just finished the first project. Here are a few things I can comment on:
They advertise it as something people can do part-time with just a few hours a week. Ha!
This course covers stuff that most PhD students spend full-time on – and they try to cram a lot in a really short amount of time.
Personally I’ve been spending around 20+ hours a week trying to figure things out.
On YouTube, they pump out a live video and a prepared video each week – the live one usually runs around 50 minutes, and the prepared one closer to 10.
But the course itself has a TON of extra material to go through.
I’d say if you simply go through the weekly videos and try to do them yourself – then yea, maybe 3-5 hours a week could work.
For me, the concern is less about the amount of material – and more about making sure I get the help I need to actually understand before moving on. I don’t think I’m quite there yet.
Previous Coding Experience Kind Of Needed
They advertise that you don’t need to know programming – or only very minimally. Not true. You should definitely be familiar with programming and have taken programming courses before.
I’ve taken CS106A and CS106B at Stanford – covering basic programming along with recursion and data structures.
There are lines of Python code that I still don’t fully get. I wish they did a better job explaining each line of code – translating it to English.
I find myself using the print() function a lot to try to decipher what each variable is. It would be helpful, for example, if they gave a tutorial on how to debug – how to use print statements or other techniques – as a way to help us figure out what’s going on.
It would be nice if I could somehow “step into” a line of code and find out a variable’s value without print statements. I know there are tools like this for other languages – perhaps there is one for Python too, but I haven’t come across it yet.
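It turns out Python does ship with a built-in debugger, pdb, which can pause a running script so you can inspect variables interactively. A minimal sketch (the forward_pass function here is made up just to illustrate – I haven’t used pdb in the course yet):

```python
import pdb

def forward_pass(x, weight):
    result = x * weight
    # Uncomment the next line to pause here; at the (Pdb) prompt you can
    # type x, weight, or result to see their current values, then c to continue.
    # pdb.set_trace()
    return result

print(forward_pass(3, 0.5))  # 1.5
```

Unlike sprinkling print() everywhere, the debugger lets you poke at any variable in scope without editing the code for each one.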
Lots of Math
Wow – within the first week – the level of math just hits you like that.
I’m generally pretty good with math – but understanding gradient descent, and the partial derivatives behind it, took real effort.
When multiplying two matrices together – it was a bit frustrating making sure the dimensions line up so that you can properly multiply them or take a dot product.
Oftentimes, when there’s a big chunk of code, it’s not easy to isolate a specific area, find out the matrix dimensions, and reverse-engineer how they fit together.
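For what it’s worth, here’s the kind of shape-checking I end up doing with print – a toy sketch with made-up dimensions, not the project’s actual shapes:

```python
import numpy as np

weights = np.random.normal(0.0, 1.0, (2, 3))  # e.g. 2 hidden nodes x 3 input features (made-up sizes)
inputs = np.array([[0.5], [0.1], [-0.2]])     # a 3 x 1 column vector

print(weights.shape)  # (2, 3)
print(inputs.shape)   # (3, 1)

# np.dot only works when the INNER dimensions match: (2, 3) . (3, 1) -> (2, 1)
hidden = np.dot(weights, inputs)
print(hidden.shape)   # (2, 1)
```

Printing `.shape` before a dot product is a quick way to catch dimension mismatches before NumPy throws an error.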
Forward and Back Propagation
Perhaps the most complex math – which I still don’t fully understand – is the backpropagation math. I understand what needs to be done – but I wouldn’t say I could yet explain the nuts and bolts of why to someone else.
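One piece I did manage to convince myself of is the sigmoid derivative identity that shows up in the backward pass: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). A quick numerical sanity check (my own sketch, not course code):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Analytic derivative via the identity: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
x = 0.5
analytic = sigmoid(x) * (1 - sigmoid(x))

# Numerical derivative via central differences, as a sanity check
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(abs(analytic - numeric) < 1e-9)  # True
```

This identity is why the backprop code later multiplies hidden_outputs by (1 - hidden_outputs) – that product is the slope of the sigmoid at each hidden node.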
I love Siraj’s Bill Nye-type style of explaining something much more complex than basic science concepts.
I also like how he types the code from scratch – though someone needs to dive in and further explain, in English, what each line of code means as he plows through – and how the different snippets interplay with each other.
It would be helpful if the same thing can be explained by like 5 different instructors — or even by other students.
Some of the Udacity lessons that are most helpful are the ones with animated diagrams. For example, the AND, OR, NOT perceptron animated diagrams and tables were very helpful.
I wish there was some animated diagram for each complex concept – particularly forward and backpropagation and gradient descent.
I initially liked Andrew Trask’s think-out-loud approach to sentiment analysis – but as I progressed through the mini-projects, I found I had a lot of questions about the code. He’s clearly an expert – in a way, I’d prefer if a dumbass (like myself, or dumber) explained these same concepts – without skipping over things that seem easy or basic to an expert.
My Critique: Some Are Not Fond of Questions
I don’t know about you – but I’m the kind of student who asks LOTS of questions. Even though I did very well in school, I continuously asked questions and have no shame in asking dumb ones. I’m not naturally smart like some people are – which is why I rely on asking these dumb questions over and over. I not only ask why something is the way it is – I often ask, why not another way?
One of the TA’s at Udacity told another student to NOT answer my question. His exact words were “don’t bother” – I believe — as in, don’t bother helping this guy out.
No matter how dumb my question may be – I don’t think Udacity staff should be telling students to not help me or each other out – no matter what the question is.
This makes me question the kind of online learning culture that some people at Udacity may be creating. Not everyone is like that – but one guy was definitely against me and the assumed “level of effort” that he was saying I wasn’t putting in – based on my dumb questions.
I understand they are trying to get lots of students – but are simultaneously trying to graduate lots of people to do big things in machine learning and for society, etc.
But really, given the level of difficulty here and the lack of patience for so-called dumb questions – I have to say this program is really geared toward advanced people. They should have a better way of allowing dumb questions to be asked – and create an environment where others can learn from them.
Beginners who watch Siraj’s videos – and find them entertaining – can be misled into thinking the material is easy, or that the environment is conducive to asking lots of dumb questions. It’s not.
Sometimes I feel that, because of the pressure of the weekly project deadlines, people desperately do whatever it takes to meet the deadline – without necessarily understanding, or taking the time to ask about, the details that make learning meaningful. I completed the project – but as with anything highly complex – there are some holes in my understanding that I hope to fill over time.
That said, I think their automated feedback is pretty good. Further below is the text portion of the Week 1 project feedback. From my understanding, it’s a mix of human and automated feedback.
There are some 5,000 people in this course from all around the world, which is impressive. I can’t imagine everyone else is breezing through.
Week 1 Project
So the first project actually covered about the first 2 weeks, not 1 – and the task was a monstrous one: design your very first neural network.
That’s a tall order. I’m writing this up in hopes of “better understanding what just happened.”
Below I mention “I’m not sure” – several times – indicating that I’m not sure why something is the way it is. If you have better understanding, please do share and I’ll be sure to update this post with new learnings!
So you have a bike shop – and you want to make sure you have enough bikes to service the biker demand for each specific day.
If you have too many bikes, then that means you have too much inventory of bikes that aren’t being fully utilized.
If you don’t have enough bikes, then there is demand from bikers that isn’t being met by your supply of bikes.
The sweet spot is somewhere in the middle – such that you have just enough bikes on each day – throughout the many seasons of the year and the various days of the week.
In general, once you have Anaconda and Jupyter Notebook set up (it’s a pain to install everything properly with all the various software versions – and it’s different for a Mac vs. a PC) – you open the Anaconda prompt, navigate to your folder, and type in “jupyter notebook” – this pops up a browser tab as follows:
The python file is: dlnd-your-first-neural-network.ipynb
The way it works with Jupyter Notebook is it sets up a localhost so that the URL you load looks something like this:
You do this while simultaneously having your Anaconda prompt window open in the background – but what you work off of is the browser tab.
Looking at the .CSV data file
We downloaded a project folder – that contains a Bike-Sharing-Dataset folder – which has hour.csv inside.
In order to read the file, we had to “import pandas as pd” – so we could use the pd.read_csv() function.
Since we stored the result of pd.read_csv() in a variable called “rides” – we can call “rides.head()” to look at the first few rows of the data and get a glimpse of what it looks like.
There’s a total of 17 columns in the data.
I also opened the CSV file in Excel – and there are 17,380 rows (including the header row). So the variable “rides = pd.read_csv(data_path)” effectively stored the contents of the .csv file into a 17,379 x 17 dataframe called “rides.”
I can confirm this by using the “.shape” after “rides” — WITHOUT the parenthesis “( )” – so it looks like this:
and the output would be:
— which makes sense: the header takes up one row in Excel, so the 17,379 data rows reported by “rides.shape” match the data ending on row 17,380 that I saw in Excel.
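For reference, here are those loading steps in one place. (I’m using a tiny stand-in DataFrame below so the snippet runs on its own – in the notebook, it’s rides = pd.read_csv(data_path) with data_path pointing at hour.csv, which loads as a 17,379 x 17 dataframe.)

```python
import pandas as pd

# Tiny stand-in for hour.csv -- the real file loads via:
#   rides = pd.read_csv('Bike-Sharing-Dataset/hour.csv')
rides = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-01'],
    'hr': [0, 1],
    'cnt': [16, 40],
})

print(rides.head())   # first few rows, a glimpse of the data
print(rides.shape)    # (2, 3) -- shape is an attribute, so no parentheses
```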
Making Sense of the Data
From the assignment:
“This dataset has the number of riders for each hour of each day from January 1 2011 to December 31 2012. The number of riders is split between casual and registered, summed up in the cnt column. You can see the first few rows of the data above.
Below is a plot showing the number of bike riders over the first 10 days in the data set. You can see the hourly rentals here. This data is pretty complicated! The weekends have lower overall ridership and there are spikes when people are biking to and from work during the week. Looking at the data above, we also have information about temperature, humidity, and windspeed, all of these likely affecting the number of riders. You’ll be trying to capture all this with your model.”
Each row = A New Hour / Day
Each row is a different hour — and eventually after 24 hours — or 24 rows of data – you reach the end of day 1 – and it goes into day 2 and so on.
For example, row 0 = day 1 — which is January 1, 2011. Notice the first many rows – are all January 1, 2011 — except each row represents a different hour of that day.
Notice the “hr” column changes for each row – that’s how each row of data differs. The first row is the first hour of that January 1 day – which, if you look at a historical calendar, was a Saturday – corresponding to the “weekday” column value of “6.”
So January 1, 2011 was a Saturday (weekday = 6) – and that specific day had various column attributes – including temperature [‘temp’], humidity, windspeed, etc. I’m not sure what “atemp” is.
I believe [‘holiday’] and [‘workingday’] are binary values – “0” if it is not a holiday or not a working day – and “1” if it is. Note that Saturday is not a working day – hence the “0” under workingday in the first row.
The last 3 columns on the right — [‘casual’], [‘registered’], and [‘cnt’] — are such that casual + registered = cnt
In other words, the number of casual bikers + the number of registered bikers = the total count of bikers that needed bikes that day.
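A quick way to verify that relationship yourself, on a toy version of those three columns:

```python
import pandas as pd

# Toy version of the last three columns, to check casual + registered == cnt
rides = pd.DataFrame({
    'casual': [3, 8, 5],
    'registered': [13, 32, 27],
    'cnt': [16, 40, 32],
})

print((rides['casual'] + rides['registered'] == rides['cnt']).all())  # True
```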
Plotting the Data
In order to plot the data, we had to have already imported the plotting library:
“import matplotlib.pyplot as plt”
The code would then be:
I believe the “rides[:24*10]” portion specifies the first 240 rows – 24 hours per day * 10 days = 240 rows of data – so the first 10 days of 2011, from January 1 to January 10, 2011.
The x-axis (“dteday”) represents the date – such as January 1, 2011.
The y-axis (“cnt”) represents the total count of bikers for that hour of that day.
So each date – January 1, January 2, January 3, and so on through January 10 – contains 24 data points, one for each hour. And, as you can imagine, the blue graph above goes up and down, dipping to 0 and rising again – the dip to 0 that repeats each day is probably the overnight low in biker demand.
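The plotting call itself looks roughly like this (sketched with a toy stand-in for rides so it runs on its own – the real notebook plots the actual dataframe and labels the x-axis with dates):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in: 240 hourly 'cnt' values, i.e. the first 10 days
rides = pd.DataFrame({'cnt': np.random.randint(0, 100, size=240)})

# Slice the first 24*10 = 240 rows and plot the 'cnt' column
ax = rides[:24*10].plot(y='cnt')
plt.savefig('first_10_days.png')
```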
So the data is inside the “rides” dataframe.
The next step I didn’t quite get – they manipulated the “rides” data and changed the structure around so that there were more columns, making it look like this – I’m not sure why they did this:
Once I figure out why this step was done, I’ll report back.
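My current best guess – unverified – is that this step is “one-hot encoding”: turning categorical columns like hr or weekday into separate 0/1 columns with pandas’ get_dummies, so the network doesn’t treat category numbers as quantities. A tiny sketch with made-up data:

```python
import pandas as pd

# Toy categorical data: two 'hr' values and a 'weekday' column
data = pd.DataFrame({'hr': [0, 1, 0], 'weekday': [6, 6, 0]})

# One-hot encode 'hr': one 0/1 column per distinct hour value
dummies = pd.get_dummies(data['hr'], prefix='hr')
print(dummies.columns.tolist())  # ['hr_0', 'hr_1']

# Attaching the dummies is what makes the frame grow MORE columns
data = pd.concat([data, dummies], axis=1)
print(data.columns.tolist())     # ['hr', 'weekday', 'hr_0', 'hr_1']
```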
Splitting the data into training, testing, and validation sets
Out of the 2 years’ worth of data from 2011 through the end of 2012, we first set aside the last 21 days as the testing set.
Then, out of the remaining data, we held out the last 60 days – roughly two months – as the validation set, for use after training.
Everything before that – roughly the first 21 months of data – is the training set.
I don’t yet quite fully grasp why we chose this specific split. It almost seems like the training data set is a subset of the validation set. I’ll be sure to report back when I better understand this.
But in terms of code – how do we allocate the last 21 days to be the testing data set?
Recall that we are no longer using the “rides” dataframe – after the manipulation above, it’s now the “data” dataframe:
# Save the last 21 days
test_data = data[-21*24:]
data = data[:-21*24]
Recall that before we used “rides[:24*10]” – to look at the first 240 rows
I’m not sure why the numbers are sometimes to the LEFT of the “:” and sometimes to the RIGHT.
But it appears the negative number indicates that we are referring to the LAST (as opposed to FIRST) – rows of data.
test_data = data[-21*24:]
means that test_data begins 21*24 = 504 rows from the end – the last 21 days of hourly data – and proceeds to the end of the data set.
data = data[:-21*24]
means that “data” starts from the beginning and stops 504 rows (21 days) before the end.
That’s my best guess for the meaning of the placement of the “:” symbol in the code.
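Here’s a toy example of that slicing behavior, using a plain list of 10 “days” instead of the hourly dataframe:

```python
# Toy example: a list of 10 "days" instead of the hourly dataframe
data = list(range(10))

test_data = data[-3:]   # number LEFT of ":" is the start; -3 means "3 from the end"
remaining = data[:-3]   # number RIGHT of ":" is the stop; everything BEFORE the last 3

print(test_data)   # [7, 8, 9]
print(remaining)   # [0, 1, 2, 3, 4, 5, 6]
```

So the position relative to “:” is start vs. stop, and the minus sign means “count from the end.”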
Last 60 Days for Validation Data Set
So for the last 60 days of data that we use for the validation set, the code is:
# Hold out the last 60 days of the remaining data as a validation set
train_features, train_targets = features[:-60*24], targets[:-60*24]
val_features, val_targets = features[-60*24:], targets[-60*24:]
I’m not sure what the difference between “features” and “targets” is. I’ll update when it makes more sense.
Goal: Use Machine Learning to Predict Hourly Biker Demand for Dec 10-31, 2012
Our goal was to train the machine on historical data from Jan 1, 2011 – Dec 9, 2012 – on what biker demand was in the past – and then use machine learning to predict, for every hour of every day in the last 21 days of the data set (Dec 10–31, 2012), what the biker demand will be.
Compare this hourly prediction with what we know actually happened during those last 21 days of 2012.
Since we know what biker demand actually was – we can compare our machine-generated prediction (blue below) with what actually happened (green below) – and see whether they line up.
After implementing the neural network code, here’s my resulting graph – notice how the blue prediction roughly lines up with what actually happened – except it drifts off toward the end of the year, likely due to the Christmas and New Year’s holiday schedule. The data shows there was less bike demand during those days.
The machine learned from historical data – based on what bike demand looked like from January 2011 to December 2012, it was able to predict what demand would be. Since the daily pattern of peaks and troughs repeated for 700+ days, the network generated a prediction with similar cycles.
The code they used to generate this graph is quite complicated – I didn’t write this portion of the code:
fig, ax = plt.subplots(figsize=(8,4))
mean, std = scaled_features['cnt']
predictions = network.run(test_features)*std + mean
ax.plot((test_targets['cnt']*std + mean).values, label='Data')
dates = pd.to_datetime(rides.ix[test_data.index]['dteday'])
dates = dates.apply(lambda d: d.strftime('%b %d'))
_ = ax.set_xticklabels(dates[12::24], rotation=45)
I would like to eventually understand each line of code above that was used to generate the chart. At the moment, I’m still deciphering.
The code I did write was the code below that involves a lot of math in context of the neural network with forward and backpropagation.
Here’s My Code
class NeuralNetwork(object):
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_input_to_hidden = np.random.normal(0.0, self.hidden_nodes**-0.5,
                                                        (self.hidden_nodes, self.input_nodes))
        self.weights_hidden_to_output = np.random.normal(0.0, self.output_nodes**-0.5,
                                                         (self.output_nodes, self.hidden_nodes))
        self.lr = learning_rate

        # Activation function is the sigmoid function
        self.activation_function = self.sigmoid

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def train(self, inputs_list, targets_list):
        # Convert inputs and targets to 2d arrays (column vectors)
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T

        ### Forward pass ###
        hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)          # signals into hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)              # signals from hidden layer

        final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)  # signals into final output layer
        final_outputs = final_inputs                                          # output layer is linear (no activation)

        ### Backward pass ###
        # Output layer error is the difference between desired target and actual output
        output_errors = targets - final_outputs
        output_grad = 1  # derivative of the linear output activation is 1

        # Errors propagated back to the hidden layer
        hidden_errors = np.dot(self.weights_hidden_to_output.T, output_errors)
        hidden_grad = hidden_outputs * (1.0 - hidden_outputs)  # sigmoid derivative at the hidden layer

        # Update the weights with a gradient descent step
        self.weights_hidden_to_output += self.lr * np.dot(output_errors * output_grad, hidden_outputs.T)
        self.weights_input_to_hidden += self.lr * np.dot(hidden_errors * hidden_grad, inputs.T)

    def run(self, inputs_list):
        # Run a forward pass through the network
        inputs = np.array(inputs_list, ndmin=2).T

        hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)          # signals into hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)              # signals from hidden layer

        final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)  # signals into final output layer
        final_outputs = final_inputs                                          # linear output

        return final_outputs
So above I defined the NeuralNetwork class, which begins with:
class NeuralNetwork(object):
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
In order to run the neural network, we have to create an instance of that class and specify the parameters, like so:
network = NeuralNetwork(N_i, hidden_nodes, output_nodes, learning_rate)
Each parameter lines up with the __init__ function above – so “N_i” matches with “input_nodes”, “hidden_nodes” matches with “hidden_nodes”, etc.
To run / make predictions:
predictions = network.run(test_features)*std + mean
The Train Function
So inside the class NeuralNetwork() – we defined a train() function that has 2 parameters, inputs_list and targets_list:
def train(self, inputs_list, targets_list):
And we ran it by calling network.train(record, target).
class NeuralNetwork(object):
    ...
    def train(self, inputs_list, targets_list):
        # Convert inputs and targets to 2d arrays (column vectors)
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T
Intuition: What exactly is “inputs” and what is “targets”?
I’m pretty sure I’m missing something – I’m not sure why both inputs_list and targets_list have to be converted to 2-dimensional arrays.
Target is the desired value – what the actual data says – like the fact that there were 16 bikers in the first hour of day 1. That’s the target.
I’m less sure about Inputs – I believe you need the Inputs as a vector or matrix numerical representation of the data. The entire “rides” dataframe is effectively the Input – each row of data is a “record” – hence when we run “network.train(record, target)”, each record, or row of data, is passed in as a vector of Inputs.
That numerical vector representation – each number representing a different column header, such as weather, temperature, etc.
So that entire first row of data is your inputs_list – and I believe these numbers are “normalized” – so what matters is the relative distance between data values rather than the actual numbers themselves.
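On the “normalized” point: looking back at the plotting code earlier – where scaled_features['cnt'] holds a mean and a std – my guess is the scaling is actually standardization (subtract the mean, divide by the standard deviation) rather than squashing into 0–1. A sketch with made-up counts:

```python
import numpy as np

# Standardizing a feature: subtract the mean, divide by the standard deviation
cnt = np.array([16.0, 40.0, 32.0, 13.0])
mean, std = cnt.mean(), cnt.std()
scaled = (cnt - mean) / std

print(abs(scaled.mean()) < 1e-12)     # True: standardized data has mean ~0
print(abs(scaled.std() - 1) < 1e-12)  # True: and standard deviation ~1

# To convert a prediction back to real bike counts (as in the plotting code):
# prediction * std + mean
```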
Each numerical value corresponds to some kind of concept or feature.
But we don’t know which of these columns is most important in determining the thing we care about – which is total counts [‘cnt’] of bikers for each hour of the day.
How Much Weight?
In other words, we don’t know how much WEIGHT to place on each of these variables in terms of predicting the TOTAL COUNT.
We know intuitively that the [‘hr’] HOUR of the day matters – so I would imagine the weighting here should be stronger than it is for, say, [‘windspeed’].
This process of trying thousands and thousands of combinations of weights for each of the dozens of data columns that we have – is effectively what the neural network accomplishes.
The neural network initially tests a random weight combination given the input data – and then sees how OFF this prediction is – or how much ERROR there is compared to the actual training data set.
Given this error – which isn’t just a number, but rather a function that can be graphed in n dimensions, changing according to how much weight you place on different variables – how can we MINIMIZE it?
We minimize it by shifting the weight combination in such a way that we incrementally move towards the MINIMUM error value of this error function. This process is called gradient descent.
So in minimizing the error function, we use calculus – specifically, partial derivatives with respect to a whole series of weights for each of many variables – to tweak the weights so that the little white ball in the classic gradient descent animation eventually “descends” down this n-dimensional surface and reaches the lowest point.
In the process, the mathematics allow us to take the slope (gradient) at any specific point on that surface. Once we know that slope – we know in which direction we should tweak our weights such that our next iteration or step results in lower error.
Specifically, if the slope is positive – that means we are too far to the right – and that means we need to move to the left.
If the slope is negative – that means we are too far to the left – and that means we need to move to the right.
So the formula for adjusting weights – involves a NEGATIVE sign – which takes this into account.
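Here’s a tiny 1-D sketch of that sign logic, using a made-up error function error(w) = (w - 3)^2, whose slope is 2(w - 3):

```python
# Made-up error function: error(w) = (w - 3)**2, with slope 2 * (w - 3)
def slope(w):
    return 2 * (w - 3)

w = 10.0             # start too far to the RIGHT, so the slope is positive
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * slope(w)   # the NEGATIVE sign moves us downhill

print(round(w, 3))   # 3.0 -- settles at the minimum of the error function
```

Starting from the left of the minimum works too: there the slope is negative, so subtracting it pushes w to the right.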
I’ll have to go into further detail in a separate post dedicated to gradient descent.
Gradient descent lets us home in on good weights across each of the n dimensions efficiently – without brute-forcing thousands or millions of weight combinations. That numerical representation of weights lives in the weights vector.
The exact math corresponding to each line of code gets complicated – but that’s the general intuition in terms of my understanding.
Huge, huge, huge potential with the intended vision. I like the ambition and there are so many applications. The goal of bringing this power to a larger subset of people beyond PhDs is very compelling.
But there’s got to be a better way to onboard and better explain the inner intricacies and details of every step.
As both a teacher and a student, I can see things from both sides. Having taught thousands of students how to ace the GMAT exam – covering complex reasoning questions, combinatorics / permutations, etc. – I can say it’s not easy, but definitely doable, to explain things so that regular people can understand. Machine learning is definitely more complex than the GMAT, but because the demand is there – it’s totally worth spending the time and effort to explain things in detail from multiple angles.
I personally find machine learning fascinating, but I have a strong desire to fully understand what I’m working with. At the moment, I feel like I’m playing with fire – I know fire can be used for lots of things, but I don’t yet know how to control it or understand its origins.
That said, I am hopeful for filling the gaps in my understanding. The breadth and size of what this course covers is very ambitious and it’s definitely an exciting and interesting experience to be part of a course that has so much world-wide interest.
Even as I progress through the next few weeks, I will find time to revisit things I “should already know” – to see if it makes more sense to me as I progress through the course.
I will have to make a post on gradient descent and forward/backpropagation as that’s the meat of machine learning and I want to make sure I really understand.
My experience is by no means representative of every student. In the midst of progressing through the course, I will be quite busy trying to learn stuff, but I hope to find time to document stuff on this blog when I can.
Allen is an entrepreneur and amateur machine learning enthusiast. His career has spanned management consulting with Booz & Co to derivatives trading on Wall Street and even mobile product management at TripAdvisor. But his biggest impact has been as founder of GMATPill.com – an online GMAT course that has helped thousands of students pass the rigorous GMAT exam used in MBA admissions. He received a B.S. in Management Science from Stanford University with a focus on Finance and Decision engineering. He was actually in the audience during Steve Jobs’ now famous graduation speech. This trading education blog is partly a result of the inspiration from that speech. One day, Allen hopes to incorporate machine learning into stock market pattern recognition so he can automate a lot of the manual pattern recognition that he is doing on a daily basis.
Here at LST101, Allen advises pro traders, high net-worth individuals, and hedge fund managers with his expert wave analysis on the S&P500. Now anyone, including amateurs, can subscribe to his Trade Of the Week premium service to learn exactly how he is trading today’s market, week after week.