Machine Learning
Gradient Descent dependence on order of training data
Problem
Which of the following optimization algorithms does not depend on the order of training data?
Options
Answer
Solution
Full-batch gradient descent processes the entire dataset in every iteration: the gradient is the sum (or average) of the per-example gradients, and since addition is commutative, the order of the training examples does not matter. Stochastic and mini-batch gradient descent, by contrast, update the parameters after each example or batch, so their result depends on the order of the data.
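The point above can be sketched in a few lines (pure Python, with a hypothetical toy dataset): the full-batch gradient of the squared error for a one-parameter linear model is a sum over examples, so shuffling the data leaves it unchanged, whereas a one-epoch SGD pass lands on a different weight when the order changes.

```python
# Full-batch gradient of 0.5 * mean((w*x - y)^2) for a 1-D linear model.
# The gradient is a sum over examples, and sums are commutative, so the
# order of the training data has no effect.
def batch_gradient(w, data):
    n = len(data)
    return sum((w * x - y) * x for x, y in data) / n

# Hypothetical toy dataset (x, y pairs), roughly y = 2x.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]
shuffled = [data[2], data[0], data[3], data[1]]

g1 = batch_gradient(0.5, data)
g2 = batch_gradient(0.5, shuffled)
assert abs(g1 - g2) < 1e-12  # identical regardless of order

# Stochastic gradient descent updates after every example, so the weight
# reached after one epoch depends on the order the examples arrive in.
def sgd_epoch(w, data, lr=0.1):
    for x, y in data:
        w -= lr * (w * x - y) * x
    return w

w_a = sgd_epoch(0.5, data)
w_b = sgd_epoch(0.5, shuffled)
# w_a and w_b generally differ
```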
Need for feature scaling
Problem
Which of these methods does not require you to do feature scaling on the input data?
Options
Answer
Solution
Decision trees don’t require feature scaling because they use a rule-based approach: each split compares a single feature against a threshold rather than computing distances between points. Rescaling a feature simply rescales the corresponding thresholds without changing any prediction, whereas distance-based methods (e.g. k-nearest neighbours, SVMs) are sensitive to the relative scales of the features.
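A small sketch of the contrast (pure Python, made-up numbers): a threshold rule gives identical predictions after a feature is rescaled, while the nearest neighbour under Euclidean distance can flip when one feature's scale dominates.

```python
# A tree split is a threshold rule: scaling a feature by 10 just turns the
# rule "x > 1.0" into the equivalent rule "x > 10.0".
def stump_predict(x, threshold):
    return 1 if x > threshold else 0

points = [0.2, 0.4, 3.0, 5.0]
orig = [stump_predict(x, 1.0) for x in points]
scaled = [stump_predict(10 * x, 10.0) for x in points]
assert orig == scaled  # identical predictions

# Nearest neighbour by Euclidean distance, by contrast, is scale-sensitive.
def nearest(query, data):
    return min(data, key=lambda p: sum((q - v) ** 2 for q, v in zip(query, p)))

# Hypothetical 2-feature points; the second feature has a much larger scale.
train = [(1.0, 100.0), (5.0, 110.0)]
query = (1.2, 108.0)
n_raw = nearest(query, train)          # second feature dominates the distance

train_s = [(x, y / 100) for x, y in train]
query_s = (1.2, 1.08)
n_scaled = nearest(query_s, train_s)   # after scaling, the first feature matters
# n_raw and n_scaled correspond to different training points
```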
Random Forest vs Boosting
Problem
Which statement is false when comparing the random forest algorithm with gradient boosting methods?
Options
Answer
Solution
Both random forest and gradient boosting are ensemble methods, but they build their ensembles differently. A random forest trains many trees independently and relies on the (weak) law of large numbers for its accuracy: it takes the mode of the predictions made by the individual trees (majority voting) and returns that as its final prediction. In gradient boosting, the trees are not built independently; they are built sequentially, with each tree effectively learning from the mistakes of the ones that came before it.

Because a random forest trains each tree independently, it can afford deep trees: deep trees have low bias and high variance, and the high variance is reduced by model averaging. In gradient boosting, on the other hand, a single deep tree would overfit and get stuck in a local minimum very quickly, so it is better to use shallower trees: each has low variance due to its low complexity, and the large number of trees reduces the bias.

Finally, gradient boosting uses gradients to optimize the loss, so a learning rate is required to control the step size, whereas random forest uses bagging, which does not require a learning rate.
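The sequential, residual-correcting behaviour described above can be sketched in miniature (pure Python, toy 1-D data, with depth-1 stumps standing in for shallow trees): gradient boosting with squared loss fits each new stump to the residuals of the current ensemble, scaled by a learning rate.

```python
# Fit the best single-threshold regression stump to residuals rs
# (exhaustive search over thresholds at the data points — a toy version
# of what a real tree learner does).
def fit_stump(xs, rs):
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - (lm if x <= t else rm)) ** 2 for x, r in zip(xs, rs))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

# Gradient boosting with squared loss: each round fits a stump to the
# residuals (the negative gradient) and adds it scaled by a learning rate.
def boost(xs, ys, n_rounds=20, lr=0.3):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        s = fit_stump(xs, residuals)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Hypothetical toy data, roughly y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
model = boost(xs, ys)
# Training error shrinks round by round; each stump corrects the last.
```

A random forest would instead fit many independent (and typically deep) trees on bootstrap samples and simply average them, with no learning rate involved.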
Feature selection methods in Machine Learning
Problem
Which of the following algorithms are not suitable for feature selection before training a machine learning model?
Options
Answer
Solution
Both the Chi-Square test and ANOVA are statistical tests that select features on the basis of their association with the outcome variable. Lasso regression shrinks the coefficients of irrelevant features exactly to zero, removing them altogether, which yields a useful set of selected features. Ridge regression, on the contrary, shrinks the coefficient estimates towards zero but never exactly to zero, which is why it is not suitable for feature selection.
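The contrast has a clean closed form in the simplest setting (a single standardized predictor with unit squared norm), which this sketch uses; the coefficient and penalty values are illustrative. Ridge divides the OLS coefficient by (1 + lambda), so it shrinks but never reaches zero, while lasso applies soft-thresholding, snapping weak coefficients exactly to zero.

```python
# Closed-form solutions for one standardized feature (unit squared norm).
def ridge_coef(b_ols, lam):
    # Ridge shrinks multiplicatively: never exactly zero for b_ols != 0.
    return b_ols / (1.0 + lam)

def lasso_coef(b_ols, lam):
    # Soft-thresholding: coefficients smaller than lambda in magnitude
    # are set exactly to zero — this is what makes lasso select features.
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0

weak = 0.3    # small OLS coefficient for an irrelevant feature (hypothetical)
strong = 2.0  # large OLS coefficient for a relevant feature (hypothetical)
lam = 0.5

assert ridge_coef(weak, lam) != 0.0   # shrunk to 0.2, but still in the model
assert lasso_coef(weak, lam) == 0.0   # removed entirely
assert lasso_coef(strong, lam) == 1.5 # strong feature kept, shrunk by lambda
```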