Rather than diving straight into technical jargon, let us understand the idea through a more practical, everyday situation.
Suppose you are a mathematics teacher and, being a good teacher, you want your student to learn a new concept. To tackle this, you buy him a book comprising example and exercise problems. Now there are a couple of methods you can choose from to teach your student.
You can either teach him all the concepts at once and take one single test to evaluate his skills. Or you can divide the book into ‘k’ chapters and take a test after he finishes each chapter. Following this second method, you test his skills in a modular way, over a wide range of problems, and thus get detailed insight into his ability.
Cross-validation, or rotation estimation, in the field of machine learning works in a similar fashion. It is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate it. In typical cross-validation, the training and validation sets cross over in successive rounds such that each data point gets a chance to be validated against.
Introduction to K-Fold Cross Validation:
The basic form of cross-validation is k-fold cross-validation, similar to the k chapters of our student's book. Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of it. In k-fold cross-validation, the data is first partitioned into k equally (or nearly equally) sized segments, or folds (our chapters). Subsequently, k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining k-1 folds are used for learning. In data mining and machine learning, 10-fold cross-validation (k = 10) is the most common choice.
Cross-validation is used to evaluate or compare learning algorithms as follows: in each iteration, one or more learning algorithms use k-1 folds of data to learn one or more models, and subsequently the learned models are asked to make predictions about the data in the validation fold. The performance of each learning algorithm on each fold can be tracked using some predetermined performance metric, such as accuracy. Upon completion, k samples of the performance metric will be available for each algorithm. Different methodologies such as averaging can be used to obtain an aggregate measure from these samples, or the samples can be used in a statistical hypothesis test to show that one algorithm is superior to another.
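The procedure above can be sketched in a few lines of scikit-learn. This is only an illustration of the per-fold bookkeeping: the iris data set and the choice of logistic regression are assumptions made for the example, not part of the original discussion.

```python
# A minimal sketch of tracking a per-fold metric and averaging it.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # learn on k-1 folds
    preds = model.predict(X[test_idx])           # predict the held-out fold
    scores.append(accuracy_score(y[test_idx], preds))

print(len(scores))      # 10 samples of the metric, one per fold
print(np.mean(scores))  # a simple aggregate: the average accuracy
```

The mean of the k per-fold scores is the aggregate measure mentioned above; the list of individual scores is what a hypothesis test would consume.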
We can state our goals more precisely:
- To estimate the performance of a model learned from the available data using one algorithm; in other words, to gauge the generalizability of an algorithm.
- To compare the performance of two or more different algorithms and find the best one for the available data, or alternatively to compare the performance of two or more variants of a parameterized model.
Though presented as two different points, the goals are highly related, since the second is automatically achieved if one can accurately estimate performance.
This can be achieved through several different validation schemes.
- Resubstitution Validation is a naive approach that suffers from over-fitting and large variance, since it uses all the available data for training and then uses the same data for testing. It is analogous to asking our student exam questions that are identical to the worked examples.
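A toy sketch of why resubstitution is misleading: a model that can memorise the training data scores deceptively well when tested on that same data. The decision tree and the iris data set here are illustrative assumptions.

```python
# Resubstitution validation: train and test on the SAME data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)  # an unpruned tree can memorise the data
clf.fit(X, y)

# Scoring on the training data itself gives an optimistically high accuracy,
# which says little about performance on unseen data.
print(clf.score(X, y))
```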
- Hold-Out Validation avoids the over-fitting problem of resubstitution: an independent test set is obtained by splitting the data-set into two non-overlapping sets, one used solely for training and the other for testing. This approach handles large data-sets efficiently thanks to its computational simplicity. It can be a one-stop solution when ample data is available, such as a weather-monitoring system that predicts the probability of rainfall from pressure and temperature data collected over a week.
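A hold-out split is one line with scikit-learn's `train_test_split`. The 80/20 ratio and the k-nearest-neighbours model below are assumptions chosen for illustration.

```python
# A minimal hold-out validation sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data as a non-overlapping test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(model.score(X_test, y_test))   # accuracy on the untouched hold-out set
```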
- K-Fold Cross-Validation, as discussed earlier in this article, partitions the data-set into k equally (or nearly equally) sized segments or folds. Subsequently, k iterations of training and validation are performed such that within each iteration a different fold is held out for validation while the remaining k-1 folds are used for learning. Data is commonly stratified prior to being split into k folds. Stratification is the process of rearranging the data so that each fold is a good representative of the whole. This method provides better insight into the data and yields a usable model even from small, limited data, at the cost of heavier computation. A sub-variant is Repeated K-Fold Cross-Validation, where the data is reshuffled and re-stratified before each round of cross-validation.
```python
# Simple code snippet on k-fold cross-validation
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])  # input array
y = np.array([1, 2, 3, 4])                      # output array

kf = KFold(n_splits=2)  # define the split - into 2 folds
kf.get_n_splits(X)      # returns the number of splitting iterations in the cross-validator

for train, test in kf.split(X):
    print("TRAIN:", train, "TEST:", test)
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
```
Here is the output:
```
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
```
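The stratified and repeated variants mentioned above are available directly in scikit-learn. The tiny two-class data set below is an illustrative assumption; the point is that each stratified test fold mirrors the overall 50/50 class balance, and that `RepeatedKFold` reshuffles before every round.

```python
# A sketch of stratified and repeated k-fold splitting.
import numpy as np
from sklearn.model_selection import RepeatedKFold, StratifiedKFold

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 50/50 class balance

skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    # each test fold contains one sample of each class,
    # preserving the 50/50 ratio of the whole data-set
    print("TEST:", test, "labels:", y[test])

# RepeatedKFold reshuffles the data before each round of k-fold CV
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
print(rkf.get_n_splits(X))  # 3 folds x 2 repeats = 6 splits in total
```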
- Leave-One-Out Cross-Validation (LOOCV) is generally used when data is very scarce. It is a special case of k-fold cross-validation where k equals the number of instances in the data. In each iteration, all the data except for a single observation are used for training, and the model is tested on that single observation. The accuracy estimate obtained using LOOCV is known to be almost unbiased, but it has high variance, which often leads to unreliable estimates.
```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])

leave_oo = LeaveOneOut()
leave_oo.get_n_splits(X)  # one split per observation

for train, test in leave_oo.split(X):
    print("TRAIN:", train, "TEST:", test)
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print(X_train, X_test, y_train, y_test)
```
And here is the output:
```
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
```
Major Applications of Cross-Validation:
Estimation of algorithmic performance: as we have seen, cross-validation can be used to benchmark a supervised learning algorithm, and it allows all of the data to be used in obtaining the estimate. Most commonly, one wishes to estimate the accuracy of a classifier in a supervised-learning setting. Here, a certain amount of labeled data is available, and one wishes to predict how well a classifier would perform if trained on that data and then asked to label unseen data. Using the common 10-fold cross-validation, one repeatedly uses 90% of the data to build a model and tests its accuracy on the remaining 10%.
The resulting average accuracy is likely a slight underestimate of the true accuracy obtained when the model is trained on all the data and tested on unseen data. Luckily, in most cases this estimate is quite reliable, particularly if the amount of labeled data is sufficiently large and the unseen data follows a distribution similar to that of the labeled data, so that our supervised learning algorithm (or, more simply, our function-approximation algorithm) can work.
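The 10-fold estimate described above is a one-liner with scikit-learn's `cross_val_score`. The breast-cancer data set and the Gaussian naive Bayes model here are assumptions made for the example.

```python
# A sketch of the 10-fold accuracy estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Each of the 10 rounds trains on 90% of the data and tests on the other 10%.
scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy")
print(scores.shape)   # 10 accuracy samples, one per fold
print(scores.mean())  # the averaged estimate of generalization accuracy
```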
Since cross-validation can benchmark an algorithm, we can also use it as a model selector: we can compare two or more different algorithms or models and choose the better-performing one based on our insights into the data. In generalized model selection, one has a large library of learning algorithms or classifiers to choose from and wishes to select the model that will perform best on a particular data-set. In either case, the basic unit of work is a pair-wise comparison of learning algorithms. For generalized model selection, combining the results of many pair-wise comparisons to obtain a single best algorithm may be difficult.
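A pair-wise comparison can be sketched by scoring two candidate algorithms on the same folds, so their per-fold scores are directly comparable. The two models and the iris data set below are illustrative assumptions.

```python
# A sketch of pair-wise model comparison via cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # identical folds for both models

scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores_lr.mean(), scores_dt.mean())
# The per-fold differences could feed a paired hypothesis test
# (e.g. scipy.stats.ttest_rel) to check whether one model is truly better.
print(scores_lr - scores_dt)
```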
There is no general rule of thumb for selecting a validator or a supervised learning algorithm. Based on the size of the available data-set and the type of problem being encountered, we have to make a wise choice of validation scheme.
You can check our Introductory Machine Learning Course here: