A Guide on Splitting Datasets With train_test_split

TL;DR – The train_test_split function splits a single dataset into two subsets: one for training and one for testing. The training subset is for building your model. The testing subset is for running the model on data it has not seen, so you can evaluate its performance.

1. What Sklearn and Model_selection are
2. What is train_test_split?
2.1. Parameters
2.2. The use of train_test_split
3. Why use the Sklearn train_test_split function?
3.1. Overfitting and underfitting
4. Train_test_split: useful tips

What Sklearn and Model_selection are

Before discussing train_test_split, you should know about Sklearn (short for Scikit-learn). It is a Python machine learning library that offers tools for data preprocessing, classification, clustering, and model selection.

model_selection is the Sklearn module that holds the tools for splitting data and comparing models, including train_test_split. Model selection itself means choosing a blueprint for analyzing data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making a prediction.

To do that, you need to train your model by using a specific dataset. Then, you test the model against another dataset.

If you have one dataset, you'll need to split it by using the Sklearn train_test_split function first.

What is train_test_split?

train_test_split is a function in Sklearn's model_selection module for splitting data arrays into two subsets: one for training and one for testing. With this function, you don't need to divide the dataset manually.

By default, Sklearn train_test_split assigns samples to the two subsets at random, so the result changes from run to run. However, you can pass a random_state to get the same partition every time.
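The effect of a fixed random state can be checked with a short sketch. The ten-element list and the seed value 42 are arbitrary choices made for illustration, not values from this guide.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))  # toy dataset, chosen only for illustration

# The same random_state gives the same partition on every call.
train_a, test_a = train_test_split(data, random_state=42)
train_b, test_b = train_test_split(data, random_state=42)
same_split = (train_a == train_b) and (test_a == test_b)

# Every sample lands in exactly one of the two subsets.
covers_all = sorted(train_a + test_a) == data

print("identical splits:", same_split)
print("all samples kept:", covers_all)
```

Leaving random_state out would usually give a different partition on each run, which is fine for exploration but makes results harder to reproduce.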

Parameters

Sklearn train_test_split has several parameters. A basic example of the syntax would look like this:

train_test_split(X, y, train_size=0.*, test_size=0.*, random_state=*)

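X, y. The first parameters are the arrays you want to split, typically the features (X) and the matching targets (y). They must all have the same length.
train_size. This parameter sets the size of the training dataset. There are three options: None, which is the default; an int, which gives the exact number of training samples; and a float between 0.0 and 1.0, which gives the proportion of the dataset to use for training.
test_size. This parameter specifies the size of the testing dataset and accepts the same three kinds of value. The default is None, in which case it is set to the complement of train_size. If train_size is also left at its default, test_size is set to 0.25.
random_state. This parameter controls the shuffling applied before the split. With the default, None, the split draws on NumPy's global random state (np.random) and differs from run to run. Alternatively, pass an integer to get the same split every time.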
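As a rough illustration of how the size parameters behave, the sketch below splits a 20-sample toy dataset twice: once with no sizes given and once with an integer test_size. The dataset and the seed are assumptions made for the example.

```python
from sklearn.model_selection import train_test_split

X = list(range(20))     # toy features, chosen only for illustration
y = [x * x for x in X]  # matching targets

# No sizes given: test_size falls back to 0.25, so 5 of the 20 samples are held out.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
default_sizes = (len(X_train), len(X_test))

# Integer sizes are absolute sample counts rather than proportions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=4, random_state=0)
int_sizes = (len(X_train), len(X_test))

# X and y are split with the same indices, so each pair stays aligned.
still_paired = all(b == a * a for a, b in zip(X_train, y_train))

print(default_sizes)  # (15, 5)
print(int_sizes)      # (16, 4)
print(still_paired)   # True
```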

The use of train_test_split

First, you need to have a dataset to split. You can start by making a list of numbers using range() like this:

X = list(range(15))
print(X)

Then, make a second list that holds the square of each number in X:

y = [x * x for x in X]
print(y)

Now, let's apply the train_test_split function. Here, we set the train size to 65% of the entire dataset. Sizes are passed as fractions, so 65% is written as 0.65.

import sklearn.model_selection as model_selection
# Keep 65% of the samples for training and hold out 35% for testing.
# A fixed random_state makes the split repeatable.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.65, test_size=0.35, random_state=101)
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

You can also set only test_size; train_size is then set to the remainder. Any integer works for random_state, for example 0:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Note: By default, the Sklearn train_test_split function shuffles the data before splitting it, so the original sequence of numbers is not preserved. After a split, they can be presented in a different order.
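If the original order matters, for example with time-ordered data, the shuffle parameter can be switched off. This parameter is not covered elsewhere in this guide, so treat the following as a minimal sketch rather than a recommendation.

```python
from sklearn.model_selection import train_test_split

X = list(range(15))

# Default behaviour: samples are shuffled before the split.
shuffled_train, shuffled_test = train_test_split(X, test_size=0.3, random_state=0)

# shuffle=False keeps the original order: the first samples go to training,
# the remaining ones to testing.
ordered_train, ordered_test = train_test_split(X, test_size=0.3, shuffle=False)

print("shuffled:", shuffled_train, shuffled_test)
print("ordered: ", ordered_train, ordered_test)
```

With shuffle=False, the training subset holds the numbers 0 through 9 and the testing subset holds 10 through 14.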

Why use the Sklearn train_test_split function?

Using the same dataset for both training and testing gives a misleading picture of performance, because the model is scored on data it has already seen. That increases the chances of inaccurate predictions on new data.

The train_test_split function lets you split a dataset with ease while you look for a suitable model. Also, keep in mind that your model should be neither overfitting nor underfitting.

Overfitting and underfitting

Overfitting is a situation in which a model shows almost perfect accuracy on the training data but is inaccurate on new data. It tends to happen when the model has an overly complex set of rules and ends up memorizing the training samples.

Underfitting is when a model doesn't fit even the training data because its set of rules is too simple. You can't rely on an underfitting model to make an accurate prediction.
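Comparing scores on the training and testing subsets is one way to spot both problems. The sketch below is only an illustration: the noisy quadratic data, the seed, and the choice of LinearRegression and DecisionTreeRegressor are assumptions made for the example, not part of this guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy quadratic curve.
rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Too simple: a straight line cannot follow a curve,
# so it scores poorly even on the training data (underfitting).
line = LinearRegression().fit(X_train, y_train)
line_train_r2 = line.score(X_train, y_train)

# Too complex: an unrestricted tree memorizes every training point,
# noise included, and does worse on unseen data (overfitting).
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
tree_train_r2 = tree.score(X_train, y_train)
tree_test_r2 = tree.score(X_test, y_test)

print("line, train R^2:", round(line_train_r2, 3))
print("tree, train R^2:", round(tree_train_r2, 3))
print("tree, test R^2: ", round(tree_test_r2, 3))
```

A low training score points to underfitting, while a large gap between the training and testing scores points to overfitting.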

Train_test_split: useful tips

Unless you set the random_state parameter, train_test_split produces a different random split on every run. Set it to an integer when you need reproducible results.
A commonly cited starting point is an 80:20 split between training and testing. You may need to adjust it depending on the size of the dataset and the complexity of the model.
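An 80:20 split can be requested by setting test_size to 0.2. The 100-sample dataset and the seed below are placeholders for illustration.

```python
from sklearn.model_selection import train_test_split

X = list(range(100))    # placeholder dataset with 100 samples
y = [x * x for x in X]

# test_size=0.2 holds out 20% of the samples; the other 80% are used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 80 20
```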
