How to use the train_test_split() method from sklearn in Python

Q: How do I use the train_test_split() method from sklearn in Python?

The simplest approach is a positional split: treat the first 80% of the rows as the training data and the remaining 20% as the testing data.
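A minimal sketch of that positional split, assuming df holds the four iris feature columns (150 rows). The names split_at, train_set and test_set are illustrative, not taken from the original article.

```python
from sklearn.datasets import load_iris

# Assumption: df is the iris feature table (150 rows, 4 columns).
df = load_iris(as_frame=True).data

# Position of the 80% cut-off, computed with integer arithmetic.
split_at = len(df) * 80 // 100

# First 80% of the rows for training, the rest for testing.
train_set = df.iloc[:split_at]
test_set = df.iloc[split_at:]

print(train_set.shape, test_set.shape)  # (120, 4) (30, 4)
```

Note that the iris rows are ordered by species, so a positional split like this leaves the test set with a single class. The randomized methods avoid that.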

Using the DataFrame.sample() method
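A hedged sketch of what a sample()-based split might look like, assuming df holds the four iris feature columns. Here frac=0.8 draws 80% of the rows at random, random_state makes the draw reproducible, and the rows that were not drawn become the test set.

```python
from sklearn.datasets import load_iris

# Assumption: df is the iris feature table (150 rows, 4 columns).
df = load_iris(as_frame=True).data

# Randomly draw 80% of the rows; random_state fixes the seed.
train_set = df.sample(frac=0.8, random_state=42)

# Everything that was not sampled goes to the test set.
test_set = df.drop(train_set.index)

print(train_set.shape, test_set.shape)  # (120, 4) (30, 4)
```

Dropping by index label relies on the index being unique, which holds for the default RangeIndex used here.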

Using the numpy.random.rand() method
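One way this could be done, sketched under the assumption that df holds the four iris feature columns: draw one uniform random number per row and send the row to the train set when the number is below 0.8. The seed call and the name mask are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: df is the iris feature table (150 rows, 4 columns).
df = load_iris(as_frame=True).data

# Seed fixed only so that the split is reproducible.
np.random.seed(42)

# One uniform number in [0, 1) per row; True marks a training row.
mask = np.random.rand(len(df)) < 0.8

train_set = df[mask]
test_set = df[~mask]

print(train_set.shape, test_set.shape)
```

Because each row is assigned independently, the split is only approximately 80/20, so the exact row counts depend on the random draw.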

In practice, one of the most common ways to split a DataFrame is the train_test_split() function. It shuffles the rows before splitting, and it can split several inputs at once, for example a feature matrix and a target vector, so that their rows stay aligned.

Python3

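from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
train_set, test_set = train_test_split(df, random_state=42, test_size=0.2)
print(train_set.shape, test_set.shape)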

Output:

(120, 4) (30, 4)

Here we use the train_test_split() function from the sklearn.model_selection module to split the DataFrame into train and test sets. We pass three arguments. The first is the DataFrame itself. The second, random_state, seeds the shuffling so that the same rows land in each set on every run, as explained for the previous method. The third, test_size, is the fraction of rows to place in the test set; since we want 20% of the data for testing, we pass test_size=0.2. The function returns the train set with 80% of the rows (120 of 150), followed by the test set with the remaining 20% (30 rows).
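train_test_split() can also split a feature matrix and a target vector in a single call, as mentioned earlier. A minimal sketch, assuming the iris features are used as X and the species labels as y; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: iris features as X (DataFrame) and species labels as y (Series).
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Both inputs are shuffled with the same permutation, so rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (120, 4) (30, 4) (120,) (30,)
```

The outputs come back in pairs, train then test, for each input in the order the inputs were passed.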

Pandas – Create Test and Train Samples from DataFrame

Machine learning and deep learning models are built from datasets, often large ones. When building a model, we need to split the dataset into train and test sets, because we train the model on the train set and then measure its performance on the test set, which it has not seen. In Python, these datasets are typically loaded as a Pandas DataFrame. In this article, we look at different ways to create train and test samples from a Pandas DataFrame. For demonstration, we use a toy dataset (the iris dataset) from the sklearn.datasets module and load it into a DataFrame. First, we import all the necessary libraries.
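A minimal sketch of that setup, assuming only the four iris feature columns go into the DataFrame (which matches the 4-column shapes printed in the split examples). The name df is the one used in those examples.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris data; iris.data is a (150, 4) NumPy array of features.
iris = load_iris()

# Wrap the features in a DataFrame with readable column names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.shape)  # (150, 4)
```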