Decision Tree Data Preparation


Data Preparation

From the player dataset, select the columns season, teamName, playerName, position, pass_accura, tackle_blocks, tackle_intercep, and fouls_committed. Then keep only the players whose position is defender and whose season is between 2020 and 2022. Rows in which pass_accura, tackle_blocks, tackle_intercep, and fouls_committed are all equal to 0 are dropped, mainly because these players were transferred by their clubs at the beginning of the season or have never played in La Liga. These steps generate a new dataset.
The Players Dataset

The New Dataset
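
The selection and filtering steps described above can be sketched with pandas as follows. This is only an illustration on made-up rows, not the project's actual code: the toy values, the extra goals column, the integer encoding of season, and the "Defender" spelling of the position value are all assumptions.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real player dataset.
players = pd.DataFrame({
    "season": [2019, 2020, 2021, 2022, 2021],
    "teamName": ["A", "B", "C", "D", "E"],
    "playerName": ["p1", "p2", "p3", "p4", "p5"],
    "position": ["Defender", "Defender", "Defender", "Midfielder", "Defender"],
    "pass_accura": [80, 75, 0, 90, 60],
    "tackle_blocks": [3, 2, 0, 1, 0],
    "tackle_intercep": [5, 4, 0, 2, 1],
    "fouls_committed": [7, 6, 0, 3, 2],
    "goals": [0, 1, 0, 4, 0],  # assumed extra column that is not selected
})

keep_cols = ["season", "teamName", "playerName", "position",
             "pass_accura", "tackle_blocks", "tackle_intercep", "fouls_committed"]
stat_cols = ["pass_accura", "tackle_blocks", "tackle_intercep", "fouls_committed"]

# Step 1: keep only the columns of interest.
selected = players[keep_cols]

# Step 2: keep defenders from the 2020 to 2022 seasons (both ends inclusive).
is_defender = selected["position"] == "Defender"  # assumed label spelling
in_seasons = selected["season"].between(2020, 2022)
defenders = selected[is_defender & in_seasons]

# Step 3: drop rows where all four statistics are 0.
all_zero = (defenders[stat_cols] == 0).all(axis=1)
player = defenders[~all_zero].reset_index(drop=True)

print(player)
```

Note that a row is removed only when every one of the four statistics is 0, so a player with a single zero statistic is kept.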

Because the new dataset does not have a label column, k-means clustering is applied to the numeric variables pass_accura, tackle_blocks, tackle_intercep, and fouls_committed. The k value used here is 3, giving three categories: good performance, normal performance, and bad performance. After k-means clustering, the label column is merged into the new dataframe, as shown below.

The New Dataset with Label (Performance) Column
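
A minimal sketch of the labeling step, assuming scikit-learn's KMeans is used. The toy rows, n_init=10, and random_state=42 are assumptions made for illustration. The cluster ids returned by k-means are arbitrary integers, so mapping them to good, normal, and bad performance is a separate manual step (for example, by inspecting the cluster centers) that is not shown here.

```python
import pandas as pd
from sklearn.cluster import KMeans

stat_cols = ["pass_accura", "tackle_blocks", "tackle_intercep", "fouls_committed"]

# Hypothetical toy rows: three loosely separated groups of two players each.
player = pd.DataFrame({
    "playerName": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "pass_accura": [90, 88, 70, 72, 40, 42],
    "tackle_blocks": [10, 9, 5, 6, 1, 2],
    "tackle_intercep": [12, 11, 6, 5, 2, 1],
    "fouls_committed": [5, 6, 15, 14, 30, 28],
})

# k = 3 clusters; n_init and random_state are assumed settings for repeatability.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(player[stat_cols])

# Attach the cluster id as the label column of the dataframe.
player["performance"] = cluster_ids

print(player)
print(kmeans.cluster_centers_)  # inspect the centers to decide which cluster is which
```

Because the four statistics are on different scales, standardizing them before clustering is a common option; the text does not say whether that was done here.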

Train Data and Test Data Splitting

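from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; random_state fixes the shuffle
playerTrainDF, playerTestDF = train_test_split(player, test_size=0.3, random_state=42)

# Save both splits and print them for inspection
playerTrainDF.to_csv("DTTrainDF.csv")
playerTestDF.to_csv("DTTestDF.csv")
print(playerTrainDF)
print(playerTestDF)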

As in the Multinomial NB section, the train_test_split function is used to obtain the train dataset and the test dataset. Here the test set is 30% of the new dataframe, and random_state=42 ensures that the train set and test set do not change each time the code is run. Creating a disjoint split is essential for a prediction model. If samples overlap between the training set and the test set, the model has already seen those test samples, together with their labels, during training and can simply remember them. The test score then overstates how well the model performs on unseen data. It is like a student who knows the exam answers in advance: the teacher cannot tell from the test score whether the student has really learned the content.
The code below checks whether the train set and test set are disjoint.

import pandas as pd

# Inner merge on all shared columns keeps only rows that appear in both sets
disjoint_check = pd.merge(playerTrainDF, playerTestDF, how='inner')

# Check that the train and test sets are disjoint
if not disjoint_check.empty:
    print("Train and test sets share rows")
else:
    print("Train and test sets are disjoint")

The Train Dataset

The Test Dataset
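
As a complementary check, the split can also be compared by index rather than by row values. The sketch below runs on a made-up ten-row frame and assumes the dataframe index is unique. Because train_test_split keeps the original index labels, an index label shared by both sets would mean the same row landed in both. It also shows that repeating the call with the same random_state reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the labeled player dataframe.
player = pd.DataFrame({
    "playerName": [f"p{i}" for i in range(10)],
    "pass_accura": [60 + i for i in range(10)],
    "performance": [i % 3 for i in range(10)],
})

train_df, test_df = train_test_split(player, test_size=0.3, random_state=42)

# Index labels present in both sets (expected to be none).
shared_index = train_df.index.intersection(test_df.index)
print("Rows in both sets:", len(shared_index))

# Every row should land in exactly one of the two sets.
print("All rows accounted for:", len(train_df) + len(test_df) == len(player))

# The same random_state reproduces the same split on a second call.
train_again, test_again = train_test_split(player, test_size=0.3, random_state=42)
print("Split is reproducible:", train_df.index.equals(train_again.index))
```

The value-based merge check and this index-based check answer slightly different questions: the merge flags rows with identical values even if they are different records, while the index comparison flags only the same original row appearing twice.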

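After that, the train set and test set are prepared further with the following code for use with the decision tree algorithm: the performance label is separated from the features, and the season, teamName, playerName, and position columns are dropped.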


# Separate the label column from the test set
playerTestLabel = playerTestDF['performance']
playerTestDF = playerTestDF.drop(['performance'], axis=1)

# Separate the label column from the train set
playerTrainLabel = playerTrainDF['performance']
playerTrainDF_nolabel = playerTrainDF.drop(['performance'], axis=1)

# Keep only the quantitative feature columns
dropcols = ['season', 'teamName', 'playerName', 'position']
playerTrainDF_nolabel_quant = playerTrainDF_nolabel.drop(dropcols, axis=1)
playerTestDF_quant = playerTestDF.drop(dropcols, axis=1)

Resource

Decision Tree Data Preparation Code: Python Code

