Data Gathering, Cleaning, and Preparation


Gathering Data

Most of the datasets already available online are not suitable for this project: they cover either a single season or only go up to 2020. To get the latest seasons, from 2020 to 2023, the data is gathered through an API and assembled into new datasets. The API used is api-football, available on the RapidAPI website. It provides rich and accurate data on La Liga players and clubs. The gathered data is split into two datasets: one for players' data and one for clubs' data.

API Sample Code

  1. The example code below uses the API to get data on FC Barcelona's players in La Liga for the 2020 season. The same request, with different team and season values, is used to get all players' data for La Liga from 2020 to 2023.

    import requests

    # Endpoint for player data
    url = "https://api-football-v1.p.rapidapi.com/v3/players"

    # Query for FC Barcelona's players in the 2020 season:
    #   team 529   = FC Barcelona
    #   league 140 = La Liga
    #   season     = 2020
    query = {"team": "529", "league": "140", "season": "2020"}

    API_KEY = "YOUR_RAPIDAPI_KEY"  # replace with your private API key
    headers = {
        "X-RapidAPI-Key": API_KEY,
        "X-RapidAPI-Host": "api-football-v1.p.rapidapi.com"
    }

    response = requests.get(url, headers=headers, params=query)
    print(response.text)  # raw JSON returned by the API

    Running this code returns the data of FC Barcelona's players.
    Players Data From FC Barcelona (JSON)

  2. The example code below uses the API to get FC Barcelona's club statistics in La Liga (including wins, losses, draws, goals for, goals against, etc.) for the 2020 season. The same request is repeated for each season from 2020 to 2023.

    import requests

    # Endpoint for club (team) statistics
    url = "https://api-football-v1.p.rapidapi.com/v3/teams/statistics"

    # Query for FC Barcelona's statistics in the 2020 season:
    #   league 140 = La Liga
    #   season     = 2020
    #   team 529   = FC Barcelona
    query = {"league": "140", "season": "2020", "team": "529"}

    API_KEY = "YOUR_RAPIDAPI_KEY"  # replace with your private API key
    headers = {
        "X-RapidAPI-Key": API_KEY,
        "X-RapidAPI-Host": "api-football-v1.p.rapidapi.com"
    }

    response = requests.get(url, headers=headers, params=query)
    print(response.text)  # raw JSON returned by the API

    Running this sample code returns FC Barcelona's data for 2020.
    Data of FC Barcelona in 2020 (JSON)
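
The two sample requests above each cover one team and one season. A minimal sketch of how the same request could be repeated over several seasons is shown here. It assumes that "2020-2023" corresponds to the season values 2020, 2021, and 2022, and the names `build_queries`, `gather`, and `fake_fetch` are hypothetical helpers, not part of the API. The network call is passed in as a function so that the loop can be tried without an API key.

```python
from typing import Callable, Dict, List

PLAYERS_URL = "https://api-football-v1.p.rapidapi.com/v3/players"
LA_LIGA_ID = "140"  # league id used in the sample code above


def build_queries(team_ids: List[int], seasons: List[int]) -> List[Dict[str, str]]:
    """Build one query dict per (season, team) pair."""
    return [
        {"team": str(team), "league": LA_LIGA_ID, "season": str(season)}
        for season in seasons
        for team in team_ids
    ]


def gather(queries: List[Dict[str, str]],
           fetch: Callable[[str, Dict[str, str]], dict]) -> List[dict]:
    """Call fetch(url, query) for every query and collect the payloads."""
    return [{"query": q, "payload": fetch(PLAYERS_URL, q)} for q in queries]


def fake_fetch(url: str, params: Dict[str, str]) -> dict:
    """Stand-in for a real HTTP call; returns an empty, made-up payload."""
    return {"parameters": params, "response": []}


# Assumption: seasons 2020, 2021, 2022 cover "2020-2023"; 529 is FC Barcelona.
queries = build_queries(team_ids=[529], seasons=[2020, 2021, 2022])
raw = gather(queries, fake_fetch)
print(len(raw), "responses collected")
```

To use it against the real API, `fake_fetch` could be replaced by a function that wraps `requests.get(url, headers=headers, params=params).json()` with the headers from the sample code.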

Getting Raw Data

The raw data is collected from these two API endpoints.

Raw Dataset of Players

The players' raw data contains a number of missing values, NaN values, and incorrect values. Therefore, the next step is to pick a suitable way to clean or replace these values and produce a clean players dataset.

Raw Dataset of Clubs

The next step also checks the data types of these two data frames and changes them where needed, which makes the analysis easier and the results more reliable.
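
As a rough illustration of this kind of check, the sketch below counts missing values per column and lists the column data types with pandas. The small data frame is made up for the example; only the column names `height` and `game_assist` come from this project, and the string format of `height` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample rows standing in for the raw players dataset.
raw_players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "height": ["170 cm", None, "181 cm"],   # assumed to arrive as strings
    "game_assist": [2.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = raw_players.isna().sum()
print(missing_per_column)

# Data type of each column, to spot strings that should be numbers.
print(raw_players.dtypes)
```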

Full version of the code for gathering data: API Code
Full Raw Dataset of Players: Raw Dataset of Players
Full Raw Dataset of Clubs: Raw Dataset of Clubs


Cleaning Data & Visualization

Cleaning

The first data cleaning step is to remove attributes that are not useful. Count and box plots are used to check whether the values in a column are all the same or all blank. The picture below shows that the variables injured, number, captain, game_minutes, penalty_won, and dribble_past meet this condition, so they are dropped. The teamId and birthDate variables are also not useful and are removed as well. For the clubs dataset, only the teamID variable is dropped. Removing these attributes does not affect future tasks. The clubs dataset is otherwise almost perfect: it contains no NaN values, missing values, or incorrect values, so it is clean and ready to use.

Check Attribute
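
The same check can also be done without plots. The sketch below is one possible way, not necessarily the one used in the full cleaning code: it flags columns whose non-missing values are all identical (or that are entirely blank) and drops them together with a manually chosen column. The data frame is made up; `goal_total` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up sample rows standing in for the raw players dataset.
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "injured": [False, False, False],        # same value everywhere
    "number": [np.nan, np.nan, np.nan],      # entirely blank
    "goal_total": [3, 0, 7],                 # hypothetical column name
    "teamId": [529, 529, 541],
})

# Columns with at most one distinct non-missing value carry no information.
uninformative = [
    col for col in players.columns
    if players[col].nunique(dropna=True) <= 1
]

# Columns judged not useful for the analysis, dropped by hand.
manual_drop = ["teamId"]

clean_players = players.drop(columns=uninformative + manual_drop)
print(uninformative)
print(list(clean_players.columns))
```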

The next step is to find and replace the NaN values in the players dataset. First, the players' height and weight contain only a small number of NaN values, so these are replaced with the mean height and mean weight. Other variables, such as game_assist, goal_saved, and game_appearc, contain a lot of NaN values. In these columns NaN means 0: the player recorded nothing for that statistic. Therefore, replacing NaN with 0 for these attributes is a good choice.
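
A minimal sketch of these two replacement rules with pandas follows. The rows are made up, and for simplicity the sketch assumes height and weight are already numeric at this point.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; height and weight are assumed to be numeric here.
players = pd.DataFrame({
    "height": [170.0, np.nan, 180.0],
    "weight": [70.0, 80.0, np.nan],
    "game_assist": [3.0, np.nan, 1.0],
    "goal_saved": [np.nan, np.nan, 12.0],
})

# Few NaN values: fill with the column mean.
for col in ["height", "weight"]:
    players[col] = players[col].fillna(players[col].mean())

# Many NaN values that mean "nothing recorded": fill with 0.
stat_cols = ["game_assist", "goal_saved"]
players[stat_cols] = players[stat_cols].fillna(0)

print(players)
```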

Last, some data types are changed: height and weight are converted from string to integer, and position is converted from string to category.
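
The type changes could look roughly like the sketch below. It assumes the height and weight strings carry a unit suffix such as "170 cm" and "72 kg"; if the raw strings hold plain digits, the extraction step is harmless. The rows are made up.

```python
import pandas as pd

# Made-up sample rows; the "cm"/"kg" suffixes are an assumption.
players = pd.DataFrame({
    "height": ["170 cm", "185 cm", "178 cm"],
    "weight": ["72 kg", "80 kg", "75 kg"],
    "position": ["Attacker", "Goalkeeper", "Attacker"],
})

# Keep only the digits, then convert string -> integer.
for col in ["height", "weight"]:
    players[col] = players[col].str.extract(r"(\d+)", expand=False).astype(int)

# A small fixed set of labels is a natural fit for the category type.
players["position"] = players["position"].astype("category")

print(players.dtypes)
```

Any remaining NaN values in height or weight need to be filled before the integer conversion, since `astype(int)` cannot represent missing values.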

Clean Dataset of Players

Clean Dataset of Clubs

Full version of the code for cleaning data: Cleaning Data Code
Full Clean Dataset of Players: Clean Dataset of Players
Full Clean Dataset of Clubs: Clean Dataset of Clubs


Visualization

Number of Players by Nationality in La Liga
The plot shows that, besides Spanish players, Argentine, Brazilian, French, and Portuguese players are the most common in La Liga.

Number of Red Cards for Each Club 2020-2023
This plot shows the total number of red cards received by each club in La Liga from 2020 to 2023. A club that receives more red cards may also have poorer results, while a team that receives none may have a less aggressive style of play.
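
A total like this can be computed with a group-by before plotting. The sketch below uses made-up clubs and counts, and the column names `club`, `season`, and `red_cards` are hypothetical; the real clubs dataset may be laid out differently.

```python
import pandas as pd

# Made-up per-season rows; not real La Liga figures.
cards = pd.DataFrame({
    "club": ["Club A", "Club B", "Club A", "Club C", "Club B", "Club A"],
    "season": [2020, 2020, 2021, 2021, 2022, 2022],
    "red_cards": [3, 1, 2, 0, 4, 1],
})

# Sum the red cards over all seasons for each club, highest first.
red_by_club = (
    cards.groupby("club")["red_cards"]
    .sum()
    .sort_values(ascending=False)
)
print(red_by_club)
```

With matplotlib installed, `red_by_club.plot(kind="bar")` would be one way to turn the result into a bar chart.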

Number of Yellow Cards for Each Club 2020-2023
This plot shows the total number of yellow cards received by each club in La Liga from 2020 to 2023. A high number of yellow cards may indicate a more aggressive style of play. Getafe is an example: the team's style of play is very unappealing, and it gets good results through a lot of fouls.

Player Age and Total Goal Relationship
Players aged 25 to 30 may be in their golden age, as they tend to score more goals.
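
One way to check this kind of pattern numerically is to group players into age bands and compare average goals. The sketch below does this with `pd.cut`; the ages and goal counts are made up for illustration, and `goal_total` is a hypothetical column name.

```python
import pandas as pd

# Made-up players; not real data.
players = pd.DataFrame({
    "age": [22, 24, 26, 28, 29, 33],
    "goal_total": [3, 5, 10, 12, 8, 4],
})

# Bands are right-inclusive: (0, 24], (24, 30], (30, 100].
players["age_band"] = pd.cut(
    players["age"],
    bins=[0, 24, 30, 100],
    labels=["under 25", "25-30", "over 30"],
)

# Average goals per player in each age band.
goals_by_band = players.groupby("age_band", observed=True)["goal_total"].mean()
print(goals_by_band)
```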

Number of Goals Saved for Each Club 2020-2023
A team with more saves may have an inconsistent defense, since its goalkeeper faces more shots, and may perform poorly as a result.

Total Wins For Clubs 2020-2023
The top three teams in La Liga (Barcelona, Real Madrid, and Atletico Madrid) have won more games than any other team, which shows that they are stronger than the rest of the league.

Box Plot for Total Scored For Clubs From 2020 to 2023
The total scored at home by each team in La Liga is generally between 50 and 110 from 2020 to 2023.

Box Plot for Total Losses For Clubs From 2020 to 2023
The total losses at home by each team in La Liga are generally between 50 and 70 from 2020 to 2023.

Total Losses For Clubs From 2020 To 2023
The line plot shows that Real Madrid plays the most consistently, because they have fewer losses every season.

Full version of the code for data visualization: Data Visualization Code


Resource

Background & Cover Image: https://images.app.goo.gl/rrFyz3J2dLG91KgBA
API for Gathering Data: https://rapidapi.com/api-sports/api/api-football/

