Exploratory data analysis (EDA) is the first step towards solving any data science or machine learning problem. It is the critical process of performing initial investigations on the available data and getting familiar with it: examining the dataset thoroughly to find trends, patterns, and relationships between features with the help of graphs and plots, using libraries like Matplotlib and Seaborn. We will also use the Pandas library, which makes importing, analyzing, and visualizing data much easier. In this section, we will use the Titanic dataset, a popular introductory dataset, to learn the exploratory data analysis process step by step. The goal is to put you in a position to continue with your own ideas and, by the end of this course, to find and explore a dataset on a subject of your own interest.

Before we begin to solve the problem, we need to make sure we understand the problem statement very well.

Problem definition:

The sinking of the Titanic resulted in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others; apparently there were patterns among the people who died and the people who survived. We are given data with specific characteristics of each passenger, and this data is labeled, meaning we know whether each passenger lived or died. We have also been given a test dataset with more Titanic passengers and their characteristics, but this dataset is not labeled, so we don't know who lived and who died.

We need to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.). To predict which passengers were more likely to survive, we will train a couple of algorithms on the labeled dataset, and once we decide which one performs best, we will use it to predict which passengers in the unlabeled dataset survived.

For this specific section, we will focus on the Titanic exploratory data analysis only.

-If you want to read the complete problem statement and data description, it can be found here:

https://www.kaggle.com/competitions/titanic/

-Please download the data directly from the following link:

https://github.com/4GeeksAcademy/machine-learning-content/tree/master/05-3d-data/assets

Our next step is to read in the data and do some preliminary exploration. This will help us figure out how we want to approach creating groups and finding patterns. To do that, we need to import some necessary libraries (for this example). If any of them is missing, make sure to install it.

In [2]:

```
#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

The data is stored as a comma-separated values (CSV) file, where each row is separated by a new line and each column by a comma (,). To read in the data, we'll use the pandas.read_csv function, which takes a CSV file and returns a DataFrame.

In [3]:

```
#Reading the train and test data and assign to a variable
train_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/titanic_train.csv')
test_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/titanic_test.csv')
```

After reading the dataframes, we will analyze their shape, size, and the kinds of data available, for both the train and test datasets. It is also important to write down observations at the end of each step. In this initial data exploration we will use:

-data.head() returns the first 5 rows of the dataframe

-data.shape displays the number of rows and number of columns of the dataframe

-data.info() prints a concise summary with the index dtype, column dtypes, non-null values and memory usage.

In [4]:

```
#Let's see how many rows and columns my train_data has.
train_data.shape
```

Out[4]:

In [5]:

```
#Looking at the first rows of our train_data. To see more than 5 rows, just pass the number of rows as a parameter.
train_data.head()
```

Out[5]:

In [5]:

```
#Let's look at some information about data types and null values.
train_data.info()
```

Observations:

-We can see that our train_data has 891 rows and 12 columns.

-Our data has 7 numerical features and 5 categorical features.

-Feature 'Age' has 714 non-null values from a total of 891 rows, which means that our 'Age' column has 177 null values. The same happens with our 'Cabin' feature having 687 null values, and 'Embarked' feature with 2 null values.

Now, we will do the same analysis for our test_data:

In [6]:

```
#Let's see how many rows and columns my test_data has.
test_data.shape
```

Out[6]:

In [7]:

```
#Let's look at the first 3 rows of our test_data
test_data.head(3)
```

Out[7]:

In [8]:

```
#Let's see the data types and which features have null values in our test_data
test_data.info()
```

Observations:

-We can see that our test_data has 418 rows and 11 columns. There is one column fewer than in our train_data because this dataset is not labeled: it lacks the column that indicates whether the passenger died or survived.

-Our test_data has the same data types as our train_data for each feature.

-This time the 'Age' feature has 86 null values, and our 'Cabin' feature has 327 null values from the 418 total number of rows.

Now we need to find some insights from the dataset to see if there is any kind of hidden pattern or relationship between columns. We will start with the 'Survived' column which seems to be our target variable as it is not given to us in the test dataset.

**Target variable**

In [16]:

```
#Let's first visualize the distribution of our target variable.
sns.countplot(x=train_data['Survived'])
plt.title('Distribution of passenger survival status')
plt.show()
```

In [10]:

```
train_data['Survived'].value_counts()
```

Out[10]:

Observations: As our target variable classifies passengers as 1 or 0 (whether they survived or not), we used a countplot to see whether the data is balanced. We also used the value_counts() method to see exactly how many people survived (1) and how many did not survive (0) in our train_data. The classes are moderately imbalanced, which is consistent with the fact that the sinking of the Titanic resulted in the death of most of its passengers.

**Using histograms to visualize all features**

In [17]:

```
train_data.hist(bins=10,figsize=(9,7),grid=False);
```

**Countplot for categorical variables**

In [7]:

```
#Let's check the categories in each of our object type features
def countplot_features(feature):
    plot = sns.countplot(x=feature, data=train_data)
    plt.show()

def countplot_targetvsfeature(feature, y):
    fig = plt.figure(figsize=(15,10))
    plot = sns.countplot(x=feature, data=train_data, hue=y)
    plt.show()
```

In [12]:

```
countplot_features('Sex')
```

In [14]:

```
countplot_targetvsfeature('Sex','Survived')
```

Observations:

Most of the passengers in our data were male, and most of those men did not survive. On the other hand, even though there were fewer female passengers, most of them survived.

In [13]:

```
countplot_features('Embarked')
```

In [8]:

```
countplot_targetvsfeature('Embarked','Survived')
```

Observations:

Most of our Titanic passengers embarked at Southampton.

In [14]:

```
countplot_features('Pclass')
```

In [9]:

```
countplot_targetvsfeature('Pclass','Survived')
```

Observations: Most of the passengers were travelling in third class, and most of them did not survive. In first class, however, most of the passengers survived.

**Distribution Plots for Continuous variables**

In [4]:

```
#Let's plot the distribution (an estimate of the Probability Density Function, PDF) of the Age of the 891 passengers.
#Note: sns.distplot is deprecated in recent Seaborn versions; histplot(..., kde=True) is the replacement.
sns.histplot(train_data['Age'], kde=True)
```

Out[4]:

In [16]:

```
#View if there is a linear relation between continuous numerical variable Age & target variable Survived.
sns.regplot(x = "Age", y = "Survived", data = train_data)
plt.ylim(0,)
```

Out[16]:

Observations:

There is a clear negative linear relation between Age and our target variable. This makes sense considering that children were among the groups given priority for the lifeboats (Survived = 1).

In [8]:

```
#Let's plot the distribution (PDF) of the Fare paid by the 891 passengers.
sns.histplot(train_data['Fare'], kde=True)
```

Out[8]:

Observations: From the plotted distribution of Fare we can see that the majority of points lie between 0 and 100.

In [3]:

```
# View if there is a linear relation between continuous numerical variable Fare & target variable Survived.
sns.regplot(x = "Fare", y = "Survived", data = train_data)
plt.ylim(0,)
```

Out[3]:

Observations:

Yes, there is a positive linear relation between the 'Fare' and 'Survived' features, which means that people who paid a more expensive fare had a higher probability of surviving (Survived = 1).

Duplicates are entries that represent the same sample point multiple times, for example if a measurement or record was registered twice by two different people. Detecting duplicates is not always easy: each dataset usually has a unique identifier (e.g. an index number or an ID that is unique to each new sample), and if we are not yet sure which column identifies each unique sample, we might want to set that question aside at first. Once we know how many duplicates our dataset has, we can simply drop them with drop_duplicates().

In the case of our dataset, it is not difficult to find that unique identifier column, because its name is very clear: PassengerId.

In [3]:

```
train_duplicates = train_data['PassengerId'].duplicated().sum()
print(f'It seems that there are {train_duplicates} duplicated passengers according to the PassengerId feature')
```

In [4]:

```
test_duplicates = test_data['PassengerId'].duplicated().sum()
print(f'It seems that there are {test_duplicates} duplicated passengers according to the PassengerId feature')
```

The following columns will not be useful for prediction, so we will eliminate them from both the train and test datasets.

In [5]:

```
#Drop irrelevant columns in train data
drop_cols = ['PassengerId','Cabin', 'Ticket', 'Name']
train_data.drop(drop_cols, axis = 1, inplace = True)
```

In [6]:

```
#Drop same irrelevant columns in test data
test_data.drop(drop_cols, axis = 1, inplace = True)
```

**Pandas drop_duplicates() Function Syntax:**

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

**Pandas drop_duplicates() Function Parameters:**

subset: a column label or list of column labels to use for identifying duplicate rows. By default, all columns are used to find the duplicate rows.

keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', all duplicate rows except the first one are deleted. If 'last', all duplicate rows except the last one are deleted. If False, all duplicate rows are deleted.

inplace: if True, the source DataFrame itself is changed. By default, the source DataFrame remains unchanged and a new DataFrame instance is returned.
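As a quick illustration of these parameters, here is a minimal sketch on a toy DataFrame (the data is invented for the example, not taken from the Titanic set):

```python
import pandas as pd

# Toy DataFrame with one fully repeated row
df = pd.DataFrame({'id': [1, 2, 2, 3],
                   'name': ['Ana', 'Ben', 'Ben', 'Cy']})

# Default behavior: all columns are compared, the first occurrence is kept
deduped = df.drop_duplicates()
print(deduped.shape)  # (3, 2)

# subset: compare only the 'id' column; keep='last' keeps the last occurrence
deduped_by_id = df.drop_duplicates(subset=['id'], keep='last')
print(deduped_by_id['id'].tolist())  # [1, 2, 3]
```

Because inplace defaults to False, both calls above return new DataFrames and leave `df` untouched.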

Correlations between variables can be found using the pandas .corr() function and visualized as a matrix with a Seaborn heatmap. The following heatmap shows some strong and weak correlations between variables. With this colormap, darker shades represent negative correlation while lighter shades represent positive correlation.

In [8]:

```
#Plotting a heatmap to find relations between features.
#numeric_only=True skips the remaining text columns (required in pandas >= 2.0)
plt.figure(figsize=(12, 8))
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap='viridis')
```

Out[8]:

Observations:

Here we can infer that there is a strong negative relation between Fare and Pclass. This is entirely understandable: a ticket in 3rd class (Pclass = 3) certainly costs less than a ticket in 1st class (Pclass = 1).

There is also a negative relation between the passenger class (Pclass) and the age of the passenger, meaning that 3rd class (Pclass = 3) had younger passengers than 1st class (Pclass = 1).

Also, we can see that Pclass is strongly related to the target variable 'Survived': the better the passenger class, the higher the probability of survival. We can examine the Pclass/Fare relationship further with the following graph.

In [9]:

```
#Checking correlation between Pclass and Fare:
plt.figure(figsize = (8, 4))
sns.boxplot(y = train_data.Pclass, x = train_data.Fare, orient = 'h', showfliers = False, palette = 'gist_heat')
plt.ylabel('Passenger Class')
plt.yticks([0,1,2], ['First Class','Second Class', 'Third Class'])
plt.show()
```

The parameter showfliers=False hides the outliers. If we omit that parameter, boxplots are a good way to view outliers.

There are different ways of visualizing relationships:

In [10]:

```
#Using seaborn.pairplot for a grid visualization of every relationship
sns.pairplot(data=train_data)
```

Out[10]:

In [20]:

```
#Correlation of features with target
train_data.corr(numeric_only=True)["Survived"]
```

Out[20]:

In [11]:

```
#Using transpose
train_data_corr = train_data.corr(numeric_only=True).transpose()
train_data_corr
```

Out[11]:

In [12]:

```
# A styled version of the correlation-matrix heatmap
background_color = "#97CADB"
fig = plt.figure(figsize=(10,10))
gs = fig.add_gridspec(1,1)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
# train_data_corr = train_data[['Age', 'Fare', 'SibSp', 'Parch', 'Pclass','Survived']].corr().transpose()
mask = np.triu(np.ones_like(train_data_corr))
ax0.text(2,-0.1,"Correlation Matrix",fontsize=22, fontweight='bold', fontfamily='cursive', color="#000000")
sns.heatmap(train_data_corr,mask=mask,fmt=".1f",annot=True)
plt.show()
```

**End of Day 1!**

Now, let's do a lot of exploratory data analysis on this module's project!

To read about what exactly are features and why should we do feature engineering, click on the following link: https://github.com/4GeeksAcademy/machine-learning-content/blob/master/05-3d-data/feature-engineering.ipynb

The first process we will learn in our Titanic feature engineering will be how to find and deal with extreme values (outliers).

**FINDING OUTLIERS**

In statistics, an outlier is an observation point that is distant from other observations. In our data, it means a feature has some extreme values which we need to analyze further. Those extreme values may be typing errors, or they may be genuine extreme values that are considered normal in the population we are studying. If our outliers are typing errors, we need to decide whether to eliminate them or replace them with another value. If a feature's outliers are considered normal and part of the population, it may be better to keep them, because they give important information to our model.

How important we consider the feature to be for our model will influence our decision about what to do with its outliers.

The Pandas describe() method is used to view basic statistical details such as percentiles, mean, and std of a DataFrame or a series of numeric values. To see the object-type features with describe(), call it as dataframe.describe(include='O'); it will then show the most frequent value of each column and how many times it appears.

Syntax:

DataFrame.describe(percentiles=None, include=None, exclude=None)

Parameters:

percentiles: list-like of numbers between 0 and 1, the percentiles to include in the output

include: list of data types to be included while describing the dataframe. Default is None

exclude: list of data types to be excluded while describing the dataframe. Default is None

Return type: Statistical summary of data frame.

In [15]:

```
#Let's use the describe method to see the statistics on our numerical features
train_data.describe()
```

Out[15]:

Note the count row: not every remaining column has a value in all 891 records; 'Age', for example, still has missing values, which we will deal with later.

In [18]:

```
#Now, let's modify its parameters to be able to see some statistics on our categorical features.
train_data.describe(include=['O'])
```

Out[18]:

**WHY IS THIS USEFUL TO FIND OUTLIERS?**

For the numerical features, we can look at the min and max values of a specific feature and compare them to its 25% and 75% percentiles. We can also compare the mean to the 50% percentile (the median) and check whether any extreme high or low value is pulling the mean far above or below the median.

Once we suspect there are outliers, we can use a boxplot for that feature to have a better visualization of outliers.

Observations: According to our statistics dataframe, everything seems normal except for the 'Fare' column, which has a mean of 32.20 while its 50% percentile is 14 and its max value is 512. We could say 512 seems to be an outlier, but it could also be a typing error, or the most expensive ticket may really have had that price. It would be useful to do some research to confirm that information.

Let's see how to write the code for a boxplot in order to visualize outliers. A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables. The box shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution.

In [19]:

```
#Let's evaluate our 'Fare' variable.
plt.figure(figsize=(6,6))
sns.boxplot(data=train_data['Fare'])
plt.title('Looking for outliers in Fare feature')
plt.ylabel('Fare')
```

Out[19]:

Observations:

-It looks like a ticket fare of 512 is not very common. We should establish upper and lower bounds to determine whether a data point should be considered an outlier or not. There are a couple of ways to determine these bounds, and we will learn about them in the data cleaning process, when dealing with outliers.
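One common rule of thumb, besides the IQR-based bounds applied below, is to flag points that lie far from the mean in units of the standard deviation. A minimal sketch on invented numbers (not the real Fare column, and the 2-standard-deviation threshold is an arbitrary choice for illustration):

```python
import numpy as np

# Invented fare sample with one extreme value
fares = np.array([7.25, 8.05, 13.0, 26.55, 51.86, 512.33])

mean, std = fares.mean(), fares.std()
# Flag points more than 2 standard deviations away from the mean
outliers = fares[np.abs(fares - mean) > 2 * std]
print(outliers)  # [512.33]
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason the IQR method (based on quartiles, which are robust to outliers) is often preferred.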

**HOW TO DEAL WITH OUTLIERS**

To learn about the types of outliers and different methods to deal with them, read the information from the following link:

We will apply one of those methods by defining upper and lower bounds. Let's see how it is implemented:

In [21]:

```
fare_stat = train_data['Fare'].describe()
print(fare_stat)
```

In [22]:

```
IQR = fare_stat['75%']-fare_stat['25%']
upper = fare_stat['75%'] + 1.5*IQR
lower = fare_stat['25%'] - 1.5*IQR
print('The upper & lower bounds for suspected outliers are {} and {}.'.format(upper,lower))
```

Based on these results, we would drop Fare values above 65. However, our own judgment is very important here, and based on the prices we saw in the boxplot, the most extreme values are above 300. Let's see how many rows have that extreme value of 512 and drop them.

In [23]:

```
#visualizing data with fare above 300
train_data[train_data['Fare'] > 300]
```

Out[23]:

Observations: The three individuals who paid a fare of 512.3292 did survive. Should we drop them? Or can they bring valuable information to our model?

We'll learn how to drop rows with values above a certain threshold. But you are welcome to investigate more about Titanic fares and decide whether to keep them or not.

In [7]:

```
#Dropping data with fare above 300
train_data.drop(train_data[(train_data['Fare'] > 300)].index, inplace=True)
```

In [14]:

```
#Confirm there are three rows less.
train_data.shape
```

Out[14]:

We confirm that we have eliminated those 3 outliers!

**FINDING MISSING OR NULL VALUES**

Most machine learning algorithms are not able to handle missing values. Having some missing values is normal, and we should decide whether to eliminate them or replace them with other values. What we want to identify at this stage are big holes in the dataset: features with a lot of missing values.

We begin by separating our features into numerical and categorical columns. We do this because the method to handle missing values, later, will be different for these two data types.

In [15]:

```
# Separate numerical and categorical variables.
num_vars = train_data.columns[train_data.dtypes != 'object']
cat_vars = train_data.columns[train_data.dtypes == 'object']
print("Numerical variables:", num_vars)
print("Categorical variables:", cat_vars)
```

We will use the pandas isnull() function to find all the fields that have missing values. It returns True where a field has a missing value and False where it does not. To get how many missing values are in each column, we use sum() along with isnull(); this sums up all the True values in each column.

In [12]:

```
train_data[num_vars].isnull().sum()
```

Out[12]:

In [13]:

```
train_data[cat_vars].isnull().sum()
```

Out[13]:

Now, sort_values() will sort the missing-value counts. It is good practice to sort them in descending order (ascending=False) so we see the columns with the highest number of missing values first.

In [29]:

```
train_data[num_vars].isnull().sum().sort_values(ascending=False)
```

Out[29]:

In [30]:

```
train_data[cat_vars].isnull().sum().sort_values(ascending=False)
```

Out[30]:

Finally, we can divide that result by the length of our dataframe (the number of rows) to get the percentage of missing values in each column. Missing values are usually represented as NaN, null, or None in the dataset.

In [31]:

```
train_data[num_vars].isnull().sum().sort_values(ascending=False)/len(train_data)
```

Out[31]:

In [32]:

```
train_data[cat_vars].isnull().sum().sort_values(ascending=False)/len(train_data)
```

Out[32]:

In [16]:

```
# How many null values should I deal with in the test data?
test_data.isnull().sum()
```

Out[16]:

**HOW TO DEAL WITH MISSING VALUES**

To learn about the techniques on how to deal with missing values, read the information from the following link:

In [8]:

```
# Handling Missing Values in train_data
## Fill missing AGE with the median (assigning back avoids pandas' chained-assignment warnings)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
## Fill missing EMBARKED with the mode
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
```

The '[0]' notation takes the first element of a collection (a list, an array, a Series, ...). mode() returns a Series rather than a single value, because a column can have more than one modal value, so train_data['Embarked'].mode()[0] gives us the most frequent value of 'Embarked'.

Feel free to use scikit-learn instead.
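For reference, scikit-learn's SimpleImputer can perform the same median and most-frequent imputations. A minimal sketch on a toy DataFrame (the values are invented for the example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values in a numerical and a categorical column
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                   'Embarked': ['S', 'C', np.nan, 'S']})

# Median strategy for the numerical column, most frequent value for the categorical one
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['Age']] = num_imputer.fit_transform(df[['Age']])
df[['Embarked']] = cat_imputer.fit_transform(df[['Embarked']])
print(df)  # missing ages become 28.5 (the median), missing ports become 'S' (the mode)
```

An advantage of the imputer objects is that, once fitted on the train data, they can apply exactly the same fill values to the test data with transform().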

Let's verify there were no missing values left:

In [16]:

```
train_data.isnull().sum()
```

Out[16]:

Now let's also handle the missing values in our test data:

In [9]:

```
# Handling Missing Values in test data
## Fill missing AGE and FARE with their medians
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())
```

In [18]:

```
test_data.isnull().sum()
```

Out[18]:

As part of feature engineering, and before encoding our categorical variables, we will learn how to create new features based on existing ones. Let's look at our dataset so far by taking a look at the first 10 rows.

In [ ]:

```
train_data.head(10)
```

In [10]:

```
# We will create a new column to show how many family members of each passenger were in the Titanic.
# We will calculate it based on the sum of SibSp (siblings and spouse) and Parch (parents and children)
print(train_data)
train_data["fam_mbrs"] = train_data["SibSp"] + train_data["Parch"]
print(train_data)
```

In [11]:

```
#Repeat process in test data
test_data["fam_mbrs"] = test_data["SibSp"] + test_data["Parch"]
```

Feature encoding is the process of turning categorical data in a dataset into numerical data. It is essential that we perform feature encoding because most machine learning models can only interpret numerical data and not data in text form.

To read about the different methods of feature encoding go to following link: https://github.com/4GeeksAcademy/machine-learning-content/blob/master/05-3d-data/feature-encoding-for-categorical-variables.ipynb

To add some additional information: here we will one-hot encode our categorical features with scikit-learn, applied to several columns at once, but you are free to encode with Pandas (for example get_dummies) or to manually map specific numbers to each category.
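As a point of comparison, a manual mapping and the Pandas get_dummies alternative could look like this sketch on a toy column (the rows and the 0/1 mapping are chosen arbitrarily for illustration):

```python
import pandas as pd

# Toy column (invented rows, not the full dataset)
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Manual encoding: map each category to a number of our choice
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
print(df['Sex_encoded'].tolist())  # [0, 1, 1, 0]

# Or pandas one-hot encoding: one 0/1 column per category
dummies = pd.get_dummies(df['Sex'], prefix='Sex')
print(list(dummies.columns))  # ['Sex_female', 'Sex_male']
```

Manual mapping keeps a single column but imposes an ordering on the categories; one-hot encoding avoids that at the cost of extra columns.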

In [12]:

```
# One-hot encoding multiple columns
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough')
transformed = transformer.fit_transform(train_data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
print(transformed_df.head())
```

In [14]:

```
#Changing to more friendly names: strip the transformer prefixes
#(e.g. 'onehotencoder__Embarked_C' -> 'Embarked_C', 'remainder__Pclass' -> 'Pclass')
transformed_df.columns = [col.split('__')[-1] for col in transformed_df.columns]
transformed_df = transformed_df.rename(columns={'Sex_female': 'Female', 'Sex_male': 'Male'})

In [15]:

```
#verifying my final train dataframe
transformed_df.head()
```

Out[15]:

In [16]:

```
#Repeating the transformation in the test dataframe.
#We refit here because the transformer was fitted on train_data, which includes
#the 'Survived' column that test_data does not have.
transformed_test = transformer.fit_transform(test_data)
transformed_test_df = pd.DataFrame(transformed_test, columns=transformer.get_feature_names_out())
```

In [17]:

```
#Changing to more friendly names in the test dataframe
transformed_test_df.columns = [col.split('__')[-1] for col in transformed_test_df.columns]
transformed_test_df = transformed_test_df.rename(columns={'Sex_female': 'Female', 'Sex_male': 'Male'})
```

In [18]:

```
# Verifying new test dataframe
transformed_test_df.head()
```

Out[18]:

So now that we have all our features converted into numbers, are they ready for modeling? It depends on whether all our features are on the same scale or not. To read about what it means to have different scales and the methods to standardize them, go to the following link: https://github.com/4GeeksAcademy/machine-learning-content/blob/master/05-3d-data/feature-scaling.ipynb

After reading it, we decide to implement the MinMaxScaler, but you are free to find reasons to scale your dataframe in a different way.

In [20]:

```
# Import and apply the scaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the train data only, then apply it to both datasets,
# so the test data is scaled with the same parameters as the train data
scaler.fit(transformed_df[['Age', 'Fare']])
transformed_df[['Age', 'Fare']] = scaler.transform(transformed_df[['Age', 'Fare']])
transformed_test_df[['Age', 'Fare']] = scaler.transform(transformed_test_df[['Age', 'Fare']])
```

In [21]:

```
#Verifying
transformed_df.head()
```

Out[21]:

In [22]:

```
#Verifying
transformed_test_df.head()
```

Out[22]:

Before showing you some feature selection techniques, let's save our clean train and test datasets.

In [23]:

```
# Save transformed train_data as clean_titanic_train
transformed_df.to_csv('assets/processed/clean_titanic_train.csv', index=False)
```

In [24]:

```
# Save transformed test_data as clean_titanic_test
transformed_test_df.to_csv('assets/processed/clean_titanic_test.csv', index=False)
```

**End of Day 2!**

Now let's clean our project dataset and leave it almost ready for modeling!

Go to the following link https://github.com/4GeeksAcademy/machine-learning-content/blob/master/05-3d-data/feature-selection.ipynb to read about when it is necessary to perform feature selection and what the existing methods are.

How do we retrieve the 5 most informative features in the Titanic dataset?

In [25]:

```
transformed_df.head()
```

Out[25]:

In [26]:

```
#Separate features from target
X = transformed_df.drop("Survived",axis=1)
y = transformed_df["Survived"]
```

In [27]:

```
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
mdlsel = SelectKBest(chi2, k=5)
mdlsel.fit(X,y)
ix = mdlsel.get_support()
data2 = pd.DataFrame(mdlsel.transform(X), columns = X.columns.values[ix])
data2.head(n=5)
```

Out[27]:

This gives us the 5 most important features according to the chi-square method.

This is just to show how to apply one of the feature selection methods on the Titanic dataset in order to reduce the number of features before modeling. However, Titanic is a small dataset, so you should evaluate whether feature selection is needed at all.

There are algorithms that include feature selection in their modeling process (for example, Lasso modeling).

This process is closely related to the modeling process because, in order to verify that we are doing a good feature selection, it is sometimes necessary to do some modeling with different feature groups and compare the accuracy achieved.
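To make that concrete, here is a sketch of comparing cross-validated accuracy for two feature groups. It uses synthetic data and a logistic regression as a stand-in model, so the numbers are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a processed dataset: 8 features, only 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=42)

model = LogisticRegression(max_iter=1000)

# Mean accuracy with all features vs. with an arbitrary subset of 5 columns
acc_all = cross_val_score(model, X, y, cv=5).mean()
acc_subset = cross_val_score(model, X[:, :5], y, cv=5).mean()
print(f'All features: {acc_all:.3f} | 5-feature subset: {acc_subset:.3f}')
```

The same pattern works with your real train data: run cross-validation once with all columns and once with the columns chosen by a selector (such as the SelectKBest above), and keep the feature group that scores better.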

Now go ahead and analyse your project dataframe!