Data Cleaning for Beginners

What is data cleaning and why is it important?

When starting a new project, it is important to use a clear framework so that you have a clear objective and can find the most efficient solution to the problem at hand. As you work on more projects, it becomes easier to see which data science process fits your personal preferences and the way you work. Currently, I am most comfortable using the OSEMiN model, which stands for Obtain, Scrub, Explore, Model and iNterpret.

The first part of every data science project is to define its goal and understand its main objective. Once this is clear, you can start gathering data related to the project, which can be sourced organically or from external databases. As you start gathering data, it is common that it will not be perfect and will need to be cleaned or preprocessed. This step, the Scrubbing or cleaning step, is one of the most important steps in the OSEMiN model. The goal of this post is to explain what this step is, why it is important, and which strategies can be used to properly clean or scrub data.

What is Data Scrubbing?

Data Scrubbing is the process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing parts of the data and then modifying, replacing or deleting them according to necessity. This step is a fundamental element of basic data science.

Data is highly valuable for analytics and machine learning. Real-world data, however, often contains incomplete, inconsistent or missing values. Cleaning the data goes a long way toward improving a model's performance. Let's look at an example that highlights the importance of data cleaning.

King County dataset

We can do this with the help of Python's Pandas library, which is used primarily for reading, processing and analyzing tabular data such as .csv files. As you can see below, the pandas library is imported and then used to read the relevant .csv file. Finally, the first 5 houses can be viewed using the .head() function.
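A minimal sketch of that step, assuming the housing data lives in a file called kc_house_data.csv (the file name is an assumption, so adjust it to your own copy):

    import pandas as pd

    # Read the King County housing data into a DataFrame
    df = pd.read_csv('kc_house_data.csv')

    # Preview the first 5 houses
    df.head()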

Figure 1: First five houses in the dataset

The first step of data cleaning is to remove the information that is not relevant to the main objective: predicting housing prices. In this case, the “id” and “date” columns are not relevant to the objective, so they were dropped. We can use pandas.DataFrame.drop to drop specific rows or columns.
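For example, continuing with the DataFrame loaded above:

    # Drop the columns that are not relevant to predicting price
    df = df.drop(columns=['id', 'date'])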

Next, you can move on to dealing with missing data. Handling missing data is very important because leaving missing values as they are will affect your analysis, so we must deal with them before moving forward. Depending on the situation, there are three strategies for dealing with missing values in a dataset: dropping the rows or columns that contain them, filling them in with a representative value such as the mean or median, or keeping them and treating them as their own category.

To confirm whether a dataset has any missing values, we can use pandas.isna() to detect them and add .sum() to display how many missing values there are in each column.
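For example:

    # Count the missing values in each column
    df.isna().sum()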

Figure 2: Null values in the dataset

Waterfront

With the help of the .unique() function, you can see that there are three unique values in this column: 1 (waterfront view), 0 (no waterfront view) and NaN (not a number). Because waterfront views do affect housing prices, the column could not be dropped. Instead, the DataFrame was split in two: houses with waterfront views (1) and houses without waterfront views (0). This way the houses without waterfront information (the NaNs) are removed from the dataset. Finally, the two subsets are combined again using pandas.concat().
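A sketch of that split-and-recombine approach, assuming the column is named waterfront (a single dropna on that column would achieve the same result):

    # Keep only the houses where waterfront is explicitly 1 or 0,
    # which leaves out the rows where the value is NaN
    with_view = df[df['waterfront'] == 1]
    without_view = df[df['waterfront'] == 0]

    # Recombine the two subsets into one DataFrame
    df = pd.concat([with_view, without_view])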

View

For the view column, there were only 63 missing values, so I decided to fill them with the mean value using pandas.fillna().
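A sketch of that step, assuming the column is named view:

    # Fill the 63 missing view values with the column's mean
    df['view'] = df['view'].fillna(df['view'].mean())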

To confirm that all missing values were dealt with, we can call .isna().sum() again.

Outliers

Next, possible outliers can be dealt with. But what is an outlier? According to Wikipedia, an outlier is a data point that differs significantly from other observations. Outliers can be created by errors in experiments, incorrect data entry or variability in measurements. Using pandas.describe(), which shows the summary statistics of each column, it became obvious that there was a clear outlier in the bedrooms column.
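For example:

    # Summary statistics (count, mean, std, min, quartiles, max) for each numeric column
    df.describe()

    # Or focus on the bedrooms column only
    df['bedrooms'].describe()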

As we can see, the max value for bedrooms is 33, and upon closer inspection the house in question had 33 bedrooms, one floor and 1.75 bathrooms. Since this seemed impossible, this specific row was dropped.
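A sketch of how that row could be removed, using a boolean filter on the bedrooms column:

    # Inspect the suspicious house
    df[df['bedrooms'] == 33]

    # Drop it by its index
    df = df.drop(df[df['bedrooms'] == 33].index)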

One of the columns also contained question marks (“?”) instead of numeric values. Using .fillna() with the column's median value, the question marks were replaced, and then the datatype was changed using pandas.DataFrame.astype().
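A sketch of that step; since the post does not name the affected column, some_column is used here as a placeholder:

    import numpy as np

    # Treat the '?' placeholders as missing values
    # ('some_column' stands in for the affected column)
    col = df['some_column'].replace('?', np.nan)

    # Fill the missing entries with the column's median, then fix the datatype
    median_value = pd.to_numeric(col).median()
    df['some_column'] = col.fillna(median_value).astype(float)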

Conclusion

Finally, the dataset is clean and we can move on to exploring it! Data cleaning is very important for making our analysis error-free. In the real world, many of the most interesting datasets are filled with missing or incorrect information, and a small error can ruin a model's performance. So, make sure the data is always clean.
