Python is a versatile language for working with data, particularly through the Pandas library. In this article, we look at the astype() method in Pandas, which lets us cast the data in a DataFrame or Series to the type we want. It can also convert existing columns to the categorical type.
What is the astype() Method in Pandas?
The astype() method in Pandas casts a pandas object, such as a DataFrame or Series, to a specified data type. It returns a new object rather than modifying the original, so the result has to be assigned back. It is useful when we need to change the data type of a specific column or of several columns at once.
Besides changing the data type of columns, the astype() method also allows us to convert columns to categorical types. This is useful when dealing with variables that only have a limited number of unique values, such as categorical variables or factors.
Syntax and Parameters of astype() Method
Now let us explore the syntax of the astype() method.
DataFrame.astype(dtype, copy=True, errors='raise')
Here is the breakdown of the syntax:
- dtype: Specifies the data type to which the object should be cast. It can be a string alias such as 'int64', a numpy.dtype, a Python type, or a pandas extension type such as 'category'. For a DataFrame, we can instead provide a dictionary with column names as keys and their target data types as values.
- copy: Specifies whether to return a copy. By default, it is set to True. If copy=False, pandas may reuse the underlying data, so later changes to the values may be reflected in other pandas objects.
- errors: Controls what happens when the data cannot be converted to the requested type. It takes two values: 'raise' (default) lets the exception propagate, while 'ignore' suppresses it and returns the original object unchanged.
- **kwargs: Older pandas releases also accepted additional keyword arguments that were passed on to the constructor. Releases from 1.0 onward no longer accept them, which is why they are omitted from the signature above.
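The parameters above can be tried out with a small sketch. The Series names and values here are made up for illustration; the sketch shows that a Python type, a NumPy dtype, and a string alias can all name the same target type, and that the default errors='raise' stops the cast when a value cannot be converted.

```python
import numpy as np
import pandas as pd

counts = pd.Series([1, 2, 3])  # hypothetical sample data

# Three ways to name the same target type
via_builtin = counts.astype(float)
via_numpy = counts.astype(np.float64)
via_alias = counts.astype("float64")

# With the default errors='raise', an unconvertible value stops the cast
raw = pd.Series(["1", "2", "x"])
try:
    raw.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

print(via_builtin.dtype, via_numpy.dtype, via_alias.dtype)
print("cast failed:", cast_failed)
```

Catching ValueError is the choice made in this sketch; the exact exception message depends on the pandas version.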
Now let us see various use cases of the astype() function.
Casting the Data Type of a Single Column
The astype() method is commonly used to change the data type of a specific column in a DataFrame. Let’s consider an example: We have a DataFrame with columns representing different attributes of a person, such as Name, Age, and Weight. We want to convert the Weight column to an integer data type.
We can use the astype() method in this way:
import pandas as pd

data = {
    "Name": ["John", "Emma", "Michael"],
    "Age": [25, 30, 35],
    "Weight": [65.2, 68.5, 73.1]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print('Original DataFrame:\n', df)

# Change the data type of 'Weight' column
df["Weight"] = df["Weight"].astype('int64')

# Display the new DataFrame.
print('Updated DataFrame:\n', df)
Output:
Original DataFrame:
      Name  Age  Weight
0     John   25    65.2
1     Emma   30    68.5
2  Michael   35    73.1
Updated DataFrame:
      Name  Age  Weight
0     John   25      65
1     Emma   30      68
2  Michael   35      73
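Note that the cast above discards the fractional part instead of rounding (68.5 became 68). The following minimal sketch, using made-up weights, contrasts a plain cast with rounding first; whether rounding is appropriate depends on the data.

```python
import pandas as pd

weights = pd.Series([65.2, 68.7, 73.9])  # hypothetical values

# A direct cast drops everything after the decimal point
truncated = weights.astype("int64")

# Rounding to the nearest whole number before casting
rounded = weights.round().astype("int64")

print(truncated.tolist())
print(rounded.tolist())
```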
Casting the Data Type of Multiple Columns
In addition to changing the data type of a single column, the astype() method also allows us to change the data types of multiple columns simultaneously. This can be achieved by providing a dictionary containing column names as keys and their corresponding data types as values.
Using the same DataFrame as above, let us now change the data types of both the Weight and Age columns. Check the Python code below:
import pandas as pd

data = {
    "Name": ["John", "Emma", "Michael"],
    "Age": [25, 30, 35],
    "Weight": [65.2, 68.5, 73.1]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print('Original DataFrame:\n', df)

# Change the data type of 'Weight' and 'Age' columns
df = df.astype({"Age": 'float', "Weight": 'int64'})

# Display the new DataFrame.
print('Updated DataFrame:\n', df)
Output:
Original DataFrame:
      Name  Age  Weight
0     John   25    65.2
1     Emma   30    68.5
2  Michael   35    73.1
Updated DataFrame:
      Name   Age  Weight
0     John  25.0      65
1     Emma  30.0      68
2  Michael  35.0      73
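To confirm that a dictionary cast did what we intended, it can help to inspect the dtypes attribute. The sketch below reuses the example data and also shows, as an illustration, that a dictionary key which is not a column name is rejected. The "Height" key is hypothetical and deliberately absent from the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Michael"],
    "Age": [25, 30, 35],
    "Weight": [65.2, 68.5, 73.1],
})

# Cast two columns at once; columns not listed keep their type
converted = df.astype({"Age": "float64", "Weight": "int64"})
print(converted.dtypes)

# A key that does not match any column is rejected
try:
    df.astype({"Height": "float64"})  # "Height" is not a column here
    unknown_key_rejected = False
except KeyError:
    unknown_key_rejected = True

print("unknown key rejected:", unknown_key_rejected)
```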
Converting Columns to Categorical Type
The astype() method can also convert a column to the categorical type. Categorical types are useful when dealing with variables that have a limited number of unique values or that represent categories or factors.
Consider a scenario where we have a DataFrame with a column representing the gender of individuals. We want to convert the Gender column to a categorical type. Converting the column to the categorical data type allows for more efficient data storage when the same few values repeat many times. The printed DataFrame looks the same after the conversion; only the dtype of the column changes. Here is the code:
import pandas as pd

data = {
    "Name": ["John", "Emma", "Michael"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [25, 30, 35]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print('Original DataFrame:\n', df)

# Convert the 'Gender' column to categorical data type
df["Gender"] = df["Gender"].astype('category')

# Display the new DataFrame.
print('Updated DataFrame:\n', df)
Output:
Original DataFrame:
      Name  Gender  Age
0     John    Male   25
1     Emma  Female   30
2  Michael    Male   35
Updated DataFrame:
      Name  Gender  Age
0     John    Male   25
1     Emma  Female   30
2  Michael    Male   35
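Since the printed frames are identical, the effect of the conversion is easier to see by inspecting the column itself. The following sketch uses a hypothetical, larger column of 10,000 repeated labels so that the storage difference becomes visible; exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical larger column with only two distinct labels
gender = pd.Series(["Male", "Female"] * 5000)
gender_cat = gender.astype("category")

print(gender_cat.dtype)                 # category
print(list(gender_cat.cat.categories))  # the distinct labels

# Compare memory use before and after the conversion
before = gender.memory_usage(deep=True)
after = gender_cat.memory_usage(deep=True)
print(before, after)
```

With only three rows, as in the example above, the saving would be negligible or even negative, since the category labels themselves also need to be stored.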
Handling Missing Values
When working with real-world datasets, it is common to encounter missing or NaN (Not a Number) values. To learn more about NaN values and how to handle them, refer to Pandas-fillNA.
The astype() method does not handle missing values by itself: casting a column that contains NaN to a NumPy integer type such as int64 raises an error, because NaN has no integer representation. Consider a scenario where we have a DataFrame with a column representing the weight of individuals, and this column contains some missing (NaN) values.
To avoid errors while changing the data type of the Weight column, we are required to handle the missing values first. We can accomplish this by dropping the rows containing any NaN values using the dropna() method.
After handling the missing values, we can proceed with changing the data type of the Weight column using the astype() method. Check the example below:
import pandas as pd

data = {
    "Name": ["John", "Emma", "Michael"],
    "Weight": [65.2, 68.5, None],
    "Age": [25, 30, 35]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print('Original DataFrame:\n', df)

# Use dropna to handle missing values
df.dropna(inplace=True)

# Change the data type of 'Weight' column
df["Weight"] = df["Weight"].astype('int64')

# Display the new DataFrame.
print('Updated DataFrame:\n', df)
Output:
Original DataFrame:
      Name  Weight  Age
0     John    65.2   25
1     Emma    68.5   30
2  Michael     NaN   35
Updated DataFrame:
   Name  Weight  Age
0  John      65   25
1  Emma      68   30
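Dropping rows is only one option. The sketch below first shows the error that a direct cast raises and then two alternatives that keep every row: filling the gap with a placeholder before casting, and casting to the nullable 'Int64' extension type, which can hold missing values. These alternatives are suggestions that go beyond the example above; the fill value of 0 is an arbitrary assumption and may not suit real data.

```python
import pandas as pd

weights = pd.Series([65.2, 68.5, None])  # None is stored as NaN

# A direct cast to a NumPy integer type fails while NaN is present
try:
    weights.astype("int64")
    nan_cast_failed = False
except ValueError:
    nan_cast_failed = True

# Option 1: fill the gap with a placeholder (0 is an arbitrary choice), then cast
filled = weights.fillna(0).astype("int64")

# Option 2: round to whole numbers, then use the nullable 'Int64' type,
# which keeps the missing entry as <NA>
nullable = weights.round().astype("Int64")

print("direct cast failed:", nan_cast_failed)
print(filled.tolist())
print(nullable)
```

The round() call matters in the second option: in the pandas versions this sketch assumes, casting floats with a fractional part straight to 'Int64' is refused rather than truncated.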
Conclusion
In this article, we discussed the astype() method in Pandas. It is useful during data analysis for changing the data types of one or more columns in a DataFrame, including conversion to the categorical type. By understanding its syntax and use cases, and by handling missing values before casting, we can manage data types in Pandas effectively for a variety of data analysis tasks.