Introduction
We love having containers that hold our sweets or the chocolates we have collected over a long time. In the same way, Python has its own kinds of containers, known in the coding world as ‘Python libraries’, that hold ready-made code and tools you can reuse in your own programs.
Pandas and NumPy are two of the most important Python libraries used for working with data. They help you store, organise, and analyse large amounts of information efficiently. Think of them as tools that help you turn messy piles of numbers into easy-to-understand tables or charts.
- NumPy (short for Numerical Python) is a library designed for performing fast mathematical operations on large arrays and matrices of numbers. It’s great for handling lots of numerical data quickly.
- Pandas builds on top of NumPy and makes it easier to work with data, especially tabular data like spreadsheets. It lets you manipulate, analyze, and visualize data in a way that’s simple and user-friendly.
Let’s dive deeper into how they work and how you can use them.
Getting Started with NumPy and Pandas
Before using these libraries, you need to install them. You can install them by running the following command in your terminal or command prompt:
pip install numpy pandas
Once they are installed, you can start using them in your Python program by importing the libraries:
import numpy as np
import pandas as pd
Here, we’re importing NumPy as np and Pandas as pd to make the code shorter and easier to read.
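A quick way to confirm that both libraries installed correctly is to print their version strings. This is only a minimal check; the version numbers you see depend on your own installation.

```python
import numpy as np
import pandas as pd

# Each library exposes its installed version as a string.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
```

If either import fails with a ModuleNotFoundError, the library is not installed in the Python environment you are running.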
Working with NumPy
NumPy is especially useful when working with arrays of numbers. Let’s look at how to create and use NumPy arrays.
Creating a NumPy Array
A NumPy array is like a list of numbers, but it’s faster and has more capabilities. Here’s how you can create one:
import numpy as np

# Creating a 1D NumPy array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
This creates a one-dimensional (1D) array and prints it: [1 2 3 4 5]. Note that NumPy prints array elements separated by spaces, not commas.
You can also create a two-dimensional array (2D), which looks like a grid or a table:
# Creating a 2D NumPy array
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr_2d)
This will create a table of numbers like:
[[1 2 3]
 [4 5 6]]
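To see why a 2D array behaves like a grid, it can help to inspect its dimensions and pick out a single cell. The sketch below reuses the same arr_2d values and only relies on standard array attributes.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d.ndim)   # number of dimensions: 2
print(arr_2d.shape)  # (rows, columns): (2, 3)
print(arr_2d.size)   # total number of elements: 6
print(arr_2d[1, 2])  # row 1, column 2 (both counted from 0): 6
```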
Basic Operations with NumPy
NumPy lets you perform mathematical operations on your arrays easily. For example:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Adding 10 to each element
arr_plus_ten = arr + 10
print(arr_plus_ten)
This adds 10 to each number in the array and prints [11 12 13 14 15].
You can also do multiplication, division, and subtraction:
# Multiply each element by 2
arr_times_two = arr * 2
print(arr_times_two)  # Output: [ 2  4  6  8 10]
NumPy is very powerful for working with large datasets because it applies each operation to the whole array at once in optimised, compiled code, which is usually much faster than looping over a Python list.
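Subtraction and division work the same way as addition and multiplication, and two arrays of the same length can also be combined element by element. The following is a small sketch; the second array, other, is a made-up example and not part of the original walkthrough.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
other = np.array([10, 20, 30, 40, 50])  # hypothetical second array

minus_one = arr - 1   # values: 0, 1, 2, 3, 4
halved = arr / 2      # values: 0.5, 1.0, 1.5, 2.0, 2.5
summed = arr + other  # values: 11, 22, 33, 44, 55

print(minus_one)
print(halved)
print(summed)

# Whole-array summaries are available as methods.
print(arr.sum())   # 15
print(arr.mean())  # 3.0
```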
Working with Pandas
While NumPy is great for handling numerical data, ‘Pandas’ is perfect for working with labeled data, such as tables, spreadsheets, or CSV files. The two main structures in Pandas are Series and DataFrames.
Pandas Series
A Series is like a column in a table. It’s a one-dimensional array, but it also has labels (called an index). Here’s how you create a Pandas Series:
import pandas as pd

# Creating a Pandas Series
data = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(data)
This will output the following:
a    10
b    20
c    30
d    40
dtype: int64
The index gives each data point a label (like a row name). You can access the data by using these labels:
print(data["b"])  # Output: 20
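Beyond looking up a single label, a Series can be accessed by position and used in arithmetic, much like a NumPy array. The sketch below uses the same data Series; .loc and .iloc are the explicit label-based and position-based accessors.

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(data.loc["b"])     # label-based access: 20
print(data.iloc[0])      # position-based access: 10
print(data[["a", "d"]])  # several labels at once

# Arithmetic applies to every value and keeps the labels.
doubled = data * 2
print(doubled["c"])      # 60
```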
Pandas DataFrame
A DataFrame is like an entire table. It’s a two-dimensional data structure that stores data in rows and columns. Here’s how you can create one:
import pandas as pd

# Creating a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)
This will create a table (DataFrame) and print it:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
You can think of each column as a Pandas Series, and the entire table as a collection of Series.
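The idea that a DataFrame is a collection of Series can be checked directly. This short sketch rebuilds the same table and confirms that a single column is a Series and that the table has three rows and three columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

ages = df["Age"]             # pulling out one column
print(type(ages).__name__)   # Series
print(df.shape)              # (rows, columns): (3, 3)
print(list(df.columns))      # ['Name', 'Age', 'City']
```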
Reading Data from CSV Files
One of the best features of Pandas is that it makes reading data from files super easy. For example, if you have a CSV (Comma Separated Values) file containing data, you can load it into a DataFrame like this:
df = pd.read_csv("data.csv")
Now you can work with the data just like you would with any table.
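If you don't have a CSV file to hand, you can try read_csv on text held in memory. The sketch below assumes a small made-up CSV with the same columns as the earlier table and wraps it in io.StringIO so that it behaves like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file on disk would be read the same way.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows of the table
print(df.shape)   # (3, 3)
```

The first line of the CSV is used for the column names, and Pandas works out that the Age column holds whole numbers.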
Accessing Data in a DataFrame
Once you have a DataFrame, you can access specific rows, columns, or individual values.
- Accessing a column:
print(df["Name"])
This will print the “Name” column of your DataFrame.
- Accessing a row by its index:
print(df.loc[1])
This will print the data for the row whose index label is 1, which is the second row because the default index starts at 0:
Name            Bob
Age              30
City    Los Angeles
Name: 1, dtype: object
- Accessing a specific value using row and column names:
print(df.at[1, "City"])  # Output: Los Angeles
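The difference between looking up by label and looking up by position is easier to see when the index is not just 0, 1, 2. The sketch below assumes a made-up text index ("alice", "bob", "charlie") purely for illustration: .loc uses the labels, while .iloc uses row and column positions.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "Los Angeles", "Chicago"]},
    index=["alice", "bob", "charlie"],  # hypothetical text labels
)

print(df.loc["bob", "City"])  # by label: Los Angeles
print(df.iloc[1, 1])          # by position (row 1, column 1): Los Angeles
print(df.loc[["alice", "charlie"], "Age"])  # several rows, one column
```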
Modifying Data in a DataFrame
You can easily modify or update data in your DataFrame. For example, to change Bob’s age:
df.at[1, "Age"] = 32
print(df)
This will update Bob’s age to 32.
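You won't always know which row number holds the value you want to change. One possible approach, sketched below, is to select the row by a condition instead, and the same style of assignment can add a whole new column. The column name Age_next_year is invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Update the Age value in every row where Name is "Bob".
df.loc[df["Name"] == "Bob", "Age"] = 32

# Add a new column computed from an existing one (hypothetical name).
df["Age_next_year"] = df["Age"] + 1

print(df)
```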
Basic Operations with Pandas
Pandas lets you perform various operations on your data, such as filtering, sorting, and summarising data.
- Filtering data:
You can filter the rows of a DataFrame based on a condition. For example, let’s find people older than 30:
older_than_30 = df[df["Age"] > 30]
print(older_than_30)
This will show only the rows where the age is greater than 30.
- Summarizing data:
You can quickly calculate statistics like the average, sum, or maximum value of a column:
print(df["Age"].mean())  # Output: average age
Pandas also has functions like sum(), max(), and min() to find the total, maximum, or minimum values in a column.
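Filtering, sorting, and summarising can be tried together on the original three-row table. This is a small sketch using the original ages (25, 30, 35); sort_values handles the sorting, and two conditions are combined with & (each wrapped in parentheses).

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Sorting: oldest first.
sorted_df = df.sort_values("Age", ascending=False)
print(sorted_df["Name"].tolist())  # ['Charlie', 'Bob', 'Alice']

# Filtering on two conditions at once.
in_range = df[(df["Age"] >= 30) & (df["City"] != "Chicago")]
print(in_range["Name"].tolist())   # ['Bob']

# Summarising a column.
print(df["Age"].sum(), df["Age"].max(), df["Age"].min())  # 90 35 25
print(df["Age"].mean())            # 30.0
```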
Working Together: NumPy and Pandas
Pandas is built on top of NumPy, which means you can use them together to do more complex operations. For example, you can use NumPy functions inside a Pandas DataFrame to perform calculations on your data.
Here’s an example of using NumPy’s sqrt() function (which calculates the square root) on a DataFrame column:
import numpy as np

df["Age_sqrt"] = np.sqrt(df["Age"])
print(df)
This will create a new column in the DataFrame called ‘Age_sqrt’, containing the square roots of the ages.
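The two libraries can also be mixed in the other direction: a column can be pulled out as a plain NumPy array, and NumPy helpers can be used to build new columns. The sketch below is one possible illustration; the Group column and its labels are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Convert a column to a NumPy array.
ages = df["Age"].to_numpy()
print(type(ages).__name__)  # ndarray
print(np.mean(ages))        # 30.0

# np.where picks one of two values for each row based on a condition.
df["Group"] = np.where(df["Age"] >= 30, "30+", "under 30")
print(df[["Name", "Group"]])
```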
Conclusion
Both NumPy and Pandas are powerful libraries for working with data in Python. NumPy excels at handling large arrays of numbers and performing mathematical operations quickly, while Pandas makes it easy to work with tabular data like spreadsheets or CSV files.
These libraries allow you to manipulate, analyze, and visualize data efficiently, which is essential for everything from scientific research to business analytics. These are just the basics of coding with Python. Want to learn more?