pandas is a widely used data-munging library for Python. It is well suited to many kinds of data, such as tabular and time series data. In this article, we list 10 important interview questions on pandas that one must know.
1| Define Python pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is a high-level data manipulation tool originally developed by Wes McKinney and built on the NumPy package. It provides fast and flexible data structures that make it easy to work with relational or labelled data.
2| Mention The Different Types Of Data Structures In pandas?
pandas supports two main data structures, Series and DataFrame. Both are built on top of NumPy. Series is the one-dimensional data structure and DataFrame is the two-dimensional one. Older versions also included Panel, a three-dimensional data structure with the axes items, major_axis, and minor_axis. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so current code should not rely on it.
3| Explain Series In pandas. How To Create A Copy Of A Series In pandas?
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
>>> s = pd.Series(data, index=index)
where data can be a Python dict, an ndarray, or a scalar value.
To create a copy of a Series, call its copy() method:
s2 = s1.copy()
This creates a new Series s2 from s1. The copy is deep by default, so later changes to the data in s2 do not affect s1.
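As a minimal sketch of the three kinds of input mentioned above and of copy(), the snippet below builds small Series objects. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray with an explicit index of the same length.
s_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every index label.
s_scalar = pd.Series(5, index=["p", "q"])

# copy() is deep by default, so changing the copy leaves the original alone.
s1 = pd.Series([1, 2, 3])
s2 = s1.copy()
s2.iloc[0] = 99
```

After this runs, s1 still starts with 1 while s2 starts with 99.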
4| What Is A pandas DataFrame? How Will You Create An Empty DataFrame In pandas?
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns. A DataFrame can be created from lists, a dictionary, a list of dictionaries, and so on.
To create an empty DataFrame in pandas, type:
import pandas as pd
df = pd.DataFrame()
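A short sketch of how one might confirm that a DataFrame is empty, and how to create an empty one that already declares its columns. The column names "name" and "score" are hypothetical.

```python
import pandas as pd

# A completely empty DataFrame: no rows and no columns.
df = pd.DataFrame()

# An empty DataFrame that already declares its columns (hypothetical names).
df_cols = pd.DataFrame(columns=["name", "score"])

# The .empty attribute reports whether the DataFrame holds any elements.
is_empty = df.empty
```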
5| Explain Reindexing In pandas.
Reindexing conforms a DataFrame to a new index with optional filling logic, placing NA/NaN in locations that had no value in the previous index. It can be applied to the row labels, the column labels, or both, and it returns a new object rather than modifying the original.
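The following is a small illustration of reindexing with DataFrame.reindex, using made-up labels and values. It shows the default NaN filling, the fill_value option, and reindexing of columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0], "qty": [10, 20, 30]},
    index=["a", "b", "c"],
)

# Row label "d" is not in the original index, so its row is filled with NaN.
rows = df.reindex(["a", "c", "d"])

# fill_value supplies the optional filling logic in place of NaN.
filled = df.reindex(["a", "c", "d"], fill_value=0)

# Columns can be conformed the same way; "discount" is new and becomes NaN.
cols = df.reindex(columns=["qty", "price", "discount"])
```

In each case df itself keeps its original labels.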
6| What Are The Key Features Of pandas Library?
pandas has many features. Some of the key ones are listed below:
- Data alignment
- Memory efficiency
- Reshaping
- Merge and join
- Time series
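To make a few of these features concrete, here is a brief sketch of data alignment, merging, and reshaping. All of the data and column names are invented for the example.

```python
import pandas as pd

# Data alignment: arithmetic matches values by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
aligned = a + b  # labels present on only one side become NaN

# Merge and join: combine two tables on a shared key column.
left = pd.DataFrame({"key": ["k1", "k2"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["k2", "k3"], "right_val": [3, 4]})
merged = left.merge(right, on="key", how="inner")

# Reshaping: pivot long data into a wide table.
long_df = pd.DataFrame(
    {
        "day": ["mon", "mon", "tue", "tue"],
        "item": ["tea", "coffee", "tea", "coffee"],
        "sold": [3, 5, 4, 6],
    }
)
wide = long_df.pivot(index="day", columns="item", values="sold")
```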
7| What Is pandas Used For?
pandas is written for the Python programming language and is used for operations such as data manipulation and data analysis. It provides data structures and operations for manipulating numerical tables and time series.
8| Explain Categorical Data In pandas.
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time, or rating via Likert scales. All values of categorical data are either in categories or np.nan.
The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
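The first two cases can be sketched as follows. The size labels and colour strings are made-up sample data, and the exact memory figures will vary by pandas version, so only the comparison matters.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Plain strings sort lexically: large, medium, small, small.
lexical = sizes.sort_values().tolist()

# An ordered categorical sorts by the declared category order instead.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
cat = sizes.astype(size_type)
logical = cat.sort_values().tolist()

# min/max also follow the logical order.
smallest, largest = cat.min(), cat.max()

# Values outside the declared categories become NaN.
with_unknown = pd.Series(["small", "huge"]).astype(size_type)

# A few distinct strings repeated many times usually need less memory
# when stored as a category.
repeated = pd.Series(["red", "green", "blue"] * 10_000)
bytes_as_strings = repeated.memory_usage(deep=True)
bytes_as_category = repeated.astype("category").memory_usage(deep=True)
```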
9| What Are The Different Ways A DataFrame Can Be Created In pandas?
A DataFrame can be created in several ways. Here are some of them:
- Using a list of lists:
import pandas as pd
# Initialize a list of lists.
data = [['p', 1], ['q', 2], ['r', 3]]
# Create the pandas DataFrame with named columns.
df = pd.DataFrame(data, columns=['Letter', 'Number'])
# Display the DataFrame.
df
- Using a dict of ndarrays/lists:
To create a DataFrame from a dict of ndarrays or lists, all of the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.
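A minimal sketch of these rules, using invented column names and values. It builds a DataFrame from a dict of ndarrays with and without an index, and shows that arrays of different lengths are rejected.

```python
import numpy as np
import pandas as pd

data = {
    "letter": np.array(["p", "q", "r"]),
    "number": np.array([1, 2, 3]),
}

# No index passed: pandas uses the default index 0..n-1.
df_default = pd.DataFrame(data)

# Index passed: it must have the same length as the arrays.
df_labeled = pd.DataFrame(data, index=["first", "second", "third"])

# Arrays of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({"a": np.array([1, 2, 3]), "b": np.array([1, 2])})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```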
- Using a dict of lists with a custom index:
# DataFrame from a dict of lists with a custom index.
import pandas as pd
# Initialize the data as a dict of lists.
data = {'Name': ['Tom', 'Jack', 'nick', 'juli'], 'marks': [99, 98, 95, 90]}
# Create the pandas DataFrame with explicit row labels.
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
# Display the DataFrame.
df
10| What Is A Time Series In pandas?
A time series is an ordered sequence of data points that represents how some quantity changes over time. pandas contains extensive capabilities and features for working with time series data across domains.
pandas supports:
- Parsing time series information from various sources and formats
- Generating sequences of fixed-frequency dates and time spans
- Manipulating and converting date time with timezone information
- Resampling or converting a time series to a particular frequency
- Performing date and time arithmetic with absolute or relative time increments
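The capabilities listed above can be sketched with a handful of calls. The dates, values, and the "Asia/Tokyo" time zone are arbitrary choices for illustration, and the time zone conversion assumes time zone data is available on the system.

```python
import pandas as pd

# Parsing: turn date strings into a DatetimeIndex.
stamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])

# Fixed-frequency sequences: six consecutive days.
days = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=days)

# Time zones: attach UTC, then convert to another zone.
ts_utc = ts.tz_localize("UTC")
ts_tokyo = ts_utc.tz_convert("Asia/Tokyo")

# Resampling: aggregate the daily values into two-day bins.
two_day = ts.resample("2D").sum()

# Arithmetic: absolute (Timedelta) and calendar-relative (DateOffset) steps.
start = pd.Timestamp("2024-01-31")
plus_one_day = start + pd.Timedelta(days=1)
plus_one_month = start + pd.DateOffset(months=1)
```

Here a Timedelta always adds a fixed duration, while a DateOffset follows the calendar, so adding one month to 31 January 2024 lands on the last day of February.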