Pandas with Python

What is pandas?

 


Hello everyone!!!!

So in this blog, we are going to learn about pandas.....

So let's get started.....

Course content:

  • Introduction to pandas
  • Data Structures
  • Series 
  • DataFrame
  • Re-indexing
  • Operation between series and dataframe
  • Sorting and Ranking
  • Descriptive statistics 
  • Data loading, storage, and file formats
Introduction to Pandas:

  • Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • Python with pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.
  • Fast and efficient DataFrame object with default and customized indexing.
  • Tools for loading data into in-memory data objects from different file formats. 
  • Data alignment and integrated handling of missing data.
  • Reshaping and pivoting of data sets.
  • Label-based slicing, indexing, and subsetting of large data sets.
  • Columns from a data structure can be deleted or inserted.
  • Group by data for aggregation and transformations.
  • High-performance merging and joining of data.
Data Structures:
  • Pandas deals with the following data structures (Panel exists only in older releases):
  • Series:- 1 Dimensional 
  • DataFrame:- 2 Dimensional
  • Panel:- 3 Dimensional (deprecated in pandas 0.20 and removed in 0.25)
These data structures are built on top of NumPy arrays, which means they are fast.
The best way to think of these data structures is that the higher-dimensional data structure is a container of its lower-dimensional data structure. 
For example, a DataFrame is a container of Series, and in older releases a Panel was a container of DataFrames.
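To make the "container" idea concrete, here is a small sketch. The names and ages are made-up sample data used only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Name": ["Anil", "Anup"], "Age": [21, 22]})

# Selecting a single column gives back a Series
# that shares the DataFrame's row index.
age = df["Age"]
print(type(age).__name__)
print(age.index.equals(df.index))
```

Each column of the DataFrame behaves like its own Series, which is why most Series operations also work column by column.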

Series:
  • Series is a 1-dimensional array-like structure with homogeneous data, capable of holding data of any type (int, float, string, python objects, etc.)
  • The axis labels are collectively called the index.
  • Key points: Homogeneous data, Size immutable, Values of data mutable.
  • A pandas Series can be created by using the constructor-
pandas.Series(data,index,dtype,copy)
  
import pandas as pd
import numpy as np
s=pd.Series(dtype=float)
print(s)
print(type(s))
Output=Series([], dtype: float64)
<class 'pandas.core.series.Series'>
data=np.array(['a','b','c','d'])
print(data)
Output=['a' 'b' 'c' 'd']
s=pd.Series(data)
s
Output=
0    a
1    b
2    c
3    d
dtype: object
s=pd.Series(data,index=[111,222,333,444])
s
Output=
111    a
222    b
333    c
444    d
dtype: object

data={'a':0,'b':1,'c':2}
print(data)
Output={'a': 0, 'b': 1, 'c': 2}
s=pd.Series(data)
s
Output=
a    0
b    1
c    2
dtype: int64
s=pd.Series(data,index=['b','c','d','a'])
s
Output=
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
s=pd.Series(5,index=[0,1,2,3])
s
Output=
0    5
1    5
2    5
3    5
dtype: int64
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
s
Output=
a    1
b    2
c    3
d    4
e    5
dtype: int64
print(s.iloc[0])  # first element by position; plain s[0] on a labelled index is deprecated in recent pandas
print(s[:3])
print(s[-3:])
print(s['a'])
print(s[['a','c','d']])
Output=
1

a    1
b    2
c    3
dtype: int64

c    3
d    4
e    5
dtype: int64

1

a    1
c    3
d    4
dtype: int64
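The square-bracket examples above mix access by position and access by label. A minimal sketch of the explicit alternatives, assuming the same Series as above: .iloc always works by position and .loc always works by label.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]              # by position
by_label = s.loc['a']          # by label
pos_slice = s.iloc[0:3]        # positional slice: end position is excluded
label_slice = s.loc['a':'c']   # label slice: end label is included

print(first, by_label)
print(pos_slice.equals(label_slice))
```

Note the difference in the two slices: both return the rows a, b and c, but the label slice includes its end point.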

Data Operations:

s=pd.Series(np.random.randn(8))
print(s)
Output=
0   -0.263335
1    0.598104
2    2.115186
3    0.163075
4   -0.529759
5    1.268830
6    0.215765
7    1.313002
dtype: float64
s.axes
Output=[RangeIndex(start=0, stop=8, step=1)]
s2=pd.Series(np.random.randn(4),index=[11,12,13,14])
s2
Output=11    0.661457
12    0.112885
13   -0.334969
14    0.350130
dtype: float64
s2.axes
Output=[Int64Index([11, 12, 13, 14], dtype='int64')]
s.dtypes
Output=dtype('float64')
s.ndim
Output=1
s.size
Output=8
se=pd.Series(dtype=float)
print(se)
Output=Series([], dtype: float64)
se.empty
Output=True
s.empty
Output=False
s.values
Output=array([-0.26333516,  0.59810371,  2.11518637,  0.16307502, -0.52975919,
        1.26882994,  0.21576517,  1.31300243])
s.head(2)
Output=
0   -0.263335
1    0.598104
dtype: float64

s.tail(2)
Output=
6    0.215765
7    1.313002
dtype: float64

s.head()
Output=
0   -0.263335
1    0.598104
2    2.115186
3    0.163075
4   -0.529759
dtype: float64

s.tail()
Output=
3    0.163075
4   -0.529759
5    1.268830
6    0.215765
7    1.313002
dtype: float64


DataFrame:
  • A dataframe is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
Features of dataframe:
  • Columns can be of different types
  • Size is mutable
  • Labeled axes (rows and columns)
  • Can perform arithmetic operations on rows and columns.
A pandas DataFrame can be created using the constructor below, from various inputs like-
pandas.DataFrame(data,index,columns,dtype,copy)
  • Lists
  • Dictionary
  • Series
  • Numpy ndarrays
  • Another DataFrame

df=pd.DataFrame()
print(df)
Output=
Empty DataFrame
Columns: []
Index: []
data=[11,22,33,44,55]
df=pd.DataFrame(data)
print(df)
Output=
    0
0  11
1  22
2  33
3  44
4  55
data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"])
print(df)
Output=
      Name  age
0     Alok   10
1  Bhushan   20
2     Anil   30
data=[["Alok",10],["Bhushan",20],["Anil",30]]
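df=pd.DataFrame(data,columns=["Name","age"]).astype({"age":float})  # convert only the numeric column; dtype=float on the whole frame may raise on the Name strings in recent pandas
print(df)
Output=
      Name   age
0     Alok  10.0
1  Bhushan  20.0
2     Anil  30.0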
data={"Name":["Swapnil","Anil","Anup","Viraj"],
     "Age":[20,21,22,24]}
print(data)
Output={'Name': ['Swapnil', 'Anil', 'Anup', 'Viraj'], 'Age': [20, 21, 22, 24]}
df=pd.DataFrame(data)
print(df)
Output=
      Name  Age
0  Swapnil   20
1     Anil   21
2     Anup   22
3    Viraj   24

d={'Names':pd.Series(["Praanay","Prem","Atul","Amar","Sarthak"]),"Age":pd.Series([20,25,21,22,23]),"Rating":pd.Series([2.2,2.3,5.3,1.6,4.5])}
df=pd.DataFrame(d,columns=['Names',"Age","Rating"])
print(df)
Output=
     Names  Age  Rating
0  Praanay   20     2.2
1     Prem   25     2.3
2     Atul   21     5.3
3     Amar   22     1.6
4  Sarthak   23     4.5
print(df.T)
Output=              0     1     2     3        4
Names   Praanay  Prem  Atul  Amar  Sarthak
Age          20    25    21    22       23
Rating      2.2   2.3   5.3   1.6      4.5
print(df.axes)
Output=[RangeIndex(start=0, stop=5, step=1), Index(['Names', 'Age', 'Rating'], dtype='object')]
print(df.ndim)
Output=2
print(df.shape)
Output=(5, 3)
print(df.size)
Output=15
print(df.values)
Output=
[['Praanay' 20 2.2]
 ['Prem' 25 2.3]
 ['Atul' 21 5.3]
 ['Amar' 22 1.6]
 ['Sarthak' 23 4.5]]

data=pd.DataFrame(np.arange(16).reshape(4,4),index=["Indore","Raipur","Nagpur","Hyderabad"],columns=['one','two','three','four'])
print(data)
Output=
           one  two  three  four
Indore       0    1      2     3
Raipur       4    5      6     7
Nagpur       8    9     10    11
Hyderabad   12   13     14    15
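Since the size of a dataframe is mutable, columns can be inserted or deleted after it is created. Here is a small sketch on the same city table; the new column name 'five' is only an illustrative choice.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Indore", "Raipur", "Nagpur", "Hyderabad"],
                    columns=['one', 'two', 'three', 'four'])

# Insert a new column computed from two existing ones.
data['five'] = data['one'] + data['two']

# drop() returns a new frame without the column 'three'.
data = data.drop(columns=['three'])

print(data.columns.tolist())
print(data['five'].tolist())
```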

Re-Indexing:
  • A critical method on pandas objects is reindex(), which creates a new object with the data conformed to a new index.
  • For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing.
  • The method option allows us to do this, using a method such as ffill, which forward-fills the values.
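# Assumed definition of frame (the original omits it): only the 'Raipur' column with values 2, 5, 8 is implied by the output; the other two column names are placeholders.
frame=pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Nagpur','Mumbai','Raipur'])
states=["Raipur","Indore","Hyderabad"]
frame.reindex(columns=states)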
Output=
   Raipur  Indore  Hyderabad
a       2     NaN        NaN
b       5     NaN        NaN
c       8     NaN        NaN
frame.reindex(index=['a','b','c','d'],columns=states)
Output=
   Raipur  Indore  Hyderabad
a     2.0     NaN        NaN
b     2.0     NaN        NaN
c     8.0     NaN        NaN
d     NaN     NaN        NaN
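The forward-fill option mentioned in the bullets is not shown in the examples above, so here is a minimal sketch. The colour values are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data with gaps in the index (1, 3 and 5 are missing).
obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Without a method, the new labels get NaN.
plain = obj.reindex(range(6))

# With method='ffill', each new label takes the last value seen before it.
filled = obj.reindex(range(6), method='ffill')

print(filled)
```

Forward filling needs an ordered (monotonic) index, which is why it suits time series and other sorted data.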

Drop Command:
  • With the dataframe, index values can be deleted from either axis.
  • Dropping one or more entries from an axis is easy if one has an index array or list without those entries. As that can require a bit of set logic, the drop method will return a new object with the indicated value or values deleted from an axis.
obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)
Output=
a    0
b    1
c    2
d    3
e    4
dtype: int32
new_obj=obj.drop('c')
new_obj
Output=
a    0
b    1
d    3
e    4
dtype: int32
new_obj=obj.drop(['c','d'])
new_obj
Output=
a    0
b    1
e    4
dtype: int32
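The examples above drop entries from a Series. With a dataframe, the same drop method works on either axis. A short sketch, reusing the city table from the DataFrame section:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Indore", "Raipur", "Nagpur", "Hyderabad"],
                    columns=['one', 'two', 'three', 'four'])

# Drop rows by label (axis=0 is the default).
rows_dropped = data.drop(['Raipur', 'Nagpur'])

# Drop a column by passing axis=1 (or axis='columns').
cols_dropped = data.drop('two', axis=1)

print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
```

In both cases drop returns a new object, so the original data is left unchanged.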

Arithmetic and data alignment:
  • One of the most important pandas features is the behavior of arithmetic between objects with different indexes.
  • When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.
  • The internal data alignment introduces NaN values in the indices that don't overlap.
  • In the case of dataframe, alignment is performed on both the rows and the columns, which returns a dataframe whose index and columns are the unions of the ones in each dataframe.
  • Relatedly, when reindexing a series or dataframe, one can also specify a different fill value.
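The alignment rules above can be sketched with two small Series whose indexes only partly overlap. The values are made-up sample data.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# The result index is the union a, b, c, d.
# Labels present in only one Series ('a' and 'd') become NaN.
total = s1 + s2

# The add() method accepts fill_value, which stands in for the missing side.
filled = s1.add(s2, fill_value=0)

# reindex() also accepts fill_value for labels that did not exist before.
extended = s1.reindex(['a', 'b', 'c', 'd'], fill_value=0)

print(total)
print(filled)
```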
Operations between Dataframe & Series:
  • As with NumPy arrays, arithmetic between a DataFrame and a Series is well defined.
  • By default, arithmetic between a DataFrame and a Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows.
  • If an index value is not found in either the DataFrame columns or the Series index, the objects will be reindexed to form the union.
  • If one wants to instead broadcast over the columns, matching on the rows, one has to use one of the arithmetic methods (such as sub) and pass the axis to match on.
series2=pd.Series(range(3),index=['b','e','f'])
series2
Output=
b    0
e    1
f    2
dtype: int64
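frame=pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index=['Raipur','Nagpur','hyderabad','indore'])  # definition reconstructed from the output shown below
print(frame)
Output=
           b   d   e
Raipur     0   1   2
Nagpur     3   4   5
hyderabad  6   7   8
indore     9  10  11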
frame+series2
Output=
             b   d     e   f
Raipur     0.0 NaN   3.0 NaN
Nagpur     3.0 NaN   6.0 NaN
hyderabad  6.0 NaN   9.0 NaN
indore     9.0 NaN  12.0 NaN
series3=frame['d']
series3
Output=
Raipur        1
Nagpur        4
hyderabad     7
indore       10
Name: d, dtype: int32
frame
Output=
           b   d   e
Raipur     0   1   2
Nagpur     3   4   5
hyderabad  6   7   8
indore     9  10  11
frame.sub(series3,axis="index")
Output=
           b  d  e
Raipur    -1  0  1
Nagpur    -1  0  1
hyderabad -1  0  1
indore    -1  0  1

Function application and mapping:
  • NumPy ufuncs (element-wise array methods) work fine with pandas objects.
  • Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this.
  • Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary. 
  • The function passed to apply need not return a scalar value; it can also return a Series with multiple values.
frame=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['raipur','nagpur','hyderabad','indore'])
frame
Output=
                  b         d         e
raipur     1.977048 -1.860493  0.768591
nagpur    -1.498661 -2.329090  0.222861
hyderabad  0.110777 -0.467806 -0.943308
indore    -0.033976 -0.147853  0.157741

np.abs
Output=<ufunc 'absolute'>

f=lambda x:x.max()-x.min()
frame.apply(f)
Output=
b    3.475709
d    2.181237
e    1.711899
dtype: float64

frame.apply(f,axis='columns')
Output=
raipur       3.837542
nagpur       2.551952
hyderabad    1.054084
indore       0.305594
dtype: float64

def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Output=
            b         d         e
min -1.498661 -2.329090 -0.943308
max  1.977048 -0.147853  0.768591
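The first two bullets of this section can be sketched on a small frame with fixed values, so the results are easy to check by hand. The numbers are made-up sample data: a NumPy ufunc is applied to the whole frame, and the max-minus-min calculation is written both with apply and with the built-in DataFrame methods.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values, used instead of random numbers.
frame = pd.DataFrame([[1.5, -2.0], [-0.5, 3.0]],
                     columns=['b', 'd'], index=['raipur', 'nagpur'])

# A NumPy ufunc works element-wise and keeps the row and column labels.
abs_frame = np.abs(frame)

# Range of each column, written with apply...
with_apply = frame.apply(lambda x: x.max() - x.min())

# ...and the same result using DataFrame methods directly.
with_methods = frame.max() - frame.min()

print(abs_frame)
print(with_methods)
```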

Sorting and Ranking: 
  • Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index() method, which returns a new, sorted object.
  • With a dataframe, one can sort by index on either axis. The data is sorted in ascending order by default but can be sorted in descending order too.
  • To sort by the values in one or more columns, use the sort_values() method with the by option.
  • Ranking assigns ranks from one up to the number of valid data points. The rank methods for Series and DataFrame are the place to look; by default, rank breaks ties by assigning each group the mean rank.
  • Ranks can also be assigned according to the order they’re observed in the data.
  • Naturally, one can rank in descending order, too.
obj=pd.Series(range(4),index=['d','a','b','c'])
obj
Output=
d    0
a    1
b    2
c    3
dtype: int64
obj.sort_index()
Output=
a    1
b    2
c    3
d    0
dtype: int64

frame=pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)
Output=
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

frame.sort_index()
Output=
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

frame.sort_index(axis=1)
Output=
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

frame=pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
Output=
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

frame.sort_values(by='b')
Output=
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
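The examples above cover sorting only, so here is a minimal sketch of ranking and of sorting in descending order. The Series values are made-up sample data containing ties.

```python
import pandas as pd

# Hypothetical sample data with tied values (7 and 4 each appear twice).
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of the ranks they would occupy.
mean_rank = obj.rank()

# method='first': ties are ranked in the order they appear in the data.
first_rank = obj.rank(method='first')

# ascending=False: the largest value gets rank 1.
desc_rank = obj.rank(ascending=False)

# Sorting in descending order uses the same ascending option.
desc_sorted = obj.sort_values(ascending=False)

print(mean_rank.tolist())
print(first_rank.tolist())
```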


Axis indexes with duplicate values: 
  • Up until now, all of the examples we have seen had unique axis labels (index values). While many pandas functions (like reindex()) require that the labels be unique, it’s not mandatory.
  • The index's is_unique property can tell you whether its values are unique or not.
  • Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while a single entry returns a scalar value.
obj=pd.Series(range(5),index=['a','a','b','b','c'])
obj
Output=
a    0
a    1
b    2
b    3
c    4
dtype: int64
obj.index.is_unique
Output=False
obj['a']
Output=
a    0
a    1
dtype: int64
obj['c']
Output=4
df=pd.DataFrame(np.random.randn(4,3),index=['a','a','b','c'])
df
Output=
          0         1         2
a -0.948989 -0.236842  1.203461
a -1.186551  0.934325 -1.282523
b  0.679511 -1.089725  1.387880
c  0.743163 -0.895804  0.361094

df.loc['b']
Output=
0    0.679511
1   -1.089725
2    1.387880
Name: b, dtype: float64

df.loc['a']
Output=
          0         1         2
a -0.948989 -0.236842  1.203461
a -1.186551  0.934325 -1.282523

Descriptive statistics with pandas:
  • Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of NumPy arrays, they are all built from the ground up to exclude missing data.
  • NA values are excluded unless the entire slice is NA. This can be disabled using the skipna option.
df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'],columns=['one','two'])
df
Output=
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.sum()
Output=
one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')
Output=
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis='columns',skipna=False)
Output=
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.cumsum()
Output=
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

df.describe()
Output=
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

obj=pd.Series(['a','a','b','c']*4)
obj
Output=
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

obj.describe()
Output=count     16
unique     3
top        a
freq       8
dtype: object

Data loading, storage, and file formats:
  • The tools and libraries for data analysis are of little use if one can’t easily import and export data in Python. We will focus on input and output with pandas objects, though there are of course numerous tools in other libraries to aid in this process.
  • Input and output typically fall into a few main categories:
  • Reading text files and other more efficient on-disk formats
  • Loading data from databases 
  • Interacting with network sources like web APIs.

    I am giving you a txt file for practice; in the future you will also have to work with databases.

  • Download txt file (temp)
df=pd.read_csv("temp.txt")
df
Output=
   S.No    Name   Age       City  Salary         DOB
0     1  Vishal   NaN     Nagpur   20000  22-12-1998
1     2  Pranay  32.0     Mumbai    3000  23-02-1991
2     3  Akshay  43.0   Banglore    8300  12-05-1985
3     4     Ram  38.0  Hyderabad    3900  01-12-1992

print(df.shape)
Output=(4, 6)
df=pd.read_csv("temp.txt",usecols=["Name","Age"])
df
Output=
     Name   Age
0  Vishal   NaN
1  Pranay  32.0
2  Akshay  43.0
3     Ram  38.0

df=pd.read_csv("temp.txt",index_col=['S.No'])
df
Output=
        Name   Age       City  Salary         DOB
S.No
1     Vishal   NaN     Nagpur   20000  22-12-1998
2     Pranay  32.0     Mumbai    3000  23-02-1991
3     Akshay  43.0   Banglore    8300  12-05-1985
4        Ram  38.0  Hyderabad    3900  01-12-1992

df.dtypes
Output=
Name       object
Age       float64
City       object
Salary      int64
DOB        object
dtype: object

date_cols=['DOB']
df=pd.read_csv('temp.txt',parse_dates=date_cols,dayfirst=True)  # the dates in the file are written day-first (DD-MM-YYYY)
df
Output=
   S.No    Name   Age       City  Salary        DOB
0     1  Vishal   NaN     Nagpur   20000 1998-12-22
1     2  Pranay  32.0     Mumbai    3000 1991-02-23
2     3  Akshay  43.0   Banglore    8300 1985-05-12
3     4     Ram  38.0  Hyderabad    3900 1992-12-01

df['DOB'].dt.year
Output=0    1998
1    1991
2    1985
3    1992
Name: DOB, dtype: int64

df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'])
df
Output=
      a       b    c          d       e           f
0  S.No    Name  Age       City  Salary         DOB
1     1  Vishal  NaN     Nagpur   20000  22-12-1998
2     2  Pranay   32     Mumbai    3000  23-02-1991
3     3  Akshay   43   Banglore    8300  12-05-1985
4     4     Ram   38  Hyderabad    3900  01-12-1992


df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'],header=0)
df
Output=
   a       b     c          d      e           f
0  1  Vishal   NaN     Nagpur  20000  22-12-1998
1  2  Pranay  32.0     Mumbai   3000  23-02-1991
2  3  Akshay  43.0   Banglore   8300  12-05-1985
3  4     Ram  38.0  Hyderabad   3900  01-12-1992

df=pd.read_csv('temp.txt',skiprows=2,names=['a','b','c','d','e','f'],header=0)
df
Output=
   a       b   c          d     e           f
0  3  Akshay  43   Banglore  8300  12-05-1985
1  4     Ram  38  Hyderabad  3900  01-12-1992

df=pd.read_csv("temp.txt")
df.loc[0,'Age']=21
df
Output=
   S.No    Name   Age       City  Salary         DOB
0     1  Vishal  21.0     Nagpur   20000  22-12-1998
1     2  Pranay  32.0     Mumbai    3000  23-02-1991
2     3  Akshay  43.0   Banglore    8300  12-05-1985
3     4     Ram  38.0  Hyderabad    3900  01-12-1992
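All the examples above read data. For the storage side, here is a minimal sketch of writing a dataframe back out with to_csv and reading it in again. It uses an in-memory buffer instead of a real file so it runs anywhere; the two rows are sample data taken from the table above.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Vishal", "Pranay"], "Salary": [20000, 3000]})

# Write to an in-memory text buffer; a file path such as "out.csv"
# (a hypothetical name) can be passed to to_csv in the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves out the row labels

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(buffer.getvalue())
print(restored)
```

Passing index=False is a design choice here: without it, the row labels are written as an extra unnamed column that shows up again on reading.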

In this way, you can perform these operations on different files.

This is a vast topic, so try to understand it and practise regularly. 
Best regards from,
msbtenotes:)

THANK YOU!!!
