Pandas with Python
What is pandas?
Hello everyone!
In this blog, we are going to learn about pandas.
Let's get started.
Course content:
- Introduction to pandas
- Data Structures
- Series
- DataFrame
- Re-indexing
- Operation between series and dataframe
- Sorting and Ranking
- Descriptive statistics
- Data loading, storage, and file formats
Introduction to Pandas:
- Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing, and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High-performance merging and joining of data.
Data Structures:
- Pandas has historically dealt with the following three data structures:
- Series: 1-dimensional
- DataFrame: 2-dimensional
- Panel: 3-dimensional (deprecated in pandas 0.20 and removed in 0.25; use a DataFrame with a MultiIndex instead)
These data structures are built on top of NumPy arrays, which means they
are fast.
The best way to think of these data structures is that the higher-dimensional
data structure is a container of its lower-dimensional data
structure.
For example, a DataFrame is a container of Series, and a Panel was a container of
DataFrames.
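The container idea can be checked directly: selecting one column of a DataFrame gives back a Series that shares the DataFrame's row index. A minimal sketch, using made-up names and scores purely for illustration:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the idea.
df = pd.DataFrame({"name": ["Anil", "Anup"], "score": [10, 20]})

col = df["score"]                        # select a single column
print(type(col))                         # the column is a pandas Series
print(col.index.equals(df.index))        # it carries the same row labels
```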
Series:
- A Series is a 1-dimensional, array-like structure with homogeneous data, capable of holding data of any type (int, float, string, Python objects, etc.).
- The axis labels are collectively called the index.
- Key points: homogeneous data, size immutable, values of data mutable.
- A pandas Series can be created with the pd.Series constructor from an array, a dict, or a scalar, as the examples below show:
import pandas as pd
import numpy as np

# Empty Series
s=pd.Series(dtype=float)
print(s)
print(type(s))
Output=
Series([], dtype: float64)
<class 'pandas.core.series.Series'>

# Series from a NumPy array
data=np.array(['a','b','c','d'])
print(data)
Output=['a' 'b' 'c' 'd']

s=pd.Series(data)
s
Output=
0    a
1    b
2    c
3    d
dtype: object

s=pd.Series(data,index=[111,222,333,444])
s
Output=
111    a
222    b
333    c
444    d
dtype: object

# Series from a dict
data={'a':0,'b':1,'c':2}
print(data)
Output={'a': 0, 'b': 1, 'c': 2}

s=pd.Series(data)
s
Output=
a    0
b    1
c    2
dtype: int64

# Labels missing from the dict ('d') become NaN
s=pd.Series(data,index=['b','c','d','a'])
s
Output=
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

# Series from a scalar
s=pd.Series(5,index=[0,1,2,3])
s
Output=
0    5
1    5
2    5
3    5
dtype: int64

# Accessing data by position and by label
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
s
Output=
a    1
b    2
c    3
d    4
e    5
dtype: int64

print(s.iloc[0])   # first element by position (s[0] on a labelled index is deprecated in recent pandas)
print(s[:3])
print(s[-3:])
print(s['a'])
print(s[['a','c','d']])
Output=
1
a    1
b    2
c    3
dtype: int64
c    3
d    4
e    5
dtype: int64
1
a    1
c    3
d    4
dtype: int64
Data Operations:

s=pd.Series(np.random.randn(8))
print(s)
Output=
0   -0.263335
1    0.598104
2    2.115186
3    0.163075
4   -0.529759
5    1.268830
6    0.215765
7    1.313002
dtype: float64

s.axes
Output=[RangeIndex(start=0, stop=8, step=1)]

s2=pd.Series(np.random.randn(4),index=[11,12,13,14])
s2
Output=
11    0.661457
12    0.112885
13   -0.334969
14    0.350130
dtype: float64

s2.axes
Output=[Int64Index([11, 12, 13, 14], dtype='int64')]

s.dtypes
Output=dtype('float64')

s.ndim
Output=1

s.size
Output=8

se=pd.Series(dtype=float)
print(se)
Output=Series([], dtype: float64)

se.empty
Output=True

s.empty
Output=False

s.values
Output=
array([-0.26333516,  0.59810371,  2.11518637,  0.16307502, -0.52975919,
        1.26882994,  0.21576517,  1.31300243])

s.head(2)
Output=
0   -0.263335
1    0.598104
dtype: float64

s.tail(2)
Output=
6    0.215765
7    1.313002
dtype: float64

s.head()
Output=
0   -0.263335
1    0.598104
2    2.115186
3    0.163075
4   -0.529759
dtype: float64

s.tail()
Output=
3    0.163075
4   -0.529759
5    1.268830
6    0.215765
7    1.313002
dtype: float64
DataFrame:
- A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
- Columns can be of different types.
- Size is mutable.
- Labeled axes (rows and columns).
- Can perform arithmetic operations on rows and columns.
A pandas DataFrame is created with the constructor pandas.DataFrame(data,index,columns,dtype,copy), which accepts various inputs such as:
- Lists
- Dictionary
- Series
- Numpy ndarrays
- Another DataFrame
# Empty DataFrame
df=pd.DataFrame()
print(df)
Output=
Empty DataFrame
Columns: []
Index: []

# DataFrame from a list
data=[11,22,33,44,55]
df=pd.DataFrame(data)
print(df)
Output=
    0
0  11
1  22
2  33
3  44
4  55

# DataFrame from a list of lists
data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"])
print(df)
Output=
      Name  age
0     Alok   10
1  Bhushan   20
2     Anil   30

# Convert only the numeric column to float; passing dtype=float to the
# constructor fails in recent pandas because "Name" holds strings
data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"]).astype({"age":float})
print(df)
Output=
      Name   age
0     Alok  10.0
1  Bhushan  20.0
2     Anil  30.0

# DataFrame from a dict of lists
data={"Name":["Swapnil","Anil","Anup","Viraj"], "Age":[20,21,22,24]}
print(data)
Output={'Name': ['Swapnil', 'Anil', 'Anup', 'Viraj'], 'Age': [20, 21, 22, 24]}

df=pd.DataFrame(data)
print(df)
Output=
      Name  Age
0  Swapnil   20
1     Anil   21
2     Anup   22
3    Viraj   24

# DataFrame from a dict of Series
d={'Names':pd.Series(["Praanay","Prem","Atul","Amar","Sarthak"]),"Age":pd.Series([20,25,21,22,23]),"Rating":pd.Series([2.2,2.3,5.3,1.6,4.5])}
df=pd.DataFrame(d,columns=['Names',"Age","Rating"])
print(df)
Output=
     Names  Age  Rating
0  Praanay   20     2.2
1     Prem   25     2.3
2     Atul   21     5.3
3     Amar   22     1.6
4  Sarthak   23     4.5

# Transpose
print(df.T)
Output=
              0     1     2     3        4
Names   Praanay  Prem  Atul  Amar  Sarthak
Age          20    25    21    22       23
Rating      2.2   2.3   5.3   1.6      4.5

print(df.axes)
Output=[RangeIndex(start=0, stop=5, step=1), Index(['Names', 'Age', 'Rating'], dtype='object')]

print(df.ndim)
Output=2

print(df.shape)
Output=(5, 3)

print(df.size)
Output=15

print(df.values)
Output=
[['Praanay' 20 2.2]
 ['Prem' 25 2.3]
 ['Atul' 21 5.3]
 ['Amar' 22 1.6]
 ['Sarthak' 23 4.5]]

# DataFrame from a NumPy ndarray with row and column labels
data=pd.DataFrame(np.arange(16).reshape(4,4),index=["Indore","Raipur","Nagpur","Hyderabad"],columns=['one','two','three','four'])
print(data)
Output=
           one  two  three  four
Indore       0    1      2     3
Raipur       4    5      6     7
Nagpur       8    9     10    11
Hyderabad   12   13     14    15
Re-Indexing:
- A critical method on pandas objects is reindex(), which means to create a new object with the data conformed to a new index.
- For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing.
- The method option allows us to do this, using a method such as ffill, which forward-fills the values.
# frame is assumed to be an existing DataFrame with index ['a','b','c'];
# of the requested columns, only 'Raipur' is present in it
states=["Raipur","Indore","Hyderabad"]
frame.reindex(columns=states)
Output=
   Raipur  Indore  Hyderabad
a       2     NaN        NaN
b       5     NaN        NaN
c       8     NaN        NaN

frame.reindex(index=['a','b','c','d'],columns=states)
Output=
   Raipur  Indore  Hyderabad
a     2.0     NaN        NaN
b     5.0     NaN        NaN
c     8.0     NaN        NaN
d     NaN     NaN        NaN
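Forward filling during a reindex can be sketched as follows. The Series and its values are hypothetical, chosen only to leave gaps in an ordered index; method="ffill" needs the original index to be sorted.

```python
import pandas as pd

# Hypothetical ordered data with gaps at positions 1, 3 and 5.
obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Reindex onto 0..5; each new label takes the last value seen before it.
filled = obj.reindex(range(6), method="ffill")
print(filled)
```

Without the method argument, the new labels 1, 3 and 5 would hold NaN instead.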
Drop Command:
- With the dataframe, index values can be deleted from either axis.
- Dropping one or more entries from an axis is easy if one has an index array or list without those entries. As that can require a bit of set logic, the drop method will return a new object with the indicated value or values deleted from an axis.
obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)
Output=
a    0
b    1
c    2
d    3
e    4
dtype: int32

new_obj=obj.drop('c')
new_obj
Output=
a    0
b    1
d    3
e    4
dtype: int32

new_obj=obj.drop(['c','d'])
new_obj
Output=
a    0
b    1
e    4
dtype: int32
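The Series example covers only one axis. For a DataFrame, a sketch of dropping from either axis might look like this, reusing the city table built in the DataFrame section; the variable names rows_dropped and cols_dropped are just illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Indore", "Raipur", "Nagpur", "Hyderabad"],
                    columns=["one", "two", "three", "four"])

# Drop row labels (axis 0 is the default).
rows_dropped = data.drop(["Raipur", "Nagpur"])

# Drop column labels.
cols_dropped = data.drop(columns=["two", "four"])

print(rows_dropped)
print(cols_dropped)
```

In both cases drop returns a new object and leaves data unchanged.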
Arithmetic and data alignment:
- One of the most important pandas features is the behavior of arithmetic between objects with different indexes.
- When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.
- The internal data alignment introduces NaN values at the index labels that don't overlap.
- In the case of a DataFrame, alignment is performed on both the rows and the columns, which returns a DataFrame whose index and columns are the unions of the ones in each DataFrame.
- Relatedly, when reindexing a Series or DataFrame, one can also specify a different fill value.
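The alignment rules above can be sketched with two small Series whose indexes only partly overlap. The values are arbitrary and used only for illustration.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union a, b, c, d; labels present in only one
# operand ('a' and 'd') come out as NaN.
total = s1 + s2

# The add method with fill_value treats a missing entry as 0 instead.
filled = s1.add(s2, fill_value=0)

# reindex accepts fill_value too, replacing the default NaN for new labels.
reindexed = s1.reindex(["a", "b", "c", "d"], fill_value=0)

print(total)
print(filled)
print(reindexed)
```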
Operations between Dataframe & Series:
- As with NumPy arrays, arithmetic between a DataFrame and a Series is well defined.
- By default, arithmetic between a DataFrame and a Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows.
- If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union.
- To broadcast over the columns instead, matching on the rows, one has to use one of the arithmetic methods (such as sub) and pass the axis to match on.
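# frame is reconstructed here from the printed output below
frame=pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index=['Raipur','Nagpur','hyderabad','indore'])

series2=pd.Series(range(3),index=['b','e','f'])
series2
Output=
b    0
e    1
f    2
dtype: int64

print(frame)
Output=
           b   d   e
Raipur     0   1   2
Nagpur     3   4   5
hyderabad  6   7   8
indore     9  10  11

# Columns 'd' and 'f' exist in only one operand, so they become NaN
frame+series2
Output=
             b   d     e   f
Raipur     0.0 NaN   3.0 NaN
Nagpur     3.0 NaN   6.0 NaN
hyderabad  6.0 NaN   9.0 NaN
indore     9.0 NaN  12.0 NaN

series3=frame['d']
series3
Output=
Raipur        1
Nagpur        4
hyderabad     7
indore       10
Name: d, dtype: int32

frame
Output=
           b   d   e
Raipur     0   1   2
Nagpur     3   4   5
hyderabad  6   7   8
indore     9  10  11

# Match on the row index and broadcast across the columns
frame.sub(series3,axis="index")
Output=
           b  d  e
Raipur    -1  0  1
Nagpur    -1  0  1
hyderabad -1  0  1
indore    -1  0  1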
Function application and mapping:
- NumPy ufuncs (element-wise array methods) work fine with pandas objects.
- Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this.
- Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
- The function passed to apply need not return a scalar value; it can also return a Series with multiple values.
frame=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['raipur','nagpur','hyderabad','indore'])
frame
Output=
                  b         d         e
raipur     1.977048 -1.860493  0.768591
nagpur    -1.498661 -2.329090  0.222861
hyderabad  0.110777 -0.467806 -0.943308
indore    -0.033976 -0.147853  0.157741

np.abs
Output=<ufunc 'absolute'>

# Range (max minus min) of each column
f=lambda x:x.max()-x.min()
frame.apply(f)
Output=
b    3.475709
d    2.181237
e    1.711899
dtype: float64

# Range of each row
frame.apply(f,axis='columns')
Output=
raipur       3.837542
nagpur       2.551952
hyderabad    1.054084
indore       0.305594
dtype: float64

# A function that returns a Series yields one row per returned label
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Output=
            b         d         e
min -1.498661 -2.329090 -0.943308
max  1.977048 -0.147853  0.768591
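The example above displays the np.abs ufunc itself without calling it. A minimal sketch of applying it to a DataFrame follows, together with a built-in reduction used in place of apply. The small frame here has fixed, made-up values so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values instead of random ones.
frame = pd.DataFrame([[1.5, -2.0], [-0.5, 4.0]],
                     columns=["b", "d"], index=["raipur", "nagpur"])

abs_frame = np.abs(frame)   # ufunc applied element-wise; labels are preserved
col_max = frame.max()       # built-in column-wise reduction, no apply needed

print(abs_frame)
print(col_max)
```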
Sorting and Ranking:
- Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index() method, which returns a new, sorted object.
- With a DataFrame, one can sort by index on either axis. The data is sorted in ascending order by default but can be sorted in descending order too.
- To sort by the values in one or more columns instead, use sort_values().
- To rank values, use the rank methods of Series and DataFrame; by default, rank breaks ties by assigning each group the mean rank.
- Ranks can also be assigned according to the order they're observed in the data (method='first').
- Naturally, one can rank in descending order, too (ascending=False).
obj=pd.Series(range(4),index=['d','a','b','c'])
obj
Output=
d    0
a    1
b    2
c    3
dtype: int64

obj.sort_index()
Output=
a    1
b    2
c    3
d    0
dtype: int64

frame=pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)
Output=
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

frame.sort_index()
Output=
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

frame.sort_index(axis=1)
Output=
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

frame=pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
Output=
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

frame.sort_values(by='b')
Output=
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
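The examples above cover sorting only. Ranking could be sketched as follows, using a small Series with made-up values that contains ties; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical values; 7 and 4 each appear twice, so there are ties.
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

mean_rank = obj.rank()                  # default: tied values share the mean rank
first_rank = obj.rank(method="first")   # ties broken by order of appearance
desc_rank = obj.rank(ascending=False)   # largest value gets the lowest rank

print(mean_rank)
print(first_rank)
print(desc_rank)
```

With the default method, the two 7s occupy ranks 6 and 7 and therefore both receive 6.5.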
Axis indexes with duplicate values:
- All of the examples so far have had unique axis labels (index values). While many pandas functions (like reindex()) require that the labels be unique, it's not mandatory.
- The index's is_unique property tells you whether its values are unique or not.
- Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while a label with a single entry returns a scalar value.
obj=pd.Series(range(5),index=['a','a','b','b','c'])
obj
Output=
a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique
Output=False

obj['a']
Output=
a    0
a    1
dtype: int64

obj['c']
Output=4

df=pd.DataFrame(np.random.randn(4,3),index=['a','a','b','c'])
df
Output=
          0         1         2
a -0.948989 -0.236842  1.203461
a -1.186551  0.934325 -1.282523
b  0.679511 -1.089725  1.387880
c  0.743163 -0.895804  0.361094

df.loc['b']
Output=
0    0.679511
1   -1.089725
2    1.387880
Name: b, dtype: float64

df.loc['a']
Output=
          0         1         2
a -0.948989 -0.236842  1.203461
a -1.186551  0.934325 -1.282523
Descriptive statistics:
- Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods of NumPy arrays, they are all built from the ground up to exclude missing data.
- NA values are excluded unless the entire slice is NA. This can be disabled using the skipna option.
df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'],columns=['one','two'])
df
Output=
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.sum()
Output=
one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')
Output=
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis='columns',skipna=False)
Output=
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.cumsum()
Output=
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

df.describe()
Output=
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

obj=pd.Series(['a','a','b','c']*4)
obj
Output=
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

obj.describe()
Output=
count     16
unique     3
top        a
freq       8
dtype: object
Data loading, storage, and file formats:
- The tools and libraries for data analysis are of little use if one can't easily import and export data in Python. We will focus on input and output with pandas objects, though there are of course numerous tools in other libraries to aid in this process.
- Input and output typically fall into a few main categories:
- Reading text files and other more efficient on-disk formats
- Loading data from databases
- Interacting with network sources like web APIs
Here is a txt file for practice; in the future you will also have to work with databases.
Download txt file (temp)
# Read the whole file
df=pd.read_csv("temp.txt")
df
Output=
   S.No    Name   Age       City  Salary         DOB
0     1  Vishal   NaN     Nagpur   20000  22-12-1998
1     2  Pranay  32.0     Mumbai    3000  23-02-1991
2     3  Akshay  43.0   Banglore    8300  12-05-1985
3     4     Ram  38.0  Hyderabad    3900  01-12-1992

print(df.shape)
Output=(4, 6)

# Read only selected columns
df=pd.read_csv("temp.txt",usecols=["Name","Age"])
df
Output=
     Name   Age
0  Vishal   NaN
1  Pranay  32.0
2  Akshay  43.0
3     Ram  38.0

# Use a column as the row index
df=pd.read_csv("temp.txt",index_col=['S.No'])
df
Output=
        Name   Age       City  Salary         DOB
S.No
1     Vishal   NaN     Nagpur   20000  22-12-1998
2     Pranay  32.0     Mumbai    3000  23-02-1991
3     Akshay  43.0   Banglore    8300  12-05-1985
4        Ram  38.0  Hyderabad    3900  01-12-1992

df.dtypes
Output=
Name       object
Age       float64
City       object
Salary      int64
DOB        object
dtype: object

# Parse the DOB column as dates; the file stores them as DD-MM-YYYY,
# so dayfirst=True is needed to read every row consistently
date_cols=['DOB']
df=pd.read_csv('temp.txt',parse_dates=date_cols,dayfirst=True)
df
Output=
   S.No    Name   Age       City  Salary        DOB
0     1  Vishal   NaN     Nagpur   20000 1998-12-22
1     2  Pranay  32.0     Mumbai    3000 1991-02-23
2     3  Akshay  43.0   Banglore    8300 1985-05-12
3     4     Ram  38.0  Hyderabad    3900 1992-12-01

df['DOB'].dt.year
Output=
0    1998
1    1991
2    1985
3    1992
Name: DOB, dtype: int64

# Supplying names without header=0 keeps the file's header line as a data row
df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'])
df
Output=
      a       b    c          d       e           f
0  S.No    Name  Age       City  Salary         DOB
1     1  Vishal  NaN     Nagpur   20000  22-12-1998
2     2  Pranay   32     Mumbai    3000  23-02-1991
3     3  Akshay   43   Banglore    8300  12-05-1985
4     4     Ram   38  Hyderabad    3900  01-12-1992

# header=0 replaces the file's header line with the given names
df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'],header=0)
df
Output=
   a       b     c          d      e           f
0  1  Vishal   NaN     Nagpur  20000  22-12-1998
1  2  Pranay  32.0     Mumbai   3000  23-02-1991
2  3  Akshay  43.0   Banglore   8300  12-05-1985
3  4     Ram  38.0  Hyderabad   3900  01-12-1992

# Skip the first two lines, then treat the next line as the header being replaced
df=pd.read_csv('temp.txt',skiprows=2,names=['a','b','c','d','e','f'],header=0)
df
Output=
   a       b   c          d     e           f
0  3  Akshay  43   Banglore  8300  12-05-1985
1  4     Ram  38  Hyderabad  3900  01-12-1992

# Fill in a missing value after loading
df=pd.read_csv("temp.txt")
df.loc[0,'Age']=21
df
Output=
   S.No    Name   Age       City  Salary         DOB
0     1  Vishal  21.0     Nagpur   20000  22-12-1998
1     2  Pranay  32.0     Mumbai    3000  23-02-1991
2     3  Akshay  43.0   Banglore    8300  12-05-1985
3     4     Ram  38.0  Hyderabad    3900  01-12-1992
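The examples above only read data. The storage half can be sketched with to_csv, shown here as a round trip through an in-memory buffer so that it runs without any file on disk. The two-row table is a made-up excerpt; in practice one would pass a file path to to_csv instead of the buffer.

```python
import io
import pandas as pd

# Hypothetical small table, used only for illustration.
df = pd.DataFrame({"Name": ["Pranay", "Akshay"], "Age": [32.0, 43.0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # write CSV text; index=False omits the row labels
buffer.seek(0)                   # rewind so the text can be read back

restored = pd.read_csv(buffer)
print(restored)
```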
In the same way, you can apply these operations to other files.
This is a vast topic; try to understand it and practise
regularly.
Best regards from,
msbtenotes:)