Pandas is an open-source data science library for Python, used mainly for analysing, cleaning, and manipulating data. Pandas lets you load data, clean and manipulate large, complex datasets, and then analyse the cleaned data to draw conclusions. The library is fast and gives high performance even on large datasets.
Pandas provides the Series (a 1-D array) and the DataFrame (a 2-D array) for manipulating data. Data can be loaded into both a Series and a DataFrame from an array, a list, a dictionary, a tuple, CSV files, Excel files, or a SQL database.
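For a quick illustration (the variable names here are just examples), both structures can be built directly from basic Python objects such as tuples:
import pandas as pd
marks = pd.Series((72, 85, 64))  # Series from a tuple of values
people = pd.DataFrame([('Asha', 72), ('Ravi', 85)], columns=['Name', 'Marks'])  # DataFrame from a list of tuples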
Below is a set of practice questions that will help you build basic knowledge of all types of programming with the Pandas data science library in Python. For more exercises, you can visit https://www.w3resource.com/python-exercises/pandas/ or https://www.geeksforgeeks.org/python-pandas-practice-exercises-questions-and-solutions/
Steps to run the programs below:
- Install Anaconda on your PC/laptop.
- Open Spyder.
- Copy and paste the programs below.
- Press the Run button. You can see the result in the Console section on the right side.
#1. Create a 1-dimensional array of data using a Series in pandas, then convert that Series back into a list
import pandas as pd
import numpy as np
ps = pd.Series([5,9,13,8,15])
print("1-Dimensional array of data using Series :")
print(ps)
ps_list = ps.tolist()  # avoid naming the variable 'list', which shadows the built-in type
print("convert series again into list :")
print(ps_list)
#2. Addition and multiplication of two pandas series
ps1 = pd.Series([5,9,13,8,15])
ps2 = pd.Series([2,8,9,11,4])
ps_add= ps1+ps2
ps_mul= ps1*ps2
print("Addition of two pandas series :")
print(ps_add)
print("Multiplication of two pandas series :")
print(ps_mul)
#3. Convert a dictionary and a NumPy array into a pandas Series
d = {'x':1000, 'y':2000, 'z':3000}  # avoid naming the variable 'dict', which shadows the built-in type
dict_to_series = pd.Series(d)
print("convert Dictionary into pandas Series :")
print(dict_to_series)
num_array = np.array([5,9,13,8,15])
numpy_to_Series = pd.Series(num_array)
print("convert numpy into pandas :")
print(numpy_to_Series)
#4. Adding some new data to an existing Series
ps = pd.Series([2, 8, 12, 'data', 'series', 21])
new_ps = pd.concat([ps, pd.Series(['numpy', 32])])  # Series.append was removed in pandas 2.0; pd.concat replaces it
print("Adding some new data to existing Series :")
print(new_ps)
#5. Reindexing or changing the order of existing series data
ps = pd.Series(data=[2, 8, 12, 'data', 'series', 21], index=['a','b','c','d','e','f'])
reindex_ps= ps.reindex(index=['c','e','f','b','a','b'])
print("Reindexing or change the order of existing series data :")
print(reindex_ps)
#6. Calculate the mean, Standard deviation, median, minimum value and maximum value of given data
#Note: the 0th percentile is the minimum value, the 50th percentile is the median, and the 100th is the maximum value
data =pd.Series([12,23,14,27,9,22,34,19,26,31,45])
print("Mean of given data is :", data.mean())
print("Standard deviation of given data is :", data.std())
print("Minimum value, 25th Percentile, median, 75th percentile, maximum of the given series are :")
data_percentile= np.percentile(data, q=[0,25,50,75,100])
print(data_percentile)
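#Note: the same five-number summary can also be computed with pandas alone; a minimal
#sketch using Series.quantile (which takes q values between 0 and 1):
print(data.quantile([0, 0.25, 0.5, 0.75, 1]))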
#7. Generate a Series from random numbers and calculate the count of each unique number in that series
ps = pd.Series(np.random.randint(10,size =20))
print("Random Series values are :", ps)
count =ps.value_counts()
print("Frequancy of each number are :", count)
#8. Change series of date strings to a timeseries
series_date = pd.Series(['25 Jan 2022', '20210408', '2022/05/12', '21-02-2020', '2021-09-11', '2021-10-10T12:20'])
print("Changed series of date strings to a timeseries :")
#On pandas >= 2.0, parsing mixed date formats like these may need pd.to_datetime(series_date, format='mixed')
print(pd.to_datetime(series_date))
#9. Create Dataframe using list
data_list = [['Akansha', 'TCS', 32], ['Ashish', 'HCL', 32], ['Ankit', 'Canara', 35]]  # avoid shadowing the built-in 'list'
df = pd.DataFrame(data_list, columns=['Name','Organization','Age'])
print("Dataframe using list :")
print(df)
#10. Create Dataframe using Series
num_name = ['Abhishek','Sahil','Swati','Rohan','Prerna']
num_age = [5,9,13,8,15]
num_name_series=pd.Series(num_name)
num_age_series = pd.Series(num_age)
series_dict = {'num_name': num_name_series, 'num_age': num_age_series}  # a dict of Series, not yet a DataFrame
Dataframe = pd.DataFrame(series_dict)
print("Dataframe using Series :")
print(Dataframe)
#11. Create dataframe using dictionary
d = {'num_name':['Aakansha','Rohan','Isha','Ashish','Prerna'], 'num_age':[5,9,13,8,15]}  # avoid shadowing the built-in 'dict'
df = pd.DataFrame(d)
print("Dataframe using dictionary :")
print(df)
#12. Create dataframe using random values
index =['I1','I2','I3','I4','I5']
Col=['C1','C2','C3','C4','C5']
df = pd.DataFrame(np.random.rand(5,5), index=index, columns=Col)
print("Dataframe using random values :")
print(df)
df = df.reindex(columns=['C3','C5','C1','C4','C2'])  # reindex returns a new DataFrame, so assign the result back
print("Dataframe after reindexing :", df)
#13. Iteration in Dataframe
dict11={'num_name':['Aakansha','Rohan','Isha','Ashish','Prerna'],'Organization':['TCS','HCL','Infosys','Sopra','TCS'], 'num_age':[5,9,13,8,15],'Salary':[30000,45000,33000,44000,20000]}
df11 = pd.DataFrame(dict11)
print("Iteration in Dataframe :")
print(df11)
Condition_res=df11[df11['Salary']>40000]
print(Condition_res)
#14. Iteration using loc[] function
print("Iteration using loc[] function :")
for i in range(len(df11)):
    print(df11.loc[i, "num_name"], df11.loc[i, "Salary"])  # label-based access with loc
for i in range(len(df11)):
    print(df11.iloc[i, 0], df11.iloc[i, 2])  # position-based access with iloc
print(df11.loc[1:2,['num_name','num_age']])
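#Note: as an alternative sketch, pandas also offers DataFrame.iterrows(), which yields each
#row as a Series and is often more readable than indexing with range(len(df)):
for idx, row in df11.iterrows():
    print(row['num_name'], row['Salary'])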
#15. Filtering Row 1 and all columns using 'loc' function
print("Filtering Row 1 and all columns using 'loc' function :")
print(df11.loc[0,:])
print(df11.iloc[1:3,0:2]) #iloc provides both row and column slicing
#16. To find the shape of the dataframe
print("shape of the dataframe :")
print(df11.shape)
#17. To find the first two rows of any dataframe
print("first two rows of any dataframe :")
print(df11.head(2))
#18. Finding the maximum value in the particular column of dataframe
print("maximum value in the particular column of dataframe :")
print(df11.Salary.max()) #finding maximum salary
#19. To find the maximum value of all columns
print("maximum value of all columns :")
print(df11.max()) #finding maximum in all columns
#20. Finding row which has the maximum salary
print("Finding row which has the maximum salary :")
print(df11[df11.Salary == df11.Salary.max()])
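#Note: a minimal alternative sketch: Series.idxmax() returns the index label of the
#maximum value, so the same row can be fetched with loc:
print(df11.loc[df11.Salary.idxmax()])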
#21. Similar as above for max, Finding the minimum value in the particular column of dataframe
print("minimum value in the particular column of dataframe :")
print(df11.Salary.min()) #finding minimum salary
#22. Finding row which has the minimum salary
print("Finding row which has the minimum salary :")
print(df11[df11.Salary == df11.Salary.min()])
#23. Filtering for the 'TCS organization' with 'Aakansha' name
print("Filtering for the 'TCS organization' with 'Aakansha' name :")
Res = df11[df11['Organization'].str.contains("TCS") & df11['num_name'].str.contains("Aakansha")]
print(Res)
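#Note: str.contains also matches substrings (e.g. 'TCS' inside a longer name); for exact
#matches, a plain equality comparison is an alternative sketch:
Res_exact = df11[(df11['Organization'] == 'TCS') & (df11['num_name'] == 'Aakansha')]
print(Res_exact)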
#24. Renaming the column name
print("Renaming the column name :")
df11.columns =['NAME', 'ORGANIZATION','AGE','SALARY']
print(df11.columns)
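#Note: to rename only selected columns instead of reassigning all of them, DataFrame.rename
#with a mapping is an alternative; left commented out (and the new name is just an example)
#so the later exercises keep the column names set above:
#df11 = df11.rename(columns={'NAME': 'EMP_NAME'})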
#25. Fetching unique values of column
print("Fetching unique values of column :")
print(df11.ORGANIZATION.unique())
#26. Fetching frequency counts of a column
print("Fetching frequency counts of a column :")
print(df11.ORGANIZATION.value_counts())
#27. Lowercase the values in a column
print("Lowercase the values in a column :")
df11['ORGANIZATION']=df11['ORGANIZATION'].str.lower()
print(df11)
#28. Uppercase the values in a column
print("Uppercase the values in a column :")
df11['NAME']=df11['NAME'].str.upper()
print(df11)
#29. Capitalize first letter in the column
print(" Capitalize first letter in the column :")
df11['ORGANIZATION']=df11['ORGANIZATION'].str.capitalize()
print(df11)
#30. Dropping 'NAME and AGE' column
print("Dropping 'NAME and AGE' column :")
df11.drop(df11.columns[[0, 2]], axis = 1, inplace = True)
print(df11)
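#Note: the same drop can be written by column name, which is usually clearer than
#positions; left commented out because those columns were already dropped above:
#df11.drop(columns=['NAME', 'AGE'], inplace=True)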
#31. Reading CSV files using read_csv
print("Reading CSV files using read_csv :")
dataframe = pd.read_csv("data.csv")  # assumes a data.csv file (here, the breast cancer dataset) in the working directory
#32. Finding column names in given csv file
print("Finding column names in given csv file :")
print(dataframe.columns)
#33. Finding first 10 rows in the csv file
print("Finding first 10 rows in the csv file :")
print(dataframe.head(10))
#34. Finding the type of column 'fractal_dimension_worst'
print("Finding the type of column 'fractal_dimension_worst' :")
print(dataframe['fractal_dimension_worst'].dtypes)
#35. Finding last 10 rows from csv file
print("Finding last 10 rows from csv file :")
print(dataframe.tail(10))
#36. Show detailed information about the csv file
print("Give detail information of csv file :")
print(dataframe.info())  # info() prints the summary itself and returns None (hence the 'None' in the output)
#37. dropna(inplace=True) removes all rows which contain NULL values directly from the DataFrame itself; without inplace=True, dropna returns a new DataFrame and leaves the original unchanged.
#dataframe.dropna(inplace=True)
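#Note: a small self-contained sketch of dropna on a toy DataFrame (None becomes NaN):
demo = pd.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})
print(demo.dropna())       # returns a copy keeping only the complete first row
demo.dropna(inplace=True)  # modifies demo itself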
#38. Replace NULL values in a particular column with 75:
print("Replace NULL values in the particular columns with 75 :")
dataframe["radius_worst"] = dataframe["radius_worst"].fillna(75)  # assigning back avoids chained-assignment issues with inplace=True
#39. Replacing null values in the column with either mean, median or mode
print("Replacing null values in the column with either mean, median or mode :")
rp = dataframe["radius_worst"].median()
dataframe["radius_worst"].fillna(rp, inplace = True)
#40. If you want to drop rows with some condition
for i in dataframe.index:
    if dataframe.loc[i, "fractal_dimension_worst"] < 0.03:
        dataframe.drop(i, inplace=True)
print('To find value less than 0.03 :', dataframe[dataframe['fractal_dimension_worst']<0.03])
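#Note: dropping rows one by one inside a loop works, but a boolean mask is the more
#idiomatic pandas sketch for the same condition:
#dataframe = dataframe[dataframe["fractal_dimension_worst"] >= 0.03]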
#41. If you want to replace any particular value; here 15 is the row label
print(" if you want to replace any particular value, here 15 is the row :")
dataframe.loc[15,'fractal_dimension_worst']=0.1
print(dataframe.loc[15,'fractal_dimension_worst'])
#42. Finding first 2 columns
print("Finding first 2 columns :")
print(dataframe.iloc[:,:2].head())
Results of the above exercises
1-Dimensional array of data using Series :
0 5
1 9
2 13
3 8
4 15
dtype: int64
convert series again into list :
[5, 9, 13, 8, 15]
Addition of two pandas series :
0 7
1 17
2 22
3 19
4 19
dtype: int64
Multiplication of two pandas series :
0 10
1 72
2 117
3 88
4 60
dtype: int64
convert Dictionary into pandas Series :
x 1000
y 2000
z 3000
dtype: int64
convert numpy into pandas Series :
0 5
1 9
2 13
3 8
4 15
dtype: int32
Adding some new data to existing Series :
0 2
1 8
2 12
3 data
4 series
5 21
0 numpy
1 32
dtype: object
Reindexing or changing the order of existing series data :
c 12
e series
f 21
b 8
a 2
b 8
dtype: object
Mean of given data is : 23.818181818181817
Standard deviation of given data is : 10.495886640186066
Minimum value, 25th Percentile, median, 75th percentile, maximum of the given series are :
[ 9. 16.5 23. 29. 45. ]
Random Series values are : 0 3
1 0
2 6
3 8
4 0
5 4
6 1
7 1
8 7
9 3
10 6
11 0
12 7
13 2
14 3
15 9
16 0
17 1
18 3
19 9
dtype: int32
Frequency of each number : 3 4
0 4
1 3
6 2
7 2
9 2
8 1
4 1
2 1
dtype: int64
Changed series of date strings to a timeseries :
0 2022-01-25 00:00:00
1 2021-04-08 00:00:00
2 2022-05-12 00:00:00
3 2020-02-21 00:00:00
4 2021-09-11 00:00:00
5 2021-10-10 12:20:00
dtype: datetime64[ns]
Dataframe using list :
Name Organization Age
0 Akansha TCS 32
1 Ashish HCL 32
2 Ankit Canara 35
Dataframe using Series :
num_name num_age
0 Abhishek 5
1 Sahil 9
2 Swati 13
3 Rohan 8
4 Prerna 15
Dataframe using dictionary :
num_name num_age
0 Aakansha 5
1 Rohan 9
2 Isha 13
3 Ashish 8
4 Prerna 15
Dataframe using random values :
C1 C2 C3 C4 C5
I1 0.896202 0.999017 0.374783 0.152988 0.538044
I2 0.326453 0.883428 0.110783 0.508416 0.194583
I3 0.109798 0.545732 0.977045 0.726761 0.032932
I4 0.655264 0.194990 0.429043 0.810116 0.453745
I5 0.137495 0.902280 0.576180 0.931368 0.271951
Dataframe after reindexing : C3 C5 C1 C4 C2
I1 0.374783 0.538044 0.896202 0.152988 0.999017
I2 0.110783 0.194583 0.326453 0.508416 0.883428
I3 0.977045 0.032932 0.109798 0.726761 0.545732
I4 0.429043 0.453745 0.655264 0.810116 0.194990
I5 0.576180 0.271951 0.137495 0.931368 0.902280
Iteration in Dataframe :
num_name Organization num_age Salary
0 Aakansha TCS 5 30000
1 Rohan HCL 9 45000
2 Isha Infosys 13 33000
3 Ashish Sopra 8 44000
4 Prerna TCS 15 20000
num_name Organization num_age Salary
1 Rohan HCL 9 45000
3 Ashish Sopra 8 44000
Iteration using loc[] function :
Aakansha 30000
Rohan 45000
Isha 33000
Ashish 44000
Prerna 20000
Aakansha 5
Rohan 9
Isha 13
Ashish 8
Prerna 15
num_name num_age
1 Rohan 9
2 Isha 13
Filtering Row 1 and all columns using 'loc' function :
num_name Aakansha
Organization TCS
num_age 5
Salary 30000
Name: 0, dtype: object
num_name Organization
1 Rohan HCL
2 Isha Infosys
shape of the dataframe :
(5, 4)
first two rows of any dataframe :
num_name Organization num_age Salary
0 Aakansha TCS 5 30000
1 Rohan HCL 9 45000
maximum value in the particular column of dataframe :
45000
maximum value of all columns :
num_name Rohan
Organization TCS
num_age 15
Salary 45000
dtype: object
Finding row which has the maximum salary :
num_name Organization num_age Salary
1 Rohan HCL 9 45000
minimum value in the particular column of dataframe :
20000
Finding row which has the minimum salary :
num_name Organization num_age Salary
4 Prerna TCS 15 20000
Filtering for the 'TCS organization' with 'Aakansha' name :
num_name Organization num_age Salary
0 Aakansha TCS 5 30000
Renaming the column name :
Index(['NAME', 'ORGANIZATION', 'AGE', 'SALARY'], dtype='object')
Fetching unique values of column :
['TCS' 'HCL' 'Infosys' 'Sopra']
Fetching frequency counts of a column :
TCS 2
HCL 1
Infosys 1
Sopra 1
Name: ORGANIZATION, dtype: int64
Lowercase the values in a column :
NAME ORGANIZATION AGE SALARY
0 Aakansha tcs 5 30000
1 Rohan hcl 9 45000
2 Isha infosys 13 33000
3 Ashish sopra 8 44000
4 Prerna tcs 15 20000
Uppercase the values in a column :
NAME ORGANIZATION AGE SALARY
0 AAKANSHA tcs 5 30000
1 ROHAN hcl 9 45000
2 ISHA infosys 13 33000
3 ASHISH sopra 8 44000
4 PRERNA tcs 15 20000
Capitalize first letter in the column :
NAME ORGANIZATION AGE SALARY
0 AAKANSHA Tcs 5 30000
1 ROHAN Hcl 9 45000
2 ISHA Infosys 13 33000
3 ASHISH Sopra 8 44000
4 PRERNA Tcs 15 20000
Dropping 'NAME and AGE' column :
ORGANIZATION SALARY
0 Tcs 30000
1 Hcl 45000
2 Infosys 33000
3 Sopra 44000
4 Tcs 20000
Reading CSV files using read_csv :
Finding column names in given csv file :
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
dtype='object')
Finding first 10 rows in the csv file :
id diagnosis ... fractal_dimension_worst Unnamed: 32
0 842302 M ... 0.11890 NaN
1 842517 M ... 0.08902 NaN
2 84300903 M ... 0.08758 NaN
3 84348301 M ... 0.17300 NaN
4 84358402 M ... 0.07678 NaN
5 843786 M ... 0.12440 NaN
6 844359 M ... 0.08368 NaN
7 84458202 M ... 0.11510 NaN
8 844981 M ... 0.10720 NaN
9 84501001 M ... 0.20750 NaN
[10 rows x 33 columns]
Finding the type of column 'fractal_dimension_worst' :
float64
Finding last 10 rows from csv file :
id diagnosis ... fractal_dimension_worst Unnamed: 32
559 925291 B ... 0.08732 NaN
560 925292 B ... 0.08321 NaN
561 925311 B ... 0.05905 NaN
562 925622 M ... 0.14090 NaN
563 926125 M ... 0.09873 NaN
564 926424 M ... 0.07115 NaN
565 926682 M ... 0.06637 NaN
566 926954 M ... 0.07820 NaN
567 927241 M ... 0.12400 NaN
568 92751 B ... 0.07039 NaN
[10 rows x 33 columns]
Give detail information of csv file :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
None
Replace NULL values in the particular columns with 75 :
Replacing null values in the column with either mean, median or mode :
To find value less than 0.03 : Empty DataFrame
Columns: [id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, Unnamed: 32]
Index: []
if you want to replace any particular value, here 15 is the row :
0.1
Finding first 2 columns :
id diagnosis
0 842302 M
1 842517 M
2 84300903 M
3 84348301 M
4 84358402 M