Monday, 31 January 2022

Pandas Practice: Data Science Library in Python

Pandas is an open-source data science library for Python, used mainly for analyzing, cleaning and manipulating data. It lets you load data, clean and reshape large and complex datasets, and then draw conclusions from the cleaned result. Pandas is fast and performs well even on large datasets.

Pandas provides the Series (a 1-D labelled array) and the DataFrame (a 2-D labelled table) for manipulating data. Data can be loaded into both from arrays, lists, dictionaries, tuples, CSV files, Excel files and SQL databases.
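As a quick illustration of the two structures (a minimal sketch with made-up values, using the standard pandas constructors):

```python
import pandas as pd

# a Series is a labelled 1-D array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# a DataFrame is a labelled 2-D table, here built from a dictionary
df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [25, 30]})

print(s["b"])    # 20
print(df.shape)  # (2, 2)
```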

Below is a set of exercises that cover the basics of programming with the pandas data science library in Python. For more practice, you can visit https://www.w3resource.com/python-exercises/pandas/ or https://www.geeksforgeeks.org/python-pandas-practice-exercises-questions-and-solutions/

Steps to run the programs below:

  1. Install Anaconda on your PC/laptop.
  2. Open Spyder.
  3. Copy and paste the programs below.
  4. Press the Run button. The result appears in the Console pane on the right side.
#1. Create a 1-Dimensional array of data using Series in Pandas and then convert that Series back into a list
import pandas as pd
import numpy as np
ps = pd.Series([5,9,13,8,15])
print("1-Dimensional array of data using Series :")
print(ps)
ps_list = ps.tolist()  # avoid shadowing the built-in 'list'
print("convert series again into list :")
print(ps_list)

#2. Addition and multiplication of two pandas series
ps1 = pd.Series([5,9,13,8,15])
ps2 = pd.Series([2,8,9,11,4])
ps_add= ps1+ps2
ps_mul= ps1*ps2
print("Addition of two pandas series :")
print(ps_add)
print("Multiplication of two pandas series :")
print(ps_mul)
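One point worth noting (not shown above): arithmetic between two Series aligns on the index labels, not on position, so a label missing from either side produces NaN. A small sketch with hypothetical data:

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])

c = a + b  # aligns on labels: only 'y' appears in both Series

print(c)
# x     NaN
# y    12.0
# z     NaN
```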

#3. Convert a dictionary and a numpy array into pandas Series
d = {'x':1000, 'y':2000, 'z':3000}  # avoid shadowing the built-in 'dict'
dict_to_series = pd.Series(d)
print("convert Dictionary into pandas :")
print(dict_to_series)
num_array = np.array([5,9,13,8,15])
numpy_to_Series = pd.Series(num_array)
print("convert numpy into pandas :")
print(numpy_to_Series)

#4. Adding some new data to an existing Series
ps = pd.Series([2, 8, 12, 'data', 'series', 21])
# Series.append was deprecated and removed in pandas 2.0; pd.concat is the current way
new_ps = pd.concat([ps, pd.Series(['numpy', 32])])
print("Adding some new data to existing Series :")
print(new_ps)

#5. Reindexing or change the order of existing series data
ps = pd.Series(data=[2, 8, 12, 'data', 'series', 21], index=['a','b','c','d','e','f'])
reindex_ps= ps.reindex(index=['c','e','f','b','a','b'])
print("Reindexing or change the order of existing series data :")
print(reindex_ps)

#6. Calculate the mean, standard deviation, median, minimum value and maximum value of given data
#Note: the 0th percentile is the minimum value, the 50th percentile is the median, and the 100th percentile is the maximum value
data =pd.Series([12,23,14,27,9,22,34,19,26,31,45])
print("Mean of given data is :", data.mean())
print("Standard deviation of given data is :", data.std())
print("Minimum value, 25th Percentile, median, 75th percentile, maximum of the given series are :")
data_percentile= np.percentile(data, q=[0,25,50,75,100])
print(data_percentile)
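The same summary can be pulled in one call with `Series.describe()`, which returns count, mean, standard deviation, min, the quartiles and max together; a sketch on the same data:

```python
import pandas as pd

data = pd.Series([12, 23, 14, 27, 9, 22, 34, 19, 26, 31, 45])
summary = data.describe()  # count, mean, std, min, 25%, 50%, 75%, max

print(summary["50%"])                  # the median: 23.0
print(summary["min"], summary["max"])  # 9.0 45.0
```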

#7. Generate a Series of random numbers and count how often each value occurs
ps = pd.Series(np.random.randint(10,size =20))
print("Random Series values are :", ps)
count =ps.value_counts()
print("Frequency of each number are :", count)

#8. Convert a series of date strings to a timeseries
series_date = pd.Series(['25 Jan 2022', '20210408', '2022/05/12', '21-02-2020', '2021-09-11', '2021-10-10T12:20'])
print("Changed series of date strings to a timeseries :")
print(pd.to_datetime(series_date))
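A caution when adapting this: newer pandas releases (2.0 and later) are stricter about mixed date formats and may require `format='mixed'`, and an ambiguous string like '21-02-2020' is safest parsed with an explicit hint. A minimal sketch using `dayfirst`:

```python
import pandas as pd

# dayfirst=True tells pandas that '21-02-2020' means 21 February 2020
ts = pd.to_datetime(pd.Series(["21-02-2020", "11-09-2021"]), dayfirst=True)

print(ts.dt.month.tolist())  # [2, 9]
print(ts.dt.day.tolist())    # [21, 11]
```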

#9. Create a DataFrame from a list
rows = [['Akansha', 'TCS', 32],['Ashish','HCL',32],['Ankit','Canara',35]]  # avoid shadowing the built-in 'list'
df = pd.DataFrame(rows, columns=['Name','Organization','Age'])
print("Dataframe using list :")
print(df)

#10. Create Dataframe using Series
num_name = ['Abhishek','Sahil','Swati','Rohan','Prerna']
num_age = [5,9,13,8,15]
num_name_series=pd.Series(num_name)
num_age_series = pd.Series(num_age)
data = {'num_name':num_name_series, 'num_age':num_age_series}
Dataframe = pd.DataFrame(data)
print("Dataframe using Series :")
print(Dataframe)

#11. Create a dataframe from a dictionary
data = {'num_name':['Aakansha','Rohan','Isha','Ashish','Prerna'], 'num_age':[5,9,13,8,15]}  # avoid shadowing the built-in 'dict'
df = pd.DataFrame(data)
print("Dataframe using dictionary :")
print(df)

#12. Create dataframe using random values
index =['I1','I2','I3','I4','I5']
Col=['C1','C2','C3','C4','C5']
df = pd.DataFrame(np.random.rand(5,5), index=index, columns=Col)
print("Dataframe using random values :")
print(df)
df = df.reindex(columns=['C3','C5','C1','C4','C2'])  # use columns= to reorder columns, and keep the result (reindex does not modify in place)
print("Dataframe after reindexing :", df)

#13. Create a larger DataFrame and filter rows on a condition (Salary > 40000)
dict11={'num_name':['Aakansha','Rohan','Isha','Ashish','Prerna'],'Organization':['TCS','HCL','Infosys','Sopra','TCS'], 'num_age':[5,9,13,8,15],'Salary':[30000,45000,33000,44000,20000]}
df11 = pd.DataFrame(dict11)
print("Iteration in Dataframe :")
print(df11)
Condition_res=df11[df11['Salary']>40000]
print(Condition_res)

#14. Iteration using loc[] function
print("Iteration using loc[] function :")
for i in range(len(df11)):
    print(df11.loc[i,"num_name"],df11.loc[i,"Salary"])

# the same iteration using iloc (integer-position based)
for i in range(len(df11)):
    print(df11.iloc[i,0], df11.iloc[i,2])

print(df11.loc[1:2,['num_name','num_age']])  # note: loc label slices include both endpoints

#15. Selecting the first row (label 0) and all columns using 'loc'
print("Filtering Row 1 and all columns using 'loc' function :")
print(df11.loc[0,:])  # the row with label 0, every column
print(df11.iloc[1:3,0:2]) #iloc provides both row and column slicing

#16. To find the shape of the dataframe
print("shape of the dataframe :")
print(df11.shape)

#17. To find the first two rows of any dataframe
print("first two rows of any dataframe :")
print(df11.head(2))

#18. Finding the maximum value in the particular column of dataframe
print("maximum value in the particular column of dataframe :") 
print(df11.Salary.max()) #finding maximum salary

#19. To find the maximum value of all columns
print("maximum value of all column :")
print(df11.max()) #finding maximum in all columns

#20. Finding row which has the maximum salary
print("Finding row which has the maximum salary :")
print(df11[df11.Salary == df11.Salary.max()])
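An equivalent, slightly shorter way to fetch that row is `idxmax()`, which returns the index label of the maximum value; a sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "salary": [30000, 45000, 20000]})

top = df.loc[df["salary"].idxmax()]  # the row whose salary is largest

print(top["name"])  # B
```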

#21. Similarly to the max case above, finding the minimum value in a particular column of the dataframe
print("minimum value in the particular column of dataframe  :")
print(df11.Salary.min()) #finding minimum salary

#22. Finding row which has the minimum salary
print("Finding row which has the minimum salary :")
print(df11[df11.Salary == df11.Salary.min()])

#23. Filtering for the 'TCS organization' with 'Aakansha' name
print("Filtering for the 'TCS organization' with 'Aakansha' name :")
Res = df11[df11['Organization'].str.contains("TCS") & df11['num_name'].str.contains("Aakansha")]
print(Res)

#24. Renaming the column names
print("Renaming the column name :")
df11.columns =['NAME', 'ORGANIZATION','AGE','SALARY']
print(df11.columns)

#25. Fetching unique values of column
print("Fetching unique values of column :")
print(df11.ORGANIZATION.unique())

#26. Fetching frequency counts of a column
print("Fetching frequancy counts of columns :")
print(df11.ORGANIZATION.value_counts())

#27. Lowercasing the values of a column
print("Lowercase column name :")
df11['ORGANIZATION']=df11['ORGANIZATION'].str.lower()
print(df11) 

#28. Uppercasing the values of a column
print(" Uppercase column name :")
df11['NAME']=df11['NAME'].str.upper()
print(df11) 

#29. Capitalizing the first letter of each value in a column
print(" Capitalize first letter in the column :")
df11['ORGANIZATION']=df11['ORGANIZATION'].str.capitalize()
print(df11) 

#30. Dropping 'NAME and AGE' column
print("Dropping 'NAME and AGE' column :")
df11.drop(df11.columns[[0, 2]], axis = 1, inplace = True)
print(df11)

#31. Reading a CSV file using read_csv
print("Reading CSV files using read_csv :")
dataframe = pd.read_csv("data.csv")  # assumes data.csv (the dataset used below) is in the working directory

#32. Finding column names in given csv file
print("Finding column names in given csv file :")
print(dataframe.columns)

#33. Finding first 10 rows in the csv file
print("Finding first 10 rows in the csv file :")
print(dataframe.head(10))

#34. Finding the type of column 'fractal_dimension_worst'
print("Finding the type of column 'fractal_dimension_worst' :")
print(dataframe['fractal_dimension_worst'].dtypes)

#35. Finding last 10 rows from csv file
print("Finding last 10 rows from csv file :")
print(dataframe.tail(10))

#36. Getting detailed information about the csv file
print("Give detail information of csv file :")
print(dataframe.info())

#37. dropna() returns a new DataFrame with all rows containing NULL values removed, leaving the original unchanged; with inplace=True it modifies the DataFrame itself.
#dataframe.dropna(inplace=True)
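Since the call above is commented out, here is a self-contained sketch of dropna() on toy data (column names made up), returning a new DataFrame rather than modifying in place:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

cleaned = df.dropna()  # keeps only rows with no NULL values

print(cleaned.shape)  # (1, 2) - only the first row has no NaN
```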

#38. Replace NULL values in a particular column with 75:
print("Replace NULL values in the particular columns with 75 :")
dataframe["radius_worst"] = dataframe["radius_worst"].fillna(75)  # assignment avoids the chained-inplace warning in newer pandas

#39. Replacing null values in a column with the mean, median or mode
print("Replacing null values in the column with either mean, median or mode :")
rp = dataframe["radius_worst"].median()
dataframe["radius_worst"] = dataframe["radius_worst"].fillna(rp)
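The same pattern works for the mean and the mode; note that mode() returns a Series (there can be ties), so take its first entry. A sketch on toy data:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 2.0, np.nan])

filled_mean = s.fillna(s.mean())     # NaN -> mean of [1, 2, 2] = 5/3
filled_mode = s.fillna(s.mode()[0])  # NaN -> 2.0, the most frequent value

print(filled_mode.tolist())  # [1.0, 2.0, 2.0, 2.0]
```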

#40. If you want to drop rows that satisfy some condition
for i in dataframe.index:
    if dataframe.loc[i,"fractal_dimension_worst"]<0.03:
        dataframe.drop(i,inplace =True)
print('To find value less than 0.03 :', dataframe[dataframe['fractal_dimension_worst']<0.03])  # empty DataFrame, confirming the drop
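Dropping rows one at a time inside a loop works, but a boolean mask does the same in one vectorized step and is the more idiomatic pandas pattern; a sketch on toy data (column name borrowed from the exercise, values made up):

```python
import pandas as pd

df = pd.DataFrame({"fractal_dimension_worst": [0.01, 0.05, 0.02, 0.08]})

# keep only rows at or above the threshold (equivalent to the drop loop)
df = df[df["fractal_dimension_worst"] >= 0.03].reset_index(drop=True)

print(len(df))  # 2
```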

#41. Replacing one particular value; here 15 is the row label
print(" if you want to replace any particular value, here 15 is the row :")
dataframe.loc[15,'fractal_dimension_worst']=0.1
print(dataframe.loc[15,'fractal_dimension_worst'])

#42. Finding first 2 columns
print("Finding first 2 columns :")
print(dataframe.iloc[:,:2].head())

Result of the above exercises

1-Dimensional array of data using Series :
0     5
1     9
2    13
3     8
4    15
dtype: int64

convert series again into list :
[5, 9, 13, 8, 15]

Addition of two pandas series :
0     7
1    17
2    22
3    19
4    19
dtype: int64

Multiplication of two pandas series :
0     10
1     72
2    117
3     88
4     60
dtype: int64

convert Dictionary into pandas :
x    1000
y    2000
z    3000
dtype: int64

convert numpy into pandas :
0     5
1     9
2    13
3     8
4    15
dtype: int32

Adding some new data to existing Series :
0         2
1         8
2        12
3      data
4    series
5        21
0     numpy
1        32
dtype: object

Reindexing or change the order of existing series data :
c        12
e    series
f        21
b         8
a         2
b         8
dtype: object

Mean of given data is : 23.818181818181817

Standard deviation of given data is : 10.495886640186066

Minimum value, 25th Percentile, median, 75th percentile, maximum of the given series are :
[ 9.  16.5 23.  29.  45. ]

Random Series values are : 0     3
1     0
2     6
3     8
4     0
5     4
6     1
7     1
8     7
9     3
10    6
11    0
12    7
13    2
14    3
15    9
16    0
17    1
18    3
19    9
dtype: int32

Frequency of each number are : 3    4
0    4
1    3
6    2
7    2
9    2
8    1
4    1
2    1
dtype: int64

Changed series of date strings to a timeseries :
0   2022-01-25 00:00:00
1   2021-04-08 00:00:00
2   2022-05-12 00:00:00
3   2020-02-21 00:00:00
4   2021-09-11 00:00:00
5   2021-10-10 12:20:00
dtype: datetime64[ns]

Dataframe using list :
      Name Organization  Age
0  Akansha          TCS   32
1   Ashish          HCL   32
2    Ankit       Canara   35

Dataframe using Series :
   num_name  num_age
0  Abhishek        5
1     Sahil        9
2     Swati       13
3     Rohan        8
4    Prerna       15

Dataframe using dictionary :
   num_name  num_age
0  Aakansha        5
1     Rohan        9
2      Isha       13
3    Ashish        8
4    Prerna       15

Dataframe using random values :
          C1        C2        C3        C4        C5
I1  0.896202  0.999017  0.374783  0.152988  0.538044
I2  0.326453  0.883428  0.110783  0.508416  0.194583
I3  0.109798  0.545732  0.977045  0.726761  0.032932
I4  0.655264  0.194990  0.429043  0.810116  0.453745
I5  0.137495  0.902280  0.576180  0.931368  0.271951

Dataframe after reindexing :           C3        C5        C1        C4        C2
I1  0.374783  0.538044  0.896202  0.152988  0.999017
I2  0.110783  0.194583  0.326453  0.508416  0.883428
I3  0.977045  0.032932  0.109798  0.726761  0.545732
I4  0.429043  0.453745  0.655264  0.810116  0.194990
I5  0.576180  0.271951  0.137495  0.931368  0.902280

Iteration in Dataframe :
   num_name Organization  num_age  Salary
0  Aakansha          TCS        5   30000
1     Rohan          HCL        9   45000
2      Isha      Infosys       13   33000
3    Ashish        Sopra        8   44000
4    Prerna          TCS       15   20000
  num_name Organization  num_age  Salary
1    Rohan          HCL        9   45000
3   Ashish        Sopra        8   44000

Iteration using loc[] function :
Aakansha 30000
Rohan 45000
Isha 33000
Ashish 44000
Prerna 20000
Aakansha 5
Rohan 9
Isha 13
Ashish 8
Prerna 15
  num_name  num_age
1    Rohan        9
2     Isha       13

Filtering Row 1 and all columns using 'loc' function :
num_name        Aakansha
Organization         TCS
num_age                5
Salary             30000
Name: 0, dtype: object
  num_name Organization
1    Rohan          HCL
2     Isha      Infosys

shape of the dataframe :
(5, 4)

first two rows of any dataframe :
   num_name Organization  num_age  Salary
0  Aakansha          TCS        5   30000
1     Rohan          HCL        9   45000

maximum value in the particular column of dataframe :
45000

maximum value of all column :
num_name        Rohan
Organization      TCS
num_age            15
Salary          45000
dtype: object

Finding row which has the maximum salary :
  num_name Organization  num_age  Salary
1    Rohan          HCL        9   45000

minimum value in the particular column of dataframe  :
20000

Finding row which has the minimum salary :
  num_name Organization  num_age  Salary
4   Prerna          TCS       15   20000

Filtering for the 'TCS organization' with 'Aakansha' name :
   num_name Organization  num_age  Salary
0  Aakansha          TCS        5   30000

Renaming the column name :
Index(['NAME', 'ORGANIZATION', 'AGE', 'SALARY'], dtype='object')

Fetching unique values of column :
['TCS' 'HCL' 'Infosys' 'Sopra']

Fetching frequency counts of a column :
TCS        2
HCL        1
Infosys    1
Sopra      1
Name: ORGANIZATION, dtype: int64

Lowercase column name :
       NAME ORGANIZATION  AGE  SALARY
0  Aakansha          tcs    5   30000
1     Rohan          hcl    9   45000
2      Isha      infosys   13   33000
3    Ashish        sopra    8   44000
4    Prerna          tcs   15   20000

 Uppercase column name :
       NAME ORGANIZATION  AGE  SALARY
0  AAKANSHA          tcs    5   30000
1     ROHAN          hcl    9   45000
2      ISHA      infosys   13   33000
3    ASHISH        sopra    8   44000
4    PRERNA          tcs   15   20000

 Capitalize first letter in the column :
       NAME ORGANIZATION  AGE  SALARY
0  AAKANSHA          Tcs    5   30000
1     ROHAN          Hcl    9   45000
2      ISHA      Infosys   13   33000
3    ASHISH        Sopra    8   44000
4    PRERNA          Tcs   15   20000

Dropping 'NAME and AGE' column :
  ORGANIZATION  SALARY
0          Tcs   30000
1          Hcl   45000
2      Infosys   33000
3        Sopra   44000
4          Tcs   20000

Reading CSV files using read_csv :

Finding column names in given csv file :
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

Finding first 10 rows in the csv file :
         id diagnosis  ...  fractal_dimension_worst  Unnamed: 32
0    842302         M  ...                  0.11890          NaN
1    842517         M  ...                  0.08902          NaN
2  84300903         M  ...                  0.08758          NaN
3  84348301         M  ...                  0.17300          NaN
4  84358402         M  ...                  0.07678          NaN
5    843786         M  ...                  0.12440          NaN
6    844359         M  ...                  0.08368          NaN
7  84458202         M  ...                  0.11510          NaN
8    844981         M  ...                  0.10720          NaN
9  84501001         M  ...                  0.20750          NaN

[10 rows x 33 columns]

Finding the type of column 'fractal_dimension_worst' :
float64

Finding last 10 rows from csv file :
         id diagnosis  ...  fractal_dimension_worst  Unnamed: 32
559  925291         B  ...                  0.08732          NaN
560  925292         B  ...                  0.08321          NaN
561  925311         B  ...                  0.05905          NaN
562  925622         M  ...                  0.14090          NaN
563  926125         M  ...                  0.09873          NaN
564  926424         M  ...                  0.07115          NaN
565  926682         M  ...                  0.06637          NaN
566  926954         M  ...                  0.07820          NaN
567  927241         M  ...                  0.12400          NaN
568   92751         B  ...                  0.07039          NaN

[10 rows x 33 columns]

Give detail information of csv file :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
None

Replace NULL values in the particular columns with 75 :

Replacing null values in the column with either mean, median or mode :

To find value less than 0.03 : Empty DataFrame
Columns: [id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, Unnamed: 32]
Index: []

 if you want to replace any particular value, here 15 is the row :
0.1

Finding first 2 columns :
         id diagnosis
0    842302         M
1    842517         M
2  84300903         M
3  84348301         M
4  84358402         M

