Monday, 31 January 2022

Pandas Practice: Data Science Library in Python

Pandas is an open-source data science library for Python, used mainly for analyzing, cleaning and manipulating data. It lets you load data, clean and reshape large and complex datasets, and then draw conclusions from the cleaned result. Pandas is fast and performs well even on large datasets.

Pandas provides the Series (a 1-D labelled array) and the DataFrame (a 2-D labelled table) for manipulating data. Data can be loaded into both from arrays, lists, dictionaries, tuples, CSV files, Excel files and SQL databases.
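As a quick illustration of the two structures (a minimal sketch with made-up values, using the standard pandas constructors):

```python
import pandas as pd

# a Series is a labelled 1-D array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# a DataFrame is a labelled 2-D table, here built from a dictionary
df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [25, 30]})

print(s["b"])    # 20
print(df.shape)  # (2, 2)
```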

Below is a set of exercises that cover the basics of programming with the pandas data science library in Python. For more practice, you can visit https://www.w3resource.com/python-exercises/pandas/ or https://www.geeksforgeeks.org/python-pandas-practice-exercises-questions-and-solutions/

Steps to run the programs below:

  1. Install Anaconda on your PC/laptop.
  2. Open Spyder.
  3. Copy and paste the programs below.
  4. Press the Run button. The result appears in the Console pane on the right side.
#1. Create a 1-Dimensional array of data using Series in Pandas and then convert that Series back into a list
import pandas as pd
import numpy as np
ps = pd.Series([5,9,13,8,15])
print("1-Dimensional array of data using Series :")
print(ps)
ps_list = ps.tolist()  # avoid shadowing the built-in 'list'
print("convert series again into list :")
print(ps_list)

#2. Addition and multiplication of two pandas series
ps1 = pd.Series([5,9,13,8,15])
ps2 = pd.Series([2,8,9,11,4])
ps_add= ps1+ps2
ps_mul= ps1*ps2
print("Addition of two pandas series :")
print(ps_add)
print("Multiplication of two pandas series :")
print(ps_mul)
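One point worth noting (not shown above): arithmetic between two Series aligns on the index labels, not on position, so a label missing from either side produces NaN. A small sketch with hypothetical data:

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])

c = a + b  # aligns on labels: only 'y' appears in both Series

print(c)
# x     NaN
# y    12.0
# z     NaN
```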

#3. Convert a dictionary and a numpy array into pandas Series
d = {'x':1000, 'y':2000, 'z':3000}  # avoid shadowing the built-in 'dict'
dict_to_series = pd.Series(d)
print("convert Dictionary into pandas :")
print(dict_to_series)
num_array = np.array([5,9,13,8,15])
numpy_to_Series = pd.Series(num_array)
print("convert numpy into pandas :")
print(numpy_to_Series)

#4. Adding some new data to an existing Series
ps = pd.Series([2, 8, 12, 'data', 'series', 21])
# Series.append was deprecated and removed in pandas 2.0; pd.concat is the current way
new_ps = pd.concat([ps, pd.Series(['numpy', 32])])
print("Adding some new data to existing Series :")
print(new_ps)

#5. Reindexing or change the order of existing series data
ps = pd.Series(data=[2, 8, 12, 'data', 'series', 21], index=['a','b','c','d','e','f'])
reindex_ps= ps.reindex(index=['c','e','f','b','a','b'])
print("Reindexing or change the order of existing series data :")
print(reindex_ps)

#6. Calculate the mean, standard deviation, median, minimum value and maximum value of given data
#Note: the 0th percentile is the minimum value, the 50th percentile is the median, and the 100th percentile is the maximum value
data =pd.Series([12,23,14,27,9,22,34,19,26,31,45])
print("Mean of given data is :", data.mean())
print("Standard deviation of given data is :", data.std())
print("Minimum value, 25th Percentile, median, 75th percentile, maximum of the given series are :")
data_percentile= np.percentile(data, q=[0,25,50,75,100])
print(data_percentile)
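The same summary can be pulled in one call with `Series.describe()`, which returns count, mean, standard deviation, min, the quartiles and max together; a sketch on the same data:

```python
import pandas as pd

data = pd.Series([12, 23, 14, 27, 9, 22, 34, 19, 26, 31, 45])
summary = data.describe()  # count, mean, std, min, 25%, 50%, 75%, max

print(summary["50%"])                  # the median: 23.0
print(summary["min"], summary["max"])  # 9.0 45.0
```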

#7. Generate a Series of random numbers and count how often each value occurs
ps = pd.Series(np.random.randint(10,size =20))
print("Random Series values are :", ps)
count =ps.value_counts()
print("Frequency of each number are :", count)

#8. Convert a series of date strings to a timeseries
series_date = pd.Series(['25 Jan 2022', '20210408', '2022/05/12', '21-02-2020', '2021-09-11', '2021-10-10T12:20'])
print("Changed series of date strings to a timeseries :")
print(pd.to_datetime(series_date))
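A caution when adapting this: newer pandas releases (2.0 and later) are stricter about mixed date formats and may require `format='mixed'`, and an ambiguous string like '21-02-2020' is safest parsed with an explicit hint. A minimal sketch using `dayfirst`:

```python
import pandas as pd

# dayfirst=True tells pandas that '21-02-2020' means 21 February 2020
ts = pd.to_datetime(pd.Series(["21-02-2020", "11-09-2021"]), dayfirst=True)

print(ts.dt.month.tolist())  # [2, 9]
print(ts.dt.day.tolist())    # [21, 11]
```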

#9. Create a DataFrame from a list
rows = [['Akansha', 'TCS', 32],['Ashish','HCL',32],['Ankit','Canara',35]]  # avoid shadowing the built-in 'list'
df = pd.DataFrame(rows, columns=['Name','Organization','Age'])
print("Dataframe using list :")
print(df)

#10. Create Dataframe using Series
num_name = ['Abhishek','Sahil','Swati','Rohan','Prerna']
num_age = [5,9,13,8,15]
num_name_series=pd.Series(num_name)
num_age_series = pd.Series(num_age)
data = {'num_name':num_name_series, 'num_age':num_age_series}
Dataframe = pd.DataFrame(data)
print("Dataframe using Series :")
print(Dataframe)

#11. Create a dataframe from a dictionary
data = {'num_name':['Aakansha','Rohan','Isha','Ashish','Prerna'], 'num_age':[5,9,13,8,15]}  # avoid shadowing the built-in 'dict'
df = pd.DataFrame(data)
print("Dataframe using dictionary :")
print(df)

#12. Create dataframe using random values
index =['I1','I2','I3','I4','I5']
Col=['C1','C2','C3','C4','C5']
df = pd.DataFrame(np.random.rand(5,5), index=index, columns=Col)
print("Dataframe using random values :")
print(df)
df = df.reindex(columns=['C3','C5','C1','C4','C2'])  # use columns= to reorder columns, and keep the result (reindex does not modify in place)
print("Dataframe after reindexing :", df)

#13. Create a larger DataFrame and filter rows on a condition (Salary > 40000)
dict11={'num_name':['Aakansha','Rohan','Isha','Ashish','Prerna'],'Organization':['TCS','HCL','Infosys','Sopra','TCS'], 'num_age':[5,9,13,8,15],'Salary':[30000,45000,33000,44000,20000]}
df11 = pd.DataFrame(dict11)
print("Iteration in Dataframe :")
print(df11)
Condition_res=df11[df11['Salary']>40000]
print(Condition_res)

#14. Iteration using loc[] function
print("Iteration using loc[] function :")
for i in range(len(df11)):
    print(df11.loc[i,"num_name"],df11.loc[i,"Salary"])

# the same iteration using iloc (integer-position based)
for i in range(len(df11)):
    print(df11.iloc[i,0], df11.iloc[i,2])

print(df11.loc[1:2,['num_name','num_age']])  # note: loc label slices include both endpoints

#15. Selecting the first row (label 0) and all columns using 'loc'
print("Filtering Row 1 and all columns using 'loc' function :")
print(df11.loc[0,:])  # the row with label 0, every column
print(df11.iloc[1:3,0:2]) #iloc provides both row and column slicing

#16. To find the shape of the dataframe
print("shape of the dataframe :")
print(df11.shape)

#17. To find the first two rows of any dataframe
print("first two rows of any dataframe :")
print(df11.head(2))

#18. Finding the maximum value in the particular column of dataframe
print("maximum value in the particular column of dataframe :") 
print(df11.Salary.max()) #finding maximum salary

#19. To find the maximum value of all columns
print("maximum value of all column :")
print(df11.max()) #finding maximum in all columns

#20. Finding row which has the maximum salary
print("Finding row which has the maximum salary :")
print(df11[df11.Salary == df11.Salary.max()])
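An equivalent, slightly shorter way to fetch that row is `idxmax()`, which returns the index label of the maximum value; a sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "salary": [30000, 45000, 20000]})

top = df.loc[df["salary"].idxmax()]  # the row whose salary is largest

print(top["name"])  # B
```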

#21. Similarly to the max case above, finding the minimum value in a particular column of the dataframe
print("minimum value in the particular column of dataframe  :")
print(df11.Salary.min()) #finding minimum salary

#22. Finding row which has the minimum salary
print("Finding row which has the minimum salary :")
print(df11[df11.Salary == df11.Salary.min()])

#23. Filtering for the 'TCS organization' with 'Aakansha' name
print("Filtering for the 'TCS organization' with 'Aakansha' name :")
Res = df11[df11['Organization'].str.contains("TCS") & df11['num_name'].str.contains("Aakansha")]
print(Res)

#24. Renaming the column names
print("Renaming the column name :")
df11.columns =['NAME', 'ORGANIZATION','AGE','SALARY']
print(df11.columns)

#25. Fetching unique values of column
print("Fetching unique values of column :")
print(df11.ORGANIZATION.unique())

#26. Fetching frequency counts of a column
print("Fetching frequancy counts of columns :")
print(df11.ORGANIZATION.value_counts())

#27. Lowercasing the values of a column
print("Lowercase column name :")
df11['ORGANIZATION']=df11['ORGANIZATION'].str.lower()
print(df11) 

#28. Uppercasing the values of a column
print(" Uppercase column name :")
df11['NAME']=df11['NAME'].str.upper()
print(df11) 

#29. Capitalizing the first letter of each value in a column
print(" Capitalize first letter in the column :")
df11['ORGANIZATION']=df11['ORGANIZATION'].str.capitalize()
print(df11) 

#30. Dropping 'NAME and AGE' column
print("Dropping 'NAME and AGE' column :")
df11.drop(df11.columns[[0, 2]], axis = 1, inplace = True)
print(df11)

#31. Reading a CSV file using read_csv
print("Reading CSV files using read_csv :")
dataframe = pd.read_csv("data.csv")  # assumes data.csv (the dataset used below) is in the working directory

#32. Finding column names in given csv file
print("Finding column names in given csv file :")
print(dataframe.columns)

#33. Finding first 10 rows in the csv file
print("Finding first 10 rows in the csv file :")
print(dataframe.head(10))

#34. Finding the type of column 'fractal_dimension_worst'
print("Finding the type of column 'fractal_dimension_worst' :")
print(dataframe['fractal_dimension_worst'].dtypes)

#35. Finding last 10 rows from csv file
print("Finding last 10 rows from csv file :")
print(dataframe.tail(10))

#36. Getting detailed information about the csv file
print("Give detail information of csv file :")
print(dataframe.info())

#37. dropna() returns a new DataFrame with all rows containing NULL values removed, leaving the original unchanged; with inplace=True it modifies the DataFrame itself.
#dataframe.dropna(inplace=True)
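Since the call above is commented out, here is a self-contained sketch of dropna() on toy data (column names made up), returning a new DataFrame rather than modifying in place:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

cleaned = df.dropna()  # keeps only rows with no NULL values

print(cleaned.shape)  # (1, 2) - only the first row has no NaN
```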

#38. Replace NULL values in a particular column with 75:
print("Replace NULL values in the particular columns with 75 :")
dataframe["radius_worst"] = dataframe["radius_worst"].fillna(75)  # assignment avoids the chained-inplace warning in newer pandas

#39. Replacing null values in a column with the mean, median or mode
print("Replacing null values in the column with either mean, median or mode :")
rp = dataframe["radius_worst"].median()
dataframe["radius_worst"] = dataframe["radius_worst"].fillna(rp)
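The same pattern works for the mean and the mode; note that mode() returns a Series (there can be ties), so take its first entry. A sketch on toy data:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 2.0, np.nan])

filled_mean = s.fillna(s.mean())     # NaN -> mean of [1, 2, 2] = 5/3
filled_mode = s.fillna(s.mode()[0])  # NaN -> 2.0, the most frequent value

print(filled_mode.tolist())  # [1.0, 2.0, 2.0, 2.0]
```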

#40. If you want to drop rows that satisfy some condition
for i in dataframe.index:
    if dataframe.loc[i,"fractal_dimension_worst"]<0.03:
        dataframe.drop(i,inplace =True)
print('To find value less than 0.03 :', dataframe[dataframe['fractal_dimension_worst']<0.03])  # empty DataFrame, confirming the drop
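Dropping rows one at a time inside a loop works, but a boolean mask does the same in one vectorized step and is the more idiomatic pandas pattern; a sketch on toy data (column name borrowed from the exercise, values made up):

```python
import pandas as pd

df = pd.DataFrame({"fractal_dimension_worst": [0.01, 0.05, 0.02, 0.08]})

# keep only rows at or above the threshold (equivalent to the drop loop)
df = df[df["fractal_dimension_worst"] >= 0.03].reset_index(drop=True)

print(len(df))  # 2
```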

#41. Replacing one particular value; here 15 is the row label
print(" if you want to replace any particular value, here 15 is the row :")
dataframe.loc[15,'fractal_dimension_worst']=0.1
print(dataframe.loc[15,'fractal_dimension_worst'])

#42. Finding first 2 columns
print("Finding first 2 columns :")
print(dataframe.iloc[:,:2].head())

Result of the above exercises

1-Dimensional array of data using Series :
0     5
1     9
2    13
3     8
4    15
dtype: int64

convert series again into list :
[5, 9, 13, 8, 15]

Addition of two pandas series :
0     7
1    17
2    22
3    19
4    19
dtype: int64

Multiplication of two pandas series :
0     10
1     72
2    117
3     88
4     60
dtype: int64

convert Dictionary into pandas :
x    1000
y    2000
z    3000
dtype: int64

convert numpy into pandas :
0     5
1     9
2    13
3     8
4    15
dtype: int32

Adding some new data to existing Series :
0         2
1         8
2        12
3      data
4    series
5        21
0     numpy
1        32
dtype: object

Reindexing or change the order of existing series data :
c        12
e    series
f        21
b         8
a         2
b         8
dtype: object

Mean of given data is : 23.818181818181817

Standard deviation of given data is : 10.495886640186066

Minimum value, 25th Percentile, median, 75th percentile, maximum of the given series are :
[ 9.  16.5 23.  29.  45. ]

Random Series values are : 0     3
1     0
2     6
3     8
4     0
5     4
6     1
7     1
8     7
9     3
10    6
11    0
12    7
13    2
14    3
15    9
16    0
17    1
18    3
19    9
dtype: int32

Frequency of each number are : 3    4
0    4
1    3
6    2
7    2
9    2
8    1
4    1
2    1
dtype: int64

Changed series of date strings to a timeseries :
0   2022-01-25 00:00:00
1   2021-04-08 00:00:00
2   2022-05-12 00:00:00
3   2020-02-21 00:00:00
4   2021-09-11 00:00:00
5   2021-10-10 12:20:00
dtype: datetime64[ns]

Dataframe using list :
      Name Organization  Age
0  Akansha          TCS   32
1   Ashish          HCL   32
2    Ankit       Canara   35

Dataframe using Series :
   num_name  num_age
0  Abhishek        5
1     Sahil        9
2     Swati       13
3     Rohan        8
4    Prerna       15

Dataframe using dictionary :
   num_name  num_age
0  Aakansha        5
1     Rohan        9
2      Isha       13
3    Ashish        8
4    Prerna       15

Dataframe using random values :
          C1        C2        C3        C4        C5
I1  0.896202  0.999017  0.374783  0.152988  0.538044
I2  0.326453  0.883428  0.110783  0.508416  0.194583
I3  0.109798  0.545732  0.977045  0.726761  0.032932
I4  0.655264  0.194990  0.429043  0.810116  0.453745
I5  0.137495  0.902280  0.576180  0.931368  0.271951

Dataframe after reindexing :           C3        C5        C1        C4        C2
I1  0.374783  0.538044  0.896202  0.152988  0.999017
I2  0.110783  0.194583  0.326453  0.508416  0.883428
I3  0.977045  0.032932  0.109798  0.726761  0.545732
I4  0.429043  0.453745  0.655264  0.810116  0.194990
I5  0.576180  0.271951  0.137495  0.931368  0.902280

Iteration in Dataframe :
   num_name Organization  num_age  Salary
0  Aakansha          TCS        5   30000
1     Rohan          HCL        9   45000
2      Isha      Infosys       13   33000
3    Ashish        Sopra        8   44000
4    Prerna          TCS       15   20000
  num_name Organization  num_age  Salary
1    Rohan          HCL        9   45000
3   Ashish        Sopra        8   44000

Iteration using loc[] function :
Aakansha 30000
Rohan 45000
Isha 33000
Ashish 44000
Prerna 20000
Aakansha 5
Rohan 9
Isha 13
Ashish 8
Prerna 15
  num_name  num_age
1    Rohan        9
2     Isha       13

Filtering Row 1 and all columns using 'loc' function :
num_name        Aakansha
Organization         TCS
num_age                5
Salary             30000
Name: 0, dtype: object
  num_name Organization
1    Rohan          HCL
2     Isha      Infosys

shape of the dataframe :
(5, 4)

first two rows of any dataframe :
   num_name Organization  num_age  Salary
0  Aakansha          TCS        5   30000
1     Rohan          HCL        9   45000

maximum value in the particular column of dataframe :
45000

maximum value of all column :
num_name        Rohan
Organization      TCS
num_age            15
Salary          45000
dtype: object

Finding row which has the maximum salary :
  num_name Organization  num_age  Salary
1    Rohan          HCL        9   45000

minimum value in the particular column of dataframe  :
20000

Finding row which has the minimum salary :
  num_name Organization  num_age  Salary
4   Prerna          TCS       15   20000

Filtering for the 'TCS organization' with 'Aakansha' name :
   num_name Organization  num_age  Salary
0  Aakansha          TCS        5   30000

Renaming the column name :
Index(['NAME', 'ORGANIZATION', 'AGE', 'SALARY'], dtype='object')

Fetching unique values of column :
['TCS' 'HCL' 'Infosys' 'Sopra']

Fetching frequency counts of a column :
TCS        2
HCL        1
Infosys    1
Sopra      1
Name: ORGANIZATION, dtype: int64

Lowercase column name :
       NAME ORGANIZATION  AGE  SALARY
0  Aakansha          tcs    5   30000
1     Rohan          hcl    9   45000
2      Isha      infosys   13   33000
3    Ashish        sopra    8   44000
4    Prerna          tcs   15   20000

 Uppercase column name :
       NAME ORGANIZATION  AGE  SALARY
0  AAKANSHA          tcs    5   30000
1     ROHAN          hcl    9   45000
2      ISHA      infosys   13   33000
3    ASHISH        sopra    8   44000
4    PRERNA          tcs   15   20000

 Capitalize first letter in the column :
       NAME ORGANIZATION  AGE  SALARY
0  AAKANSHA          Tcs    5   30000
1     ROHAN          Hcl    9   45000
2      ISHA      Infosys   13   33000
3    ASHISH        Sopra    8   44000
4    PRERNA          Tcs   15   20000

Dropping 'NAME and AGE' column :
  ORGANIZATION  SALARY
0          Tcs   30000
1          Hcl   45000
2      Infosys   33000
3        Sopra   44000
4          Tcs   20000

Reading CSV files using read_csv :

Finding column names in given csv file :
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

Finding first 10 rows in the csv file :
         id diagnosis  ...  fractal_dimension_worst  Unnamed: 32
0    842302         M  ...                  0.11890          NaN
1    842517         M  ...                  0.08902          NaN
2  84300903         M  ...                  0.08758          NaN
3  84348301         M  ...                  0.17300          NaN
4  84358402         M  ...                  0.07678          NaN
5    843786         M  ...                  0.12440          NaN
6    844359         M  ...                  0.08368          NaN
7  84458202         M  ...                  0.11510          NaN
8    844981         M  ...                  0.10720          NaN
9  84501001         M  ...                  0.20750          NaN

[10 rows x 33 columns]

Finding the type of column 'fractal_dimension_worst' :
float64

Finding last 10 rows from csv file :
         id diagnosis  ...  fractal_dimension_worst  Unnamed: 32
559  925291         B  ...                  0.08732          NaN
560  925292         B  ...                  0.08321          NaN
561  925311         B  ...                  0.05905          NaN
562  925622         M  ...                  0.14090          NaN
563  926125         M  ...                  0.09873          NaN
564  926424         M  ...                  0.07115          NaN
565  926682         M  ...                  0.06637          NaN
566  926954         M  ...                  0.07820          NaN
567  927241         M  ...                  0.12400          NaN
568   92751         B  ...                  0.07039          NaN

[10 rows x 33 columns]

Give detail information of csv file :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
None

Replace NULL values in the particular columns with 75 :

Replacing null values in the column with either mean, median or mode :

To find value less than 0.03 : Empty DataFrame
Columns: [id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, Unnamed: 32]
Index: []

 if you want to replace any particular value, here 15 is the row :
0.1

Finding first 2 columns :
         id diagnosis
0    842302         M
1    842517         M
2  84300903         M
3  84348301         M
4  84358402         M

