The pandas loc indexer ( pd.DataFrame.loc )
If you're a data scientist by profession, or aspiring to be one, trust me: pandas belongs on your must-learn list.
To install pandas into your Python environment, type the following command in a terminal:
$ pip install pandas
Once the installation completes, open a Python environment by typing the following in your terminal (this assumes Jupyter is already installed):
$ jupyter notebook
Import pandas into your Python environment as shown below. If you see no error, the installation is correct.
import pandas as pd
Importing pandas as ‘pd’ is the convention most widely accepted in the data science community. A brief note about pandas: it stores data in a “DataFrame”, which is a table made of rows (labelled by an index) and columns. It can read most common file formats, such as ‘.csv’, ‘.json’ and ‘.xls’, through the corresponding reader function (for example pd.read_csv, pd.read_json and pd.read_excel).
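To make the rows-and-columns idea concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names and values are made up for illustration and do not come from the data set used later.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
clips = pd.DataFrame(
    {
        "clip_id": ["a1", "b2", "c3"],
        "start_seconds": [0.0, 30.0, 30.0],
        "end_seconds": [10.0, 40.0, 40.0],
    }
)

print(clips.shape)          # (number of rows, number of columns)
print(list(clips.columns))  # column names
print(list(clips.index))    # row labels; the default index counts up from 0
```

When no index is given, pandas labels the rows 0, 1, 2 and so on, which is also what happens when a file is read without an index column.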
About the data:
The example data used here is Google's AudioSet. Each row has a YouTube ID ( YTID ) column, the start and end seconds of an audio clip, and a positive labels column holding the annotations for that clip. The positive labels are the sounds present in the audio clip; they are stored in encoded form and can be decoded using a separate file, class_labels_indices.csv. You can find all the data at the link below.
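As a rough sketch of how that decoding could work, the snippet below maps encoded labels to readable names with a small in-memory stand-in for class_labels_indices.csv. The column names ("mid", "display_name"), the two sample rows, and the helper decode_labels are assumptions made for illustration, so check them against the real file before relying on them.

```python
import pandas as pd

# Stand-in for class_labels_indices.csv. The column names and the two rows
# are assumptions for illustration only; inspect the real file to confirm.
class_labels = pd.DataFrame(
    {
        "mid": ["/m/04rlf", "/m/09x0r"],
        "display_name": ["Music", "Speech"],
    }
)

# Lookup table from encoded label to readable name.
mid_to_name = dict(zip(class_labels["mid"], class_labels["display_name"]))


def decode_labels(positive_labels):
    """Turn a comma-separated string of encoded labels into readable names.

    Labels missing from the lookup table are returned unchanged.
    """
    return [mid_to_name.get(mid, mid) for mid in positive_labels.split(",")]


decoded = decode_labels("/m/04rlf,/m/09x0r")
print(decoded)
```

With the real file, the stand-in DataFrame would be replaced by a pd.read_csv call on class_labels_indices.csv.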
Link for the data : https://research.google.com/audioset/download.html
#Reading csv file into the python environment
unbalanced_train = pd.read_csv("data/audioset/unbalanced_train_segments.csv", skipinitialspace=True, skiprows=2)
The line above reads the ‘.csv’ file named ‘unbalanced_train_segments.csv’ from the given path. skipinitialspace=True drops any space that follows a comma, and skiprows=2 skips the first two rows, which in my case contain nothing worth reading into the DataFrame.
# Check what is in unbalanced_train using the .head() method
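To see what those two options do without downloading anything, here is a small sketch that reads an in-memory sample instead of the real file. The sample text is hypothetical and only mimics the layout described above: two leading lines to skip, a header row, and a space after each comma.

```python
import io

import pandas as pd

# Hypothetical sample, not real AudioSet rows.
sample = io.StringIO(
    "# first line of file metadata\n"
    "# second line of file metadata\n"
    "# YTID, start_seconds, end_seconds, positive_labels\n"
    'clip_a, 30.000, 40.000, "/m/04rlf"\n'
    'clip_b, 0.000, 10.000, "/m/04rlf,/m/09x0r"\n'
)

# skiprows=2 drops the two metadata lines, so the third line becomes the header.
# skipinitialspace=True strips the space after each comma, which keeps the
# column names clean and lets the quoted label list be parsed as one field.
df = pd.read_csv(sample, skipinitialspace=True, skiprows=2)

print(list(df.columns))
print(df)
```

Note that the first column keeps its leading "# " from the header row, which is why it is referred to as '# YTID' later on.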
unbalanced_train.head()
The output of unbalanced_train.head() shows the first five rows of the data with all the columns. pandas has several other methods that help you get to know the data as well.
Let's start with one of the most useful features in pandas, used very often in the data science community: loc. pd.DataFrame.loc retrieves the required rows and columns by label or by a conditional statement. Note that it is an indexer used with square brackets, not a function called with parentheses. An example makes this clearer: suppose that, in the data above, you want every row, with all its columns, where “start_seconds” is 30.0.
# To get all the rows with "start_seconds" as 30.0 using pd.DataFrame.loc
unbalanced_train.loc[unbalanced_train['start_seconds']==30.0]
# Get the number of rows that have "start_seconds" as 30.0 using pd.DataFrame.loc and shape
print("Number of rows :", unbalanced_train.loc[unbalanced_train['start_seconds']==30.0].shape[0])
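It can help to look at the condition on its own. The sketch below uses a small hypothetical DataFrame (the rows are invented, only the column names match this post) to show that the comparison produces one True or False per row, and that loc keeps the rows marked True.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this post.
df = pd.DataFrame(
    {
        "# YTID": ["clip_a", "clip_b", "clip_c"],
        "start_seconds": [30.0, 0.0, 30.0],
        "positive_labels": ["/m/04rlf", "/m/04rlf", "/m/09x0r"],
    }
)

# The comparison returns a boolean Series with one value per row.
mask = df["start_seconds"] == 30.0
print(mask.tolist())

# .loc keeps only the rows where the mask is True.
matches = df.loc[mask]
print("Number of rows :", matches.shape[0])

# Summing the mask gives the same count, because True counts as 1.
print("Number of rows :", int(mask.sum()))
```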
Now let's try a more complex pd.DataFrame.loc operation: getting only the ‘# YTID’ column for rows where the ‘start_seconds’ column is 30.0 and the ‘positive_labels’ column is ‘/m/04rlf’.
# To get only the '# YTID' column for rows with start_seconds as 30.0 and positive_labels as '/m/04rlf'
unbalanced_train['# YTID'].loc[unbalanced_train['start_seconds'].apply(lambda x: x==30.0) & unbalanced_train['positive_labels'].apply(lambda x: x=='/m/04rlf')]
The above code may look a little complex, but it is simple to decipher. Each .apply() call runs its lambda function ( “lambda” ) on every value of a column and produces one True or False per row. The logical and ( “&” ) combines the two results row by row, and .loc keeps only the ‘# YTID’ values for the rows where both conditions are True.
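The same filter can also be written without .apply(), by comparing the columns directly and passing the column name as the second argument to loc. This is an alternative sketch rather than the approach used above, shown on a small hypothetical DataFrame whose rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this post.
df = pd.DataFrame(
    {
        "# YTID": ["clip_a", "clip_b", "clip_c"],
        "start_seconds": [30.0, 0.0, 30.0],
        "positive_labels": ["/m/04rlf", "/m/04rlf", "/m/09x0r"],
    }
)

# Direct comparisons work on the whole column at once.
is_30 = df["start_seconds"] == 30.0
is_label = df["positive_labels"] == "/m/04rlf"

# First argument selects rows, second argument selects the column.
# If the comparisons are written inline, wrap each one in parentheses,
# because & binds more tightly than ==.
ytids = df.loc[is_30 & is_label, "# YTID"]
print(ytids.tolist())

# The apply-based version from the post gives the same result.
with_apply = df["# YTID"].loc[
    df["start_seconds"].apply(lambda x: x == 30.0)
    & df["positive_labels"].apply(lambda x: x == "/m/04rlf")
]
print(ytids.equals(with_apply))
```

Keep in mind that == compares the whole string, so either version only matches rows whose positive_labels value is exactly ‘/m/04rlf’ and nothing else.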