Pandas loc insights (pd.DataFrame.loc)

Pandas: if you are a data scientist by profession, or aspire to become one, trust me, this library belongs in your toolkit.

To install pandas into your Python environment, type the following command in a terminal:

$ pip install pandas

Once the installation is complete, start Jupyter by typing the following in your terminal:

$ jupyter notebook

Import pandas into your Python environment as shown below. If you see no error, then the installation is correct.

In [1]:
 import pandas as pd 

It is conventional to import pandas as ‘pd’, and this is the most widely accepted style in the data science community. A brief note about pandas: it stores data in a “DataFrame”, which is nothing but rows (the index) and columns. You can read almost every file format, namely ‘.csv’, ‘.json’, ‘.xls’, etc., with the corresponding reader function.
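
As a rough illustration of that note (a minimal sketch with made-up column names and file paths, not part of the AudioSet example), a DataFrame can be built by hand, and each file format has its own reader function:

import pandas as pd

# A DataFrame is just labelled rows (the index) plus named columns
df = pd.DataFrame({"YTID": ["abc", "def"], "start_seconds": [0.0, 30.0]})

# Hypothetical paths; each format has its own reader function
# df_csv = pd.read_csv("data.csv")
# df_json = pd.read_json("data.json")
# df_xls = pd.read_excel("data.xls")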

About the data:

The dataset used as an example here is Google's AudioSet, which has a YouTube ID ('# YTID') column, start and end seconds that delimit the audio clip, and a 'positive_labels' column. The positive labels are the sounds present in that clip; they are stored in an encoded format and can be decoded with the accompanying class_labels_indices.csv file. You can find all the data at the link below.
Link for the data: https://research.google.com/audioset/download.html
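
As a very rough sketch of how that decoding could be done (assuming, as described on the AudioSet download page, that class_labels_indices.csv has 'mid' and 'display_name' columns; treat this as illustrative, not as the official recipe):

# Hypothetical sketch: map encoded label ids to human-readable names
labels = pd.read_csv("data/audioset/class_labels_indices.csv")
mid_to_name = dict(zip(labels['mid'], labels['display_name']))

# Decode one positive_labels string
encoded = "/m/04rlf,/m/09x0r"
decoded = [mid_to_name.get(mid, mid) for mid in encoded.split(",")]
# decoded would look something like ['Music', 'Speech']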

In [2]:
# Reading the csv file into a pandas DataFrame
 unbalanced_train = pd.read_csv("data/audioset/unbalanced_train_segments.csv", skipinitialspace=True, skiprows=2) 

In the above line of code we read the ‘.csv’ file named ‘unbalanced_train_segments.csv’ from the path shown, stripping any whitespace that follows the delimiters (skipinitialspace=True) and skipping the first two rows (skiprows=2), which in my case contain nothing worth loading into the DataFrame.
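
To see what those two arguments do in isolation, here is a small self-contained sketch using an in-memory CSV (mock data, not the AudioSet file):

import io
import pandas as pd

# Two metadata lines we do not want, then a header row with a space after each comma
raw = (
    "first metadata line to skip\n"
    "second metadata line to skip\n"
    "YTID, start_seconds, end_seconds\n"
    "abc, 0.0, 10.0\n"
)

# skiprows=2 drops the first two lines; skipinitialspace=True strips the
# space that follows each comma, so the column names come out clean
df = pd.read_csv(io.StringIO(raw), skiprows=2, skipinitialspace=True)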

In [3]:
# Checking what is in unbalanced_train using the .head() method
 unbalanced_train.head() 
Out[3]:
# YTID start_seconds end_seconds positive_labels
0 ---1_cCGK4M 0.0 10.0 /m/01g50p,/m/0284vy3,/m/06d_3,/m/07jdr,/m/07rwm0c
1 ---2_BBVHAA 30.0 40.0 /m/09x0r
2 ---B_v8ZoBY 30.0 40.0 /m/04rlf
3 ---EDNidJUA 30.0 40.0 /m/02qldy,/m/02zsn,/m/05zppz,/m/09x0r
4 ---N4cFAE1A 21.0 31.0 /m/04rlf,/m/09x0r

The output of unbalanced_train.head() shows the first five rows of the data with all of its columns. Other useful methods for getting to know the data include .shape, .info(), and .describe(). Now for the tool this post is really about, one used very often in the data science community: the .loc indexer. pd.DataFrame.loc retrieves the required rows/columns by label and also accepts a boolean condition. An example makes this clearer: suppose that, in the data above, you want every row (with all columns) whose "start_seconds" is 30.0.

In [5]:
# To get all the rows with "start_seconds" as 30.0 using pd.DataFrame.loc()
 unbalanced_train.loc[unbalanced_train['start_seconds']==30.0] 
Out[5]:
# YTID start_seconds end_seconds positive_labels
1 ---2_BBVHAA 30.0 40.0 /m/09x0r
2 ---B_v8ZoBY 30.0 40.0 /m/04rlf
3 ---EDNidJUA 30.0 40.0 /m/02qldy,/m/02zsn,/m/05zppz,/m/09x0r
5 ---fcVQUf3E 30.0 40.0 /m/019jd,/m/07yv9
6 ---g9OGAhwc 30.0 40.0 /m/04rlf,/m/0c1dj
7 ---lTs1dxhU 30.0 40.0 /m/012f08,/m/07yv9,/m/0k4j,/t/dd00134
8 ---mO--kRQk 30.0 40.0 /m/04rlf
9 ---qub7uxgs 30.0 40.0 /m/09x0r
10 ---yQzzLcFU 30.0 40.0 /t/dd00067
14 --0CNhurbZE 30.0 40.0 /m/09x0r
15 --0F7kbzAMA 30.0 40.0 /m/09x0r
16 --0FMNFsVeg 30.0 40.0 /m/018vs,/m/0342h,/m/04rlf,/m/04szw
17 --0MF9K5N30 30.0 40.0 /m/09x0r
18 --0Oh0JxzjQ 30.0 40.0 /m/0342h,/m/04rlf,/m/04szw,/m/0fx80y
19 --0PQM4-hqg 30.0 40.0 /m/07swgks,/m/0j2kx,/m/0j6m2
21 --0_x3T5DQI 30.0 40.0 /m/04rlf
22 --0aJtOMp2M 30.0 40.0 /m/04rlf,/m/09x0r
23 --0bntG9i7E 30.0 40.0 /t/dd00004
24 --0fYwELbpk 30.0 40.0 /m/04rlf,/m/0ggx5q
25 --0fim4-6Ig 30.0 40.0 /m/0342h,/m/04rlf,/m/04szw,/m/0fx80y
26 --0pVlB7mQ8 30.0 40.0 /m/09x0r
27 --0ukMG7yH4 30.0 40.0 /m/09x0r
29 --0w1YA1Hm4 30.0 40.0 /m/01bjv,/m/07yv9,/m/09x0r
30 --11PIhoFjg 30.0 40.0 /m/07qmpdm
31 --12UOziMF0 30.0 40.0 /m/07r04,/m/07yv9
32 --18xICqMoU 30.0 40.0 /m/04rlf,/m/09x0r
33 --1Bi_4aXyU 30.0 40.0 /m/09x0r
34 --1NzL0juik 30.0 40.0 /m/026t6,/m/02hnl,/m/04rlf,/m/04szw,/m/0bm02
35 --1QYEYcVBs 30.0 40.0 /m/04rlf,/m/09x0r
36 --1UKJ18WMQ 30.0 40.0 /m/0342h,/m/042v_gx,/m/04rlf,/m/04szw,/m/0fx80y
... ... ... ... ...
2041719 zzpIaP3MECw 30.0 40.0 /m/09x0r
2041721 zzpS4GC2Ras 30.0 40.0 /m/09x0r
2041724 zzqOkb3gs3g 30.0 40.0 /m/04rlf,/m/05rwpb
2041726 zzqgo1EZcUw 30.0 40.0 /m/04rlf
2041728 zzrSHAaKXks 30.0 40.0 /m/04rlf,/m/04wptg
2041730 zzrVZdgnqvE 30.0 35.0 /m/09x0r
2041736 zztVtkqQ05Q 30.0 40.0 /m/04rlf,/t/dd00125
2041737 zztaj6UYKtg 30.0 40.0 /m/09x0r
2041739 zztpLmPdJsk 30.0 40.0 /m/09x0r
2041741 zztvx3WUBss 30.0 40.0 /m/07qv_x_,/m/09x0r
2041744 zzuRyw5THPA 30.0 40.0 /m/09x0r
2041745 zzujgMNqCn8 30.0 40.0 /m/04rlf,/m/09x0r
2041746 zzupdZ6vp5E 30.0 40.0 /m/04rlf,/m/074ft
2041747 zzuvrD-TX5U 30.0 40.0 /m/04rlf
2041748 zzv-5EfGxnw 30.0 40.0 /m/04rlf
2041749 zzvBer2RKbw 30.0 40.0 /m/04rlf,/m/07xzm,/m/09x0r
2041750 zzvE8Wy_Pxw 30.0 40.0 /m/09x0r
2041752 zzvOEksj2V0 30.0 40.0 /m/09x0r
2041753 zzvSVusPPgM 30.0 40.0 /m/025rv6n,/m/07st89h,/m/09b5t
2041758 zzvsKixmx7s 30.0 40.0 /m/01x3z
2041759 zzvtIpzFF9k 30.0 40.0 /m/01g50p,/m/06d_3,/m/07jdr,/m/07yv9
2041760 zzw7ibZwvAk 30.0 40.0 /m/04rlf,/m/09x0r
2041761 zzwBazlj0Oc 30.0 40.0 /m/015p6,/m/0h0rv,/t/dd00125
2041763 zzwEaUOXSZs 30.0 40.0 /m/0jbk
2041768 zzx-lGzV5dE 30.0 40.0 /m/03kmc9,/m/04qvtq,/m/04rlf
2041771 zzxSVJFyAuc 30.0 40.0 /m/01bjv,/m/07yv9,/m/09x0r
2041772 zzxUh44r4Wg 30.0 40.0 /m/09x0r
2041778 zzy_5evgl3Y 30.0 40.0 /m/04rlf,/m/09x0r
2041781 zzyjKSkzyys 30.0 40.0 /t/dd00038
2041786 zzz3PZXRQ_8 30.0 40.0 /m/030rvx,/m/09x0r

1108451 rows × 4 columns

In [7]:
# Get the number of rows that have "start_seconds" as 30.0 using .loc and .shape
 print("Number of rows :", unbalanced_train.loc[unbalanced_train['start_seconds'] == 30.0].shape[0]) 
Number of rows : 1108451
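
As a side note (a small equivalent sketch, not from the original post), the same count can also be obtained without indexing at all, since the boolean comparison already yields a mask whose True values sum to the row count:

# True counts as 1, so summing the boolean mask gives the number of matching rows
print("Number of rows :", (unbalanced_train['start_seconds'] == 30.0).sum())

# len() on the filtered DataFrame gives the same figure
print("Number of rows :", len(unbalanced_train.loc[unbalanced_train['start_seconds'] == 30.0]))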

Now let's try a more complex pd.DataFrame.loc operation: retrieving only the '# YTID' column for the rows whose 'start_seconds' is 30.0 and whose 'positive_labels' is '/m/04rlf'.

In [13]:
# To get only the '# YTID' column for rows with start_seconds as 30.0 and positive_labels as '/m/04rlf'
 unbalanced_train['# YTID'].loc[unbalanced_train['start_seconds'].apply(lambda x: x==30.0) & unbalanced_train['positive_labels'].apply(lambda x: x=='/m/04rlf')] 
Out[13]:
2          ---B_v8ZoBY
8          ---mO--kRQk
21         --0_x3T5DQI
53         --2URMA4_Gw
65         --3FNDQ-EfE
76         --3lOAqWlwI
95         --4r-bP8sfE
96         --4rtcXPH0c
101        --59h67y8BI
141        --86_7_wT90
163        --9qaflvptM
169        --ALJdZpYoA
170        --ALMd2UbwA
191        --BnbZUiZ8o
200        --C6HJ_gBPI
206        --CIar_Kl4Y
210        --CZ-HqbveY
221        --DbgPXwLlM
225        --Dqm421zY0
246        --F4laYscYE
247        --F6xCVmhuU
249        --F7pocClyA
254        --FGZJSouKc
256        --FMnzXQLik
257        --FNPPU2uzQ
260        --FidOvUmIc
264        --G3Nkp8kbY
270        --GV991-tmA
279        --GreAs6Iow
285        --GyJoJ02e0
              ...     
2041449    zz80j573SJU
2041462    zz9leeFMyqw
2041478    zzCFMqJiuYs
2041483    zzCs4zWeUGY
2041510    zzIL0eUHAPg
2041543    zzPrQ7spgdI
2041557    zzSObCSwBbk
2041562    zzSuOlmrV3Q
2041578    zzUsPSz-zgs
2041582    zzV_cmNYtJ8
2041589    zzX2tURtVis
2041605    zzZVFbUEnvE
2041611    zz_fxoOHGHI
2041632    zzciWUGP7gM
2041635    zzdxIFzdoDk
2041638    zze9u5mUQ9A
2041639    zzeF42VLtoA
2041641    zzeIm60n7Sw
2041643    zzedHSSqPbk
2041652    zzgXY8eh9Cw
2041672    zzjIFKDKpxo
2041685    zzlBKQxlOK0
2041700    zzml2Gi5z4k
2041704    zznKX1VBgh0
2041711    zznpF3_lJUo
2041713    zzo8fWN8vhw
2041718    zzpAZXdx1gg
2041726    zzqgo1EZcUw
2041747    zzuvrD-TX5U
2041748    zzv-5EfGxnw
Name: # YTID, Length: 220976, dtype: object

The above code may look a little complex, but it is simple to decipher: .apply with a lambda evaluates the condition on every row of each column, giving two boolean Series; the "&" operator combines them element-wise; and .loc then keeps only the rows where the combined mask is True.
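
For reference, the same selection can be written without .apply and lambda, using plain vectorized comparisons and passing the column label directly to .loc; this sketch should return the same Series:

mask = (unbalanced_train['start_seconds'] == 30.0) & \
       (unbalanced_train['positive_labels'] == '/m/04rlf')
unbalanced_train.loc[mask, '# YTID']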
