Making your data behave: this is the first and foremost challenge a data scientist has to tackle. Datasets can be found almost anywhere and in all kinds of formats, and bringing them into a standard form can be tricky business. Here we shall have a look at data frames: how to create them from scratch and how to read data from different files into them.
This guide concentrates on applications so that you can get started with data science problems. The bare minimum code is given here.
What is a data frame?
First, we need to know what a data frame is. A data frame is very similar to a matrix or a two-dimensional array, except that each column can hold data of a different class (data type), while all values within one column share the same type. In addition, each column can be given a heading, which is usually taken from the first row of the source file. The columns can be accessed through the column number or these headings. This makes data frames much easier to handle in code than Excel files and other tables.
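To make the description above concrete, here is a minimal sketch of a data frame whose columns have different types. The names people and active are made up for illustration, and stringsAsFactors = FALSE is set explicitly so that the text column stays a character column on older versions of R as well.

```r
# A small data frame with three columns of different types
# (the names "people" and "active" are illustrative, not from the article)
people <- data.frame(
  name   = c("alpha", "bravo", "charlie"),  # character column
  id     = 16:18,                            # integer column
  active = c(TRUE, FALSE, TRUE),             # logical column
  stringsAsFactors = FALSE                   # keep text as character, not factor
)

# Each column keeps its own class
print(sapply(people, class))

# Access a column by its heading or by its position
people$id        # by heading
people[["id"]]   # by heading, given as a string
people[[2]]      # by position (second column)

# Access a single cell: row 3 of the "name" column
people[3, "name"]
```

Accessing by heading is usually the safer choice, because the code keeps working if the column order changes.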
How to create your data frame
Creating your own data frame is very easy: use the data.frame() function in R. It accepts many arguments, but the essentials are the column names and the corresponding data. The following is the syntax for creating a data frame.
> dframe <- data.frame(name = c("alpha", "bravo", "charlie", "delta"), id = 16:19)
> dframe
     name id
1   alpha 16
2   bravo 17
3 charlie 18
4   delta 19
Beyond this, data.frame() accepts several more arguments for additional options; you can check them out on its help page by running ?data.frame. When you check the class of this data frame, this is what you get:
> class(dframe)
[1] "data.frame"
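Note: To check the first few rows of a data frame, we use the head function, and for the last few, we use the tail function. Both show 6 rows by default; pass a second argument to change that, for example head(dframe, 5) for the first 5 rows.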
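As a quick sketch of the note above: head() and tail() return six rows unless told otherwise, and the n argument changes that. The data frame nums is invented here only so that there are enough rows to see the difference.

```r
# Illustrative data frame with 10 rows (not from the article)
nums <- data.frame(n = 1:10, square = (1:10)^2)

head(nums)          # first 6 rows (the default)
tail(nums)          # last 6 rows (the default)
head(nums, 5)       # first 5 rows
tail(nums, n = 3)   # last 3 rows
```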
Creating data frames from a file
For large data science problems, we do not normally create our own data frame. We use datasets found online or collected from different systems, usually stored in .csv or .xlsx format. You can find a lot of helpful datasets on the UCI Machine Learning Repository and Kaggle. Here, I am showing you the example of the India Water Quality Data file.
> fileframe <- read.csv(file = "C://IndiaAffectedWaterQualityAreas.csv")
> head(fileframe)
  State.Name     District.Name     Block.Name     Panchayat.Name  Village.Name             Habitation.Name                      Quality.Parameter Year
1 ANDHRA PRADESH EAST GODAVARI(04) PRATHIPADU(10) GOKAVARAM(04)   VANTHADA(014 )           VANTHADA(0404410014010400)           Salinity          1/4/2009
2 ANDHRA PRADESH EAST GODAVARI(04) PRATHIPADU(10) GOKAVARAM(04)   PANDAVULAPALEM(022 )     PANDAVULAPALEM(0404410022010400)     Fluoride          1/4/2009
3 ANDHRA PRADESH EAST GODAVARI(04) PRATHIPADU(10) GAJJANAPUDI(06) G. KOTHURU(023 )         G. KOTHURU(0404410023010600)         Salinity          1/4/2009
4 ANDHRA PRADESH EAST GODAVARI(04) PRATHIPADU(10) GAJJANAPUDI(06) GAJJANAPUDI(029 )        GAJJANAPUDI(0404410029010600)        Salinity          1/4/2009
5 ANDHRA PRADESH EAST GODAVARI(04) PRATHIPADU(10) CHINTALURU(10)  CHINTALURU(028 )         CHINTALURU(0404410028011000)         Salinity          1/4/2009
6 ANDHRA PRADESH EAST GODAVARI(04) PRATHIPADU(10) ELURU(16)       P. JAGANNADHAPURAM(035 ) P. JAGANNADHAPURAM(0404410035011600) Fluoride          1/4/2009
The read.csv() function reads a csv (comma-separated values) file into a data frame in R. It takes the path of the file on your computer. On my PC, the file was stored directly under the C drive.
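If you do not have the water quality file at hand, the same call can be tried on a tiny file. The sketch below first writes a two-row csv to a temporary location and then reads it back; the contents and the names csv_path and small are assumptions made for the example.

```r
# Write a tiny csv to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,id", "alpha,16", "bravo,17"), csv_path)

# header = TRUE (the default for read.csv) uses the first line as column headings
# stringsAsFactors = FALSE keeps text columns as plain character vectors
small <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)

str(small)     # shows the type R inferred for each column
nrow(small)    # number of data rows, excluding the header line
```

Running str() right after reading a file is a cheap way to spot columns that were read with an unexpected type, such as dates read as text.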
Similarly, there are other functions to read from different types of files. To read from an Excel file, you need the read_excel() function found in the readxl library. It is used in much the same way as the read.csv() function: you pass it the path of the file. The library has to be installed once with install.packages("readxl") and then loaded with library(readxl).
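A hedged sketch of reading an Excel file follows. The file name water_quality.xlsx is hypothetical, so the code only attempts the read when both the readxl package and the file are available. Note that read_excel() returns a tibble, which is converted here to a plain data frame.

```r
# One-time installation (uncomment to run); needs an internet connection
# install.packages("readxl")

# "water_quality.xlsx" is a hypothetical file name used for illustration
xlsx_path <- "water_quality.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(xlsx_path)) {
  # sheet = 1 reads the first worksheet of the workbook
  excelframe <- readxl::read_excel(xlsx_path, sheet = 1)
  # read_excel() returns a tibble; convert it if a plain data frame is wanted
  excelframe <- as.data.frame(excelframe)
  print(head(excelframe))
} else {
  message("Install readxl and point xlsx_path at a real file to run this.")
}
```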
Creating a file from a data frame
To save a data frame as a file on your computer, use the write.csv() function. Remember the dframe data frame that we created earlier? Let’s put that into a csv file. It takes only one line of code, a call to write.csv() with the data frame and the name of the file to create.
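The one-line save might look like the sketch below. The file name dframe.csv is an assumption made for illustration, and the data frame is recreated first so that the snippet runs on its own.

```r
# Recreate the data frame from earlier so the example is self-contained
dframe <- data.frame(name = c("alpha", "bravo", "charlie", "delta"), id = 16:19)

# "dframe.csv" is an illustrative file name; a relative name like this one
# is saved in the current working directory.
# row.names = FALSE leaves out the 1, 2, 3, ... row labels.
write.csv(dframe, file = "dframe.csv", row.names = FALSE)
```

Without row.names = FALSE, the row labels are written as an extra unnamed first column, which then shows up as a column called X when the file is read back.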
The file is saved under the specified filename in your current working directory. In my case the working directory is Documents. You can find your working directory using the getwd function.
> getwd()
[1] "C:/Users/nEW u/Documents"
Now let’s check the Documents folder: the new csv file should be there.
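The folder can also be checked from R itself instead of a file browser. This is a small sketch that assumes the data frame was saved as dframe.csv (an illustrative name) in the working directory; it repeats the save so that it runs on its own.

```r
# Save the example data frame under an illustrative file name
dframe <- data.frame(name = c("alpha", "bravo", "charlie", "delta"), id = 16:19)
write.csv(dframe, file = "dframe.csv", row.names = FALSE)

# Build the full path of the file inside the working directory
saved_path <- file.path(getwd(), "dframe.csv")

file.exists(saved_path)                     # TRUE if the file was written
list.files(getwd(), pattern = "\\.csv$")    # all csv files in the working directory
```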
Similarly, to write to an Excel file, we should use the write.xlsx() function found in the xlsx library. It takes the same arguments; the only change is that the file extension has to be “.xlsx” instead of “.csv”. The library can be installed with install.packages("xlsx") and loaded with library(xlsx).
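A minimal sketch of the Excel export is given below. It assumes the xlsx package is installed, which in turn needs a working Java installation, so the write is attempted only when the package can be loaded. The file name dframe.xlsx is illustrative.

```r
# One-time installation (uncomment to run); the xlsx package also needs Java
# install.packages("xlsx")

# Recreate the data frame from earlier so the example is self-contained
dframe <- data.frame(name = c("alpha", "bravo", "charlie", "delta"), id = 16:19)

if (requireNamespace("xlsx", quietly = TRUE)) {
  # "dframe.xlsx" is an illustrative file name, saved in the working directory
  xlsx::write.xlsx(dframe, file = "dframe.xlsx", row.names = FALSE)
} else {
  message("The xlsx package is not available; install it to run this step.")
}
```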
Data frames are a lot easier to handle than Excel files, and they make data cleaning much simpler. Adding or removing columns, creating copies, and applying formulae and statistical models are all straightforward. This article should help you at least load your data into R and then let your code do the talking. You can keep it open in a tab whenever you are handling data for a data science problem in R.