Gaston Sanchez
- Practicing with the command line
- Navigating the filesystem and managing files
- Practice basic manipulation of data files
- Importing Data Tables in R
- Default reading-table functions
- Write your descriptions, explanations, and code in an `Rmd` (R markdown) file.
- Name this file as `lab03-first-last.Rmd`, where `first` and `last` are your first and last names (e.g. `lab03-gaston-sanchez.Rmd`).
- Knit your `Rmd` file as an html document (default option).
- Submit your `Rmd` and `html` files to bCourses, in the corresponding lab assignment.
- Due date displayed in the syllabus (see github repo).
The first part of the lab involves navigating the file system and manipulating files (and directories) with the following basic shell commands:
- `pwd`: print working directory
- `ls`: list files and directories
- `cd`: change directory (move to another directory)
- `mkdir`: create a new directory
- `touch`: create a new (empty) file
- `cp`: copy file(s)
- `mv`: move or rename file(s)
- `rm`: delete file(s)
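As a minimal sketch of how these commands fit together, the following runs each of them once in a throwaway directory. The names `scratch`, `notes.txt`, `todo.txt`, and `backup` are made up for illustration and are not part of the lab tasks.

```shell
# Work in a throwaway directory so none of your real files are touched.
# (scratch, notes.txt, todo.txt, and backup are hypothetical names.)
scratch=$(mktemp -d)
cd "$scratch"

pwd                      # print the working directory (the scratch folder)
mkdir backup             # create a new directory
touch notes.txt          # create a new (empty) file
cp notes.txt backup/     # copy the file into backup/
mv notes.txt todo.txt    # rename notes.txt as todo.txt
ls                       # lists backup and todo.txt
rm backup/notes.txt      # delete the copy inside backup/
ls backup                # prints nothing: backup/ is empty again
```

Because everything happens inside the directory created by `mktemp -d`, the sketch can be re-run safely; each run starts from a new empty folder.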
If you are using git-bash (i.e. your OS is Windows) you don’t have the `man` command to see the manual documentation of other commands. In this case you can check the man pages online: http://man7.org/linux/man-pages/index.html
Write your bash commands inside a chunk that is NOT evaluated. One way to do this is to add the option `eval = FALSE` inside the curly braces of the chunk header, e.g. `{bash, eval = FALSE}`.
- Open (or launch) the command line
- Use `mkdir` to create a new directory `stat133-lab03`
- Change directory to `stat133-lab03`
- Use the command `curl` to download the following text file:
# the option is the letter O (Not the number 0)
curl -O http://textfiles.com/food/bread.txt
- Use the command `ls` to list the contents in your current directory
- Use the command `curl` to download these other text files:
- Use the command `curl` to download the following csv files:
  - http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv
  - http://web.pdx.edu/~gerbing/data/cars.csv
  - http://web.pdx.edu/~gerbing/data/color.csv
  - http://web.pdx.edu/~gerbing/data/snow.csv
  - http://web.pdx.edu/~gerbing/data/mid1.csv
  - http://web.pdx.edu/~gerbing/data/mid2.csv
  - http://web.pdx.edu/~gerbing/data/minutes1.csv
  - http://web.pdx.edu/~gerbing/data/minutes2.csv
- Now try `ls -l` to list the contents in your current directory in long format
- Look at the `man` documentation of `ls` to find out how to list the contents in reverse order
- How would you list the contents in long format arranged by time?
- Find out how to use the wildcard `*` to list all the files with extension `.txt`
- Use the wildcard `*` to list all the files with extension `.csv` in reverse order
- You can use the character `?` to represent a single character: e.g. `ls mid?.csv`. Find out how to use the wildcard `?` to list `.csv` files with names made of 4 characters (e.g. `mid1.csv`, `snow.csv`)
- The command `ls *[1]*.csv` should list `.csv` files with names containing the number 1 (e.g. `mid1.csv`, `minutes1.csv`). Adapt the command to list `.csv` files with names containing the number 2.
- Find out how to list files with names containing any number.
- Inside `stat133-lab03` create a directory `data`
- Change directory to `data`
- Create a directory `txt-files`
- Create a directory `csv-files`
- Use the command `mv` to move the `bread.txt` file to the folder `txt-files`
- Use the wildcard `*` to move all the text files to the directory `txt-files`
- Use the wildcard `*` to move all the `.csv` files to the directory `csv-files`
- Go back to the parent directory `stat133-lab03`
- Create a directory `copies`
- Use the command `cp` to copy the `bread.txt` file (the one inside the folder `txt-files`) to the `copies` directory
- Use the wildcard `*` to copy all the `.txt` files to the directory `copies`
- Use the wildcard `*` to copy all the `.csv` files to the directory `copies`
- Change to the directory `copies`
- Use the command `mv` to rename the file `bread.txt` as `bread-recipe.txt`
- Rename the file `Fisher.csv` as `iris.csv`
- Rename the file `btaco.txt` as `breakfast-taco.txt`
- Change to the parent directory (i.e. `stat133-lab03`)
- Rename the directory `copies` as `copy-files`
- Find out how to use the `rm` command to delete the `.csv` files that are in `copy-files`
- Find out how to use the `rm` command to delete the directory `copy-files`
- List the contents of the directory `txt-files` displaying the results in reverse (alphabetical) order
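The wildcard tasks above rely on the shell expanding a pattern into matching file names before the command runs. Here is a small sketch of that behavior, using made-up file names in a scratch directory (`demo`, `a1.csv`, `b2.csv`, `cc.csv`, `long-name.csv`, `notes.txt`, `todo.txt`, and `text` are all hypothetical, not the lab's files):

```shell
# Scratch directory with made-up file names (not the lab's files).
demo=$(mktemp -d)
cd "$demo"
touch a1.csv b2.csv cc.csv long-name.csv notes.txt todo.txt
mkdir text

ls *.txt         # * matches any run of characters: notes.txt, todo.txt
ls ??.csv        # each ? matches exactly one character: a1.csv, b2.csv, cc.csv
ls [ab]*.csv     # [ab] matches one character that is a or b: a1.csv, b2.csv
mv *.txt text/   # the shell expands *.txt first, then mv moves both files
ls text          # notes.txt, todo.txt
```

Note that the pattern is resolved by the shell, not by `ls` or `mv`, which is why the same wildcard works with any command that accepts file names.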
If you are already familiar with the basic bash commands to navigate the filesystem (or if you want to expand your R skills), use the R functions to manipulate files and directories to perform the exact same tasks from within R. See `?files` for more information.
getwd()
setwd()
download.file()
dir.create()
list.files()
list.dirs()
file.create()
file.copy()
file.rename()
file.remove()
The second part of the lab involves importing the Abalone Data Set that is part of the UCI Machine Learning Repository.
The location of the data file is:
http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
The location of the data dictionary (description of the data) is:
http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names
Look at both the dataset file, and the file with its description, and answer the following questions:
- What’s the character delimiter?
- Is there a row for column names?
- Are there any missing values? If so, how are they codified?
- What is the data type of each column?
One basic way to read this file in R is by passing the url location of the file directly to any of the `read.table()` functions:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone <- read.table(url, sep = ",")
When reading datasets from the Web, my suggestion is to always try to get a local copy of the data file on your machine (as long as you have enough free space to save it). To do this, you can use the function `download.file()`, specifying the url address and the name of the file that will be created on your computer. For instance, to save the abalone data file in your working directory, type the following commands directly on the R console:
# do NOT include this code in your Rmd file
# download copy to your working directory
origin <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
destination <- 'abalone.data'
download.file(origin, destination)
Before describing some of the reading-table functions in R, let’s practice some basic bash commands to inspect the downloaded data file. Include the commands in your Rmd file inside an unevaluated code chunk.
- Use the `file` command to know what type of file is `abalone.data`.
- Use the word count command `wc` to obtain information about: 1) newline count, 2) word count, and 3) byte count, of the `abalone.data` file.
- See the `man` documentation of `wc` and learn what option you should use to obtain only the number of lines in `abalone.data`.
- Use `head` to take a peek at the first lines (10 lines by default) of `abalone.data`.
- See the `man` documentation of `head` and learn what option you should use to display only the first 5 lines in `abalone.data`.
- Use `tail` to take a peek at the last lines (10 lines by default) of `abalone.data`.
- See the `man` documentation of `tail` and learn what option you should use to display only the last 3 lines in `abalone.data`.
- Use the `less` command to look at the contents of `abalone.data` (this command opens a paginator so you can move up and down the contents of the file). Press the key `q` to exit the paginator.
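To see what these inspection commands report without touching the real data, here is a minimal sketch on a tiny made-up file. The file `sample.data`, its 7 rows, and the variable names are invented for illustration; try the same ideas on `abalone.data` yourself.

```shell
# Build a small made-up comma-separated file (7 lines) in a scratch directory.
demo=$(mktemp -d)
cd "$demo"
printf '%s\n' 'M,0.455,15' 'M,0.350,7' 'F,0.530,9' 'M,0.440,10' \
              'I,0.330,7' 'I,0.425,8' 'F,0.530,20' > sample.data

wc sample.data                   # newline, word, and byte counts, then the file name
n_lines=$(wc -l < sample.data)   # -l keeps only the newline count; '<' omits the file name
head -n 2 sample.data            # first 2 lines instead of the default 10
tail -n 1 sample.data            # last line instead of the default 10
echo "number of lines:" $n_lines
```

Storing the count in a variable, as done with `n_lines`, is optional; it is only a convenient way to reuse the number later in a script.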
Now that you have a local copy of the dataset, you can read it in R with `read.table()` like so:
# reading data from your working directory
abalone <- read.table("abalone.data", sep = ",")
Once you read a data table, you may want to start looking at its contents, usually taking a peek at a few rows. This can be done with `head()` and/or with `tail()`:
# take a peek of first rows
head(abalone)
# take a peek of last rows
tail(abalone)
Likewise, you may also want to examine how R has decided to take care of the storage details (what data type is used for each column?). Use the function `str()` to check the structure of the data frame:
# check data frame's structure
str(abalone, vec.len = 1)
So far we have been able to read the data file in R. But we are missing a few things. First, we don’t have names for the columns. Second, it would be nice if we could specify the data types of each column instead of letting R guess how to handle each data type.
Look at the data description (see “Attribute information”) in the following link:
http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names
According to the description of the Abalone data set, we could assign the following data types to each of the columns:
Name | Data Type |
---|---|
Sex | character |
Length | continuous |
Diameter | continuous |
Height | continuous |
Whole weight | continuous |
Shucked weight | continuous |
Viscera weight | continuous |
Shell weight | continuous |
Rings | integer |
- Create a vector `column_names` for names of each column. Use the names displayed in the section “7. Attribute information”.
- Create another vector `column_types` with R data types (e.g. `character`, `real`, `integer`). Match the R data types with the suggested type in “7. Attribute information” (nominal = `character`, continuous = `real`, integer = `integer`).
- Optionally, you could also specify a type “factor” for the variable `sex` since this is supposed to be in nominal scale (i.e. it is a categorical variable). Also note that the variable `rings` is supposed to be integers, therefore we can choose an `integer` vector for this column.
- Look at the documentation of the function `read.table()` and try to read the `abalone.data` table in R. Find out which arguments you need to specify so that you pass your vectors `column_names` and `column_types` to `read.table()`. Read in the data as `abalone`, and then check its structure with `str()`.
- Now re-read `abalone.data` with the `read.csv()` function. Name this data as `abalone2`, and check its structure with `str()`.
- How would you read just the first 10 lines in `abalone.data`? Name this data as `abalone10`, and check its structure with `str()`.
- How would you skip the first 10 lines in `abalone.data`, in order to read the next 10 lines (lines 11-20)? Name this data as `abalone20`, and check its structure with `str()`.
- Use R functions to compute descriptive statistics, and confirm the following statistics. Your output does not have to be in the same format as the table below. The important thing is that you begin learning how to manipulate columns (or vectors) of a data.frame.
Length Diam Height Whole Shucked Viscera Shell Rings
Min 0.075 0.055 0.000 0.002 0.001 0.001 0.002 1
Max 0.815 0.650 1.130 2.826 1.488 0.760 1.005 29
Mean 0.524 0.408 0.140 0.829 0.359 0.181 0.239 9.934
SD 0.120 0.099 0.042 0.490 0.222 0.110 0.139 3.224