| description |
| --- |
| Basic operations on data using the command line |
This notebook explores ways to do basic data analysis on files using shell commands. This set of commands is targeted at Linux or command-line users who need to perform basic data-related tasks that you might otherwise do in Excel or pandas.
WHY use this? - you may ask
- You may not have access to Excel
- You may not have a UI for the OS (e.g. working on a remote server)
- You may not have access to pandas (the Python library)
- The file is too large to load in any UI & takes a lot of time
- You might quickly want to glance through what the data contains
This requires you to know some basics about the operating system, how to use the shell / command line, and a little about regular expressions.
For the purpose of this notebook, we are going to be using the `flags.csv` data available in this repo.
NOTE - before you begin with the commands, ensure your command line is in the directory that contains the flags.csv file
Also, for every command that you see in this notebook, you can always check the `--help` flag or the `man` page to know more details about it. E.g. running `head --help` will provide the necessary details for the `head` command.
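For example, either of the following will show the documentation for `head`:
# the manual page (press `q` to exit)
man head
# the built-in quick help
head --help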
Firstly, you want to open the file and see what's in it. Here's a way to view the first `n` lines to get an idea about what the data contains.
head -1 flags.csv
will display the first line from the file. This would be the column titles (if the data contains them), else it would be the first row of the data file.
Each column would be separated by the delimiter used in the file. This could be a tab or a `,` or anything else that is distinct enough and not contained in the data itself. Rows, on the other hand, are separated by newlines.
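If you are not sure what the delimiter is, one way to check (assuming GNU coreutils) is to make the invisible characters visible:
# `cat -A` shows tabs as ^I and the end of each line as $
head -1 flags.csv | cat -A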
If the delimiter is different from a space, for example a `:`, you can print the column titles, each on a new line. To do that, pipe the output into the `sed` command like
# you can replace the `:` char, with any other delimiter
head -1 flags.csv | sed -e 's/:/\n/g'
If you want to know the number of columns that are present, append the `wc -l` command at the end of the above command
# assuming that the delimiter is a `:` char
head -1 flags.csv | sed -e 's/:/\n/g' | wc -l
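If you prefer, an equivalent one-liner uses `awk`'s built-in field count `NF`:
# assuming that the delimiter is a `:` char
head -1 flags.csv | awk -F: '{print NF}'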
We still need to understand what the data contains
Display top `n` & bottom `n` rows
head -n10 flags.csv
tail -n10 flags.csv
As the names `head` and `tail` suggest, you can view the top and bottom rows.
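You can also combine the two to peek at rows in the middle of the file; for example:
# rows 16 through 20: take the first 20 lines, then the last 5 of those
head -20 flags.csv | tail -5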
Now that we have viewed the file the way we want, we would want to get some basic information about the file - like the total number of rows, the data types of each column etc.
Row Count
If you want to know the total number of rows in the file, use the following; the `-l` flag stands for lines.
wc -l flags.csv
NOTE - `wc` stands for word count & you'll see this being used often
Word Count
wc flags.csv
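Without any flags, `wc` prints three numbers - lines, words and bytes, in that order - followed by the file name. Each is also available individually:
wc -l flags.csv   # lines only
wc -w flags.csv   # words only
wc -c flags.csv   # bytes only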
Specific Columns
Let's view a specific column in a little more detail
# displays head of column 10
awk -F: '{print $10}' flags.csv | head -5
In the above command, we selected the column by its index, i.e. the number 10. But how did we find that index?
Get index of a column by its name
Let's say you want to know the index of the column `Continent`, you would do
# assuming that your delimiter is the `:` char
head -1 flags.csv | sed -e 's/:/\n/g' | grep -n Continent | cut -d: -f1
If the column name does not exist, or there's a typo, then it will not return anything. Now, using the index, you can display only that particular column. You can also print multiple columns separated by a tab (denoted by `"\t"`)
awk -F: '{print $10 "\t" $11}' flags.csv | head -5
Once we view the data, we might find some of the columns unnecessary. To view only the required columns, we can run the following command.
cut -d: -f1,10,11 flags.csv
This will only show the 1st, 10th and 11th column.
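`cut` also accepts ranges, so contiguous columns don't have to be listed one by one:
# columns 1 through 3, plus column 10
cut -d: -f1-3,10 flags.csv | head -5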
The next basic activity one would perform on a data file is sorting. You can sort a file using the `sort` command.
# -r is for reverse/descending order
sort -r flags.csv
By default, this will sort based on the first column. But you can even sort based on a specific column.
Let's sort the data by the 10th column, i.e. by `Name` (the country name), in descending order
# -t sets the delimiter, -k is for column #, -r is for reverse
sort -t: -k10,10 -r flags.csv
We can apply this command to only the columns that we need by
cut -d: -f1,10,11 flags.csv | sort -t: -k2,2 -r
Note, once we apply the first `cut` command, we only have 3 columns to view, i.e. the `ID`, the `Name` and the `Continent`. Here, the indices also get updated, i.e. to 1, 2, and 3 respectively. That is why the part following the `|` in the above command refers to `Name` as `-k2`.
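Note that `sort` splits on whitespace by default, which is why the commands above pass `-t:` to make it split on the same delimiter `cut` uses. For instance, to sort the cut-down output numerically by the `ID` column:
# -n sorts numerically instead of lexicographically
cut -d: -f1,10,11 flags.csv | sort -t: -k1,1n | head -5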
Let's filter on the `Continent` column and view only Europe
cut -d: -f1,10,11 flags.csv | grep Europe
and if you want to see the number of countries within Europe, you can
cut -d: -f1,10,11 flags.csv | grep Europe | wc -l
# 45 = the number of rows filtered by `Europe`
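One caveat: `grep Europe` matches the string anywhere in the row, not just in the `Continent` column. If that matters for your data, an `awk` filter (assuming `Continent` is column 11, as above) matches the column exactly:
# print only rows whose 11th column is exactly `Europe`
awk -F: '$11 == "Europe"' flags.csv | head -5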
Finally, let's say you want the sum total of the 2nd column after filtering for `Europe`. We'll do this in two steps
Step 1 - Sum of a column
awk -F: '{sum += $2} END {print sum}' flags.csv
Step 2 - Filtering & summing
We already saw how to filter by a column. Let's combine that with the summing command using the `|` (pipe) operator
cut -d: -f1,2,10,11 flags.csv | grep Europe | awk -F: '{sum += $2} END {print sum}'
NOTE - we need to include the column to be summed, i.e. column 2, in the `cut`; otherwise it never reaches the `awk` command.
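The same pattern also gives you an average; a small sketch:
# average of column 2 over the rows matching `Europe`
cut -d: -f1,2,10,11 flags.csv | grep Europe | awk -F: '{sum += $2; n++} END {if (n) print sum / n}'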
Similarly, if you want to view all the continents other than Europe, then simply add the `-v` option to the `grep`
cut -d: -f1,10,11 flags.csv | grep -cv Europe
The `-c` in the `grep` command will return only the count by default, instead of you having to append `wc -l` at the end of the line.
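To actually see those rows instead of just counting them, drop the `-c`:
cut -d: -f1,10,11 flags.csv | grep -v Europe | head -5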
Unique Values
For example, if you want to view all unique values of the `Continent` column
# 11th column points to the Continent column
awk -F: '{print $11}' flags.csv | sort -u
But if you want to sort the column numerically, say column 2, then you also add the `-n` flag like
# 2nd column is a numeric column
awk -F: '{print $2}' flags.csv | sort -un
Appending the `wc -l` command to the previous one will give you the total number of unique values (all together)
# counts the number of distinct values in the `Continent` column
awk -F: '{print $11}' flags.csv | sort -u | wc -l
Value Counts
This is different from the command above. This command will give you a count of each unique value. E.g. if you want to see how many rows contain "Africa", along with how many contain "Europe", then run the following command
# Continent is the 11th column
awk -F: '{print $11}' flags.csv | sort | uniq -c
An even better way of doing it is by using `cut` like
cut -f11 -d: flags.csv | sort | uniq -c | sort -k1,1nr
This will sort the value counts in descending order for the `Continent` column, i.e. column 11
Join two files
For this purpose, consider two files containing the following text. `file1.txt` contains
Nawazuddin grapes 27
Sardar wasseypur 4
Piyush dhanbad 8
Brad fury 2
DiNero casino 5
Duvall godfather 12
Medson purple 30
Flintstone dude 7
and file2.txt containing
Nawazuddin <pink>
Sardar <blue>
Piyush <orange>
Brad <blue>
DiNero <red>
Duvall <purple>
Medson <blue>
Flintstone <orange>
You can merge the two files like
# delimiter is a space; the join key comes first, then the remaining columns of the first file listed
join file2.txt file1.txt
This will give you the merged output:
Nawazuddin <pink> grapes 27
Sardar <blue> wasseypur 4
Piyush <orange> dhanbad 8
Brad <blue> fury 2
DiNero <red> casino 5
Duvall <purple> godfather 12
Medson <blue> purple 30
Flintstone <orange> dude 7
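One caveat: `join` expects both inputs to be sorted on the join field, and the sample files above are not - GNU `join` will warn about unsorted input. If that happens, sort both files first; a bash sketch using process substitution (note the rows then come out in name order):
join <(sort file2.txt) <(sort file1.txt)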
Preview a large file
less large_file.csv
will load only a section of the file and fetch the rest as you navigate - saving memory for really large files, since it never reads the full file up front.
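A few navigation keys worth knowing once the file is open in `less`: `/pattern` searches forward for a pattern, `n` jumps to the next match, `G` jumps to the end of the file, and `q` quits.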
Now, if you want to view only the rows containing a particular pattern, and page through them
strings large_file.csv | grep some_string | less
Similarly, to view the count of occurrences of `some_string`
strings large_file.csv | grep some_string | wc -l
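Equivalently, using the `-c` flag from earlier (it counts matching lines, just like `wc -l` here):
strings large_file.csv | grep -c some_string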
For this section, prepare a sample file by downloading it with the following command
wget https://www.data36.com/demo1.csv --no-check-certificate
will save a file called demo1.csv
Sort numeric values
Running `sort demo1.csv` will sort the values as if they were text; you need to add the `-n` flag to make it sort them as numbers
# the `u` will only return unique values
sort -nu demo1.csv
Min & Max
So if you had to get the min and max values, you would
# min
sort -nu demo1.csv | head -1
# max
sort -nu demo1.csv | tail -1
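If the file is large, sorting it just to read off the extremes is wasteful; a single `awk` pass (a sketch, assuming one numeric value per line as in `demo1.csv`) finds both at once:
# track the smallest and largest value seen so far
awk 'NR == 1 {min = max = $1} {if ($1 < min) min = $1; if ($1 > max) max = $1} END {print "min:", min, "max:", max}' demo1.csv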