index.json

[{"content":"SQL is a required skill in nearly all Data Analytics and Data Science jobs, and one of the technical assessments in some of those job interviews is to write SQL queries to complete given data tasks. In this blog, I am compiling some common SQL techniques based on my learning and working experience. This is Part 1 of a series, which focuses on the major statements/operations/commands for data manipulation in SQL; the 2nd part will focus more on specific details of manipulating different types of variables.\n1. Select #\rSELECT is the most basic command in SQL; you use it to select the columns that you want from a table. If you want to select all columns from the table, you can use SELECT *, but this is not recommended because of the memory cost and various other reasons. Check this\rblog to learn more about why you should avoid select * when not necessary. Basic query format for the select command:\nselect col1, col2 -- use \u0026#39;select DISTINCT\u0026#39; if you want to drop duplicates from data_base.table_name 2. Join #\rJoin is used when you want to pull information from another table for the data in a given table, based on one or more related columns. A join happens between two tables (a left table and a right table, which can be the same table in a self join), and both tables need to have related column(s) to be joined. 
There are different types of joins.\nleft join is used when you want to keep all the data/rows in the left table, whether or not the information you want to extract from the right table is available for those rows.\nOpposite to left join, right join will keep all the data/rows in the right table.\ninner join will keep only the matched data - data/rows that exist in both left and right tables.\nfull outer join will keep all the rows from both left and right tables - you should expect the number of rows in the resulting table to be greater than or equal to the number of rows in either the left or the right table. Basic query format for join:\nselect l.col1, r.col2 from data_base.left_table l left join data_base.right_table r -- can change left join to right join, inner join or full outer join on l.id = r.id and l.name=r.name --here we are joining two tables based on id and name columns In the above query example, the join condition is an equality. We can also join two tables with inequality condition(s) like the following:\nselect l.col1, r.col2 from data_base.left_table l left join data_base.right_table r -- can change left join to right join, inner join or full outer join on l.start_date \u0026gt; r.start_date -- here the \u0026gt; can be other inequality operations such as \u0026lt;\u0026gt;, \u0026lt;, between -- here we are joining two tables based on the start_date columns Note that an inequality condition still goes in the ON clause; a WHERE clause filters the joined result rather than defining the join. If the join condition is omitted entirely (for example, with CROSS JOIN or a comma-separated list of tables without a WHERE clause), the result is a Cartesian product. This means the query will return every combination of a row from the left table with a row from the right table, which you should probably avoid.\nHere we will skip the details of self join and cross join since they are less common.\n3. Union #\rIf join combines data horizontally, then union concatenates/appends data vertically. 
Again, a union happens between two tables (left and right, which can be the same table). With UNION ALL, the number of rows in the resulting table equals the sum of the rows in the left and right tables; with UNION, duplicates are removed, so the result can be smaller. Basic query format:\nselect col1, col2 from data_base.left_table UNION -- or UNION ALL select col3, col4 from data_base.right_table A few things to note here:\nthe number of selected columns from the left table needs to be the same as that from the right table; the names of those columns can be different (the resulting table will keep the names from the left table) but the data types of the corresponding columns need to be compatible. UNION will remove duplicate rows during the operation UNION ALL will not remove duplicate rows 4. Aggregation and grouping #\rIn SQL, we can use the GROUP BY clause to perform aggregation over the grouped column(s) - whenever you select columns that are not being aggregated and perform some aggregation at the same time, you have to use GROUP BY and specify those non-aggregated columns after the GROUP BY (as grouping columns). Common aggregation functions include:\ncount(), sum() avg(), min(), max() first_value(), last_value() Basic query format for aggregation and grouping\nselect col1, col2, sum(col3) as col3_sum from database.table_name group by 1, 2 -- here 1,2 refers to col1, col2 for simplicity; you can also use group by col1, col2 order by 1, 2 -- ordering the resulting table by col1 and col2, order by should be used after group by Some things to be aware of when you use the COUNT() aggregation function\nselect count(1) will not exclude null values because 1 is a non-null expression count(*) will not exclude null values, it will return the same result as count(1) count(specified_col) will exclude null values count(distinct specified_col) will exclude null values and duplicates, counting only the unique non-null values 5. Case statement #\rCase statement is a very commonly used statement in real life. 
It creates a new column based on conditions on existing columns in the table. This statement is very easy to use and the basic query format is the following:\n/*Example 1*/ select col1, -- existing column case when existing_col1 = 1 then 1 when existing_col1 = 2 then 2 else 3 end as new_col2 -- new_col2 will have 3 values: 1,2,3 from data_base.table_name /*Example 2*/ select count(case when a=1 then a end) as a1_count -- here the count() will count the occurrences of 1 in column a 6. WHERE \u0026amp; HAVING #\rBoth WHERE and HAVING are used for filtering with specified conditions. HAVING filters on aggregated values after grouping, while WHERE filters rows before grouping and cannot reference aggregates.\n/*WHERE*/ select col1, col2 from database.table_name where col_A between a1 and a2 --between is inclusive and col_B is not NULL and col_C in (1,2,3) -- NOTE: no aggregation in this query, so we use where for filtering /*HAVING*/ select col1, count(col2) as col2_cnt from database.table_name group by col1 having count(col2) \u0026gt;4 --order by -- optional sorting 7. Subqueries \u0026amp; CTEs #\rCTE here refers to Common Table Expressions. Both subqueries and CTEs are used to create an intermediate query result/table. A CTE is always defined with a WITH clause. 
Sample queries are as follows:\n/*CTE*/ with cte_tbl1 as ( select col1, col2 from tbl1 where col2 \u0026gt; 2 ), -- this comma is necessary cte_tbl2 as ( select col3, col4 from tbl2 where col3 = 4 ) -- no comma after the last CTE select * from cte_tbl1 union select * from cte_tbl2; /*Subquery*/ -- example 1 select * from (select col1, col2 from tbl1 where col2 \u0026gt; 2) t1 union select * from (select col3, col4 from tbl2 where col3 = 4) t2; -- example 2 select col1, col2 from tbl1 where col3 \u0026gt; (select avg(col3) from tbl1) As we can see, both CTEs and subqueries can be used to perform the same tasks, but there are still some differences between them.\nA CTE is defined at the front of the query while a subquery is used inline (wherever needed) A CTE must be named while a subquery does not always need a name (some databases require an alias for a subquery in FROM) A CTE is more readable/cleaner than a subquery, especially in a big query A CTE can be used many times in the query while a subquery can only be used where you defined it - a CTE is much better in queries where you need to use the same intermediate data multiple times A subquery can be written directly inside a WHERE clause while a CTE definition cannot (see example 2 above) 8. Window function #\rThe window function is probably one of the most advanced techniques in SQL. A window function\nwill not aggregate the data into a smaller table even though it is sometimes used with an aggregation function; instead it will keep all the rows and give each row the (aggregated) summary value/statistic. is always used with an OVER clause, commonly with PARTITION BY inside it Other than the typical aggregate functions, a few common SQL tasks that usually use window functions:\nRanking row_number() simply returns the row number rank() will skip some ranks if there is a tie. 
dense_rank() does not skip any ranks if there is a tie Extract prior value or next value lead() to get the next value lag() to get the lagged/prior value /*Example 1*/ select col_a, ROW_NUMBER() over (partition BY col_a order BY col_b) as new_col -- get the row number by col_b order within each group of col_a from table_name /*Example 2*/ select lag(col_c) over (partition BY col_a order BY col_b) as new_col -- get the prior value of col_c (ordered by col_b) within each group of col_a from table_name /*Example 3*/ select sum(col_c) over (partition BY col_a order BY col_b), --running/rolling sum of col_c (ordered by col_b) within each group of col_a avg(col_c) over (partition BY col_a) --average of col_c within each group of col_a from table_name References #\rWindow function: window function in sql\rsql window functions\rPractice problems sql interview questions you must prepare - the ultimate guide\rtop 30 sql query interview questions\r","date":"22 March 2023","permalink":"/posts/sql1/","section":"Blog posts","summary":"SQL is a required skill in nearly all Data Analytics and Data Science jobs, and one of the technical assessments in some of those job interviews is to write SQL queries to complete given data tasks.","title":"Common SQL techniques Part 1"},{"content":"Python is a popular programming language and often times we need to import external libraries to help us perform various tasks. However, Python and those external libraries have many different versions; sometimes one Python version may not support the libraries that we need, or one of our projects requires a different version of an external library than another. 
When we only have one Python installation on our computer and only that one place to install our packages, we cannot perform all the tasks that have different version requirements for Python or the packages.\nOne way to handle this is to create a virtual environment, where the Python (interpreter), libraries and scripts installed into it are isolated from those installed in other virtual environments. pyenv can help us install different versions of python and easily switch between them; pyenv-virtualenv is a pyenv plugin that provides features to manage virtualenvs (and conda environments) for Python on UNIX-like systems.\nNote that there are other ways to create virtual environments; this is just one way.\nStep 1: Install pyenv and pyenv-virtualenv #\rFor MacOS users, you can install pyenv (and pyenv-virtualenv) with Homebrew. If you do not have Homebrew on your computer yet, run the following command in your terminal to install Homebrew first (check here\rfor more information).\n/bin/bash -c \u0026#34;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\u0026#34; After that, install pyenv by running the following commands in your terminal:\nbrew update brew install pyenv Then install pyenv-virtualenv by running either of the following 2 commands:\nbrew install pyenv-virtualenv brew install --HEAD pyenv-virtualenv # install the latest development version Step 2: Set up shell environment for Pyenv #\rDepending on your shell type, you need to set up the shell environment differently. To check what shell type you have on your computer, open your terminal and it will show up on the top part of the terminal window. 
The following setup should work for the majority of users for common use cases(see below).\nFor Zsh echo \u0026#39;export PYENV_ROOT=\u0026#34;$HOME/.pyenv\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.zshrc echo \u0026#39;command -v pyenv \u0026gt;/dev/null || export PATH=\u0026#34;$PYENV_ROOT/bin:$PATH\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.zshrc echo \u0026#39;eval \u0026#34;$(pyenv init -)\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.zshrc For bash: # add the commands to ~/.bashrc echo \u0026#39;export PYENV_ROOT=\u0026#34;$HOME/.pyenv\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.bashrc echo \u0026#39;command -v pyenv \u0026gt;/dev/null || export PATH=\u0026#34;$PYENV_ROOT/bin:$PATH\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.bashrc echo \u0026#39;eval \u0026#34;$(pyenv init -)\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.bashrc # add the commands to ~/.profile echo \u0026#39;export PYENV_ROOT=\u0026#34;$HOME/.pyenv\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.profile echo \u0026#39;command -v pyenv \u0026gt;/dev/null || export PATH=\u0026#34;$PYENV_ROOT/bin:$PATH\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.profile echo \u0026#39;eval \u0026#34;$(pyenv init -)\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.profile # add the commands to ~/.bash_profile echo \u0026#39;export PYENV_ROOT=\u0026#34;$HOME/.pyenv\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.bash_profile echo \u0026#39;command -v pyenv \u0026gt;/dev/null || export PATH=\u0026#34;$PYENV_ROOT/bin:$PATH\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.bash_profile echo \u0026#39;eval \u0026#34;$(pyenv init -)\u0026#34;\u0026#39; \u0026gt;\u0026gt; ~/.bash_profile In order to make pyenv-virtualenv work, you need to further add the following:\neval \u0026#34;$(pyenv virtualenv-init -)\u0026#34; After the shell setup steps, remember to restart the shell for the changes to take effect by running\nexec \u0026#34;$SHELL\u0026#34; Step 3: Use pyenv and pyenv-virtualenv #\r1. 
pyenv usage #\rBefore installing any new python version with pyenv, first install the Python build dependencies following the instructions here\runder the Suggested build environment section.\nBy simply typing pyenv in the terminal, you will get a preview of some useful pyenv commands there. The following are some common commands.\nInstall/uninstall additional python version\npyenv install \u0026lt;version\u0026gt; # install e.g. pyenv install 3.7.14 pyenv uninstall \u0026lt;version\u0026gt; # uninstall You can check all the available versions by running the following command:\npyenv install --list | grep \u0026#34; 3\\.[67891]\u0026#34; # This will list the available python versions 3.6 and above (the pattern also matches any 3.1.x entries) Check Python versions installed under pyenv\nthe one with the * in front is the currently active python version. Note that if you have conda installed, the python version under conda environment will not show here.\npyenv versions Switch versions/deactivate pyenv\npyenv shell \u0026lt;version\u0026gt; # select version for current shell session pyenv local \u0026lt;version\u0026gt; # Set default version for the current directory (or its subdirectories) pyenv global \u0026lt;version\u0026gt; # set global version for user account pyenv shell --unset # unset the version selected for the current shell session 2. pyenv-virtualenv usage #\rCreate a virtual environment with specific python version\npyenv virtualenv \u0026lt;python_version\u0026gt; \u0026lt;venv_example\u0026gt; # e.g. 
pyenv virtualenv 3.10.4 venv_test Install packages in the created virtual environment\npyenv local \u0026lt;venv_example\u0026gt; # first select the virtual environment for the current directory pip install \u0026lt;package\u0026gt; # install packages using pip install Use the virtual environment\nThere are multiple ways to use the virtual environment.\nIf you want to automatically activate the virtual environment for a project folder, you can either\ncreate a .python-version file in the project folder by running the following command in the terminal: echo \u0026lt;venv_example\u0026gt; \u0026gt; \u0026lt;project_folder_path\u0026gt;/.python-version # e.g. echo venv_test \u0026gt; \u0026#39;/Users/yanhe/Desktop/test-env\u0026#39;/.python-version or\ngo to the project from a terminal, and run: pyenv local \u0026lt;venv_example\u0026gt; Then every time you enter the project folder, the virtual environment will be automatically activated, and your code under that project folder will automatically run via the virtual environment.\nIf you want to (permanently) change the (default) virtual environment for the project folder, you can achieve that by running:\npyenv local \u0026lt;new_venv\u0026gt; Note that\nif you use the virtual environment by creating the .python-version file, you will notice that the .python-version file also changes automatically whenever you change the default virtual environment; it always corresponds to the name of the virtual environment that you set. You need to re-enter the project folder every time after you run pyenv local in order to make the new virtual environment take effect. 
You can always check the (default/permanent) virtual environment for your project folder by opening a terminal under that folder and running:\npyenv local If you are working in VS Code\ryou can activate a virtual environment by pressing control+shift+P and selecting Python: Select Interpreter, refreshing by clicking the refresh button on the top right (this should give you all available choices on your computer) and choosing the virtual environment that you want; this will not change the default virtual environment. If you want to temporarily switch to a different virtual environment in the current terminal, you can open a terminal and run the following command:\npyenv shell \u0026lt;new_venv\u0026gt; Collaborating with others #\rThe best part about working with virtual environments is how easy it becomes to work with others on projects. All you need to do is let them know what python version you are using and give them a snapshot of your environment using pip freeze \u0026gt; requirements.txt. Now they are able to recreate your environment on their laptop without changing their global settings.\nIt is good practice to have a base environment as your starting shell and then create a new environment for each project you work on.\nHave fun with your Python virtual environments :)\nSome other useful links:\nhttps://github.com/pyenv/pyenv#set-up-your-shell-environment-for-pyenv\rhttps://github.com/pyenv/pyenv-virtualenv\r","date":"18 October 2022","permalink":"/posts/pyenv/","section":"Blog posts","summary":"Python is a popular programming language and often times we need to import external libraries to help us perform various tasks.","title":"Virtual Environment with pyenv"},{"content":"google-site-verification: google02ca83301041d850.html","date":"1 January 0001","permalink":"/google02ca83301041d850/","section":"Welcome to My Homepage! 🎉","summary":"google-site-verification: google02ca83301041d850.","title":""}]