deploy: 7f675a6

florian-huber · Apr 8, 2024 · 543f897
1 parent e9895e3
commit 543f897

Showing 38 changed files with 252 additions and 589 deletions.
diff --git a/_images/04e429cdc80cd7d85036f51d71b23c4d12aadff0cf0ca8553a96162faf25dd0b.png b/_images/04e429cdc80cd7d85036f51d71b23c4d12aadff0cf0ca8553a96162faf25dd0b.png
diff --git a/_images/06024a58864cfe29533c7ec3ed0ccd2739403f748e1c6234f133de9525c2630e.png b/_images/06024a58864cfe29533c7ec3ed0ccd2739403f748e1c6234f133de9525c2630e.png
diff --git a/_images/07a9d4df7d2ab46069624fabefaee733888277541168dd6d9d5d99d76179de32.png b/_images/07a9d4df7d2ab46069624fabefaee733888277541168dd6d9d5d99d76179de32.png
diff --git a/_images/1895d3a0f2f5865bb896d161ffa5fbc19fd5bb1941b14debd0f1ef23cd303713.png b/_images/1895d3a0f2f5865bb896d161ffa5fbc19fd5bb1941b14debd0f1ef23cd303713.png
diff --git a/_images/3c1de44e9360937c6b54552b156fc9d7fd77a3ee22bfe6329e698170e9a206a1.png b/_images/3c1de44e9360937c6b54552b156fc9d7fd77a3ee22bfe6329e698170e9a206a1.png
diff --git a/_images/518cfa668c3f4908dbc2085615e7ca5d2fb99173f397a3ba9a059cc201a88fee.png b/_images/518cfa668c3f4908dbc2085615e7ca5d2fb99173f397a3ba9a059cc201a88fee.png
diff --git a/_images/54f77698a5f139916e40b8ea5cc70a0eb581841b93019391a5286ecb992f2484.png b/_images/54f77698a5f139916e40b8ea5cc70a0eb581841b93019391a5286ecb992f2484.png
diff --git a/_images/5d15928ba6c544377adddc3bb445e920669b0f5bf44c63bd224271f4dec04e17.png b/_images/5d15928ba6c544377adddc3bb445e920669b0f5bf44c63bd224271f4dec04e17.png
diff --git a/_images/6047ed22a9b71c8bbb1ef68f85332795ebd4fcd9bb596c3cf33d35931da27dcb.png b/_images/6047ed22a9b71c8bbb1ef68f85332795ebd4fcd9bb596c3cf33d35931da27dcb.png
diff --git a/_images/640db3ccb20ca9302a9a1d45756a1b790d4fd2e743ae8e488377a2a28741ec80.png b/_images/640db3ccb20ca9302a9a1d45756a1b790d4fd2e743ae8e488377a2a28741ec80.png
diff --git a/_images/6be6e2212df4c4bc380ccbc7608c3f124ba3feb5c2e705b5462285d4a1d91a13.png b/_images/6be6e2212df4c4bc380ccbc7608c3f124ba3feb5c2e705b5462285d4a1d91a13.png
diff --git a/_images/720c46faac27ee4b4cb068c200977fe8726dec79cbab900d124865c0322a2d23.png b/_images/720c46faac27ee4b4cb068c200977fe8726dec79cbab900d124865c0322a2d23.png
diff --git a/_images/83409752890f8e489f897161e5dff7cafe3e847ffdb24524de649f4a5a1036d8.png b/_images/83409752890f8e489f897161e5dff7cafe3e847ffdb24524de649f4a5a1036d8.png
diff --git a/_images/88858083f19b30cb66fc70c614d6b99a3d4d0fad62f3ee13566627c67f58bd53.png b/_images/88858083f19b30cb66fc70c614d6b99a3d4d0fad62f3ee13566627c67f58bd53.png
diff --git a/_images/8d0df88a2b68434ba3ace5c78afb12e8aaec9295fd09244a5ec86a99a819b506.png b/_images/8d0df88a2b68434ba3ace5c78afb12e8aaec9295fd09244a5ec86a99a819b506.png
diff --git a/_images/9529ca07a9b973b895034b1282629df71091bfff0e74dd937f5c63829865a8e2.png b/_images/9529ca07a9b973b895034b1282629df71091bfff0e74dd937f5c63829865a8e2.png
diff --git a/_images/9fde8b4371a82b5ed6d21954f0fbce2b58f6c3dd6c85f4b48472da74b9e6037e.png b/_images/9fde8b4371a82b5ed6d21954f0fbce2b58f6c3dd6c85f4b48472da74b9e6037e.png
diff --git a/_images/b48b14d68eed6ee78969c0dfd9a1898166b4542a0b9469c027109eb3e982f1c6.png b/_images/b48b14d68eed6ee78969c0dfd9a1898166b4542a0b9469c027109eb3e982f1c6.png
diff --git a/_images/fig_data_merging_types.png b/_images/fig_data_merging_types.png
diff --git a/_sources/book/02_data_science_ethics_society.md b/_sources/book/02_data_science_ethics_society.md
@@ -87,15 +87,15 @@ How do you feel now? Sounds like a heavy burden? Sure. But remember the wisdom o
 
 There are many great resources to dig deeper into the topic of ethics and data science. Here are just a few suggestions.
 
-- **"Fairness and Machine Learning: Limitations and Opportunities" by Solon Barocas, Moritz Hardt, and Arvind Narayanan**
+- **"Fairness and Machine Learning: Limitations and Opportunities" by Solon Barocas, Moritz Hardt, and Arvind Narayanan**  
   This accessible online resource (freely available book, but also videos and materials, see [here](https://fairmlbook.org/)) provides an in-depth look at the challenges and opportunities for ensuring fairness in machine learning systems. It's an excellent primer for data scientists interested in developing algorithms that avoid perpetuating biases.
   {cite}`barocas-hardt-narayanan`
 
-- **"Weapons of Math Destruction" by Cathy O'Neil** 
+- **"Weapons of Math Destruction" by Cathy O'Neil**   
   This book provides a critical look at how big data algorithms can increase inequality and threaten democracy. O'Neil explores a variety of case studies where algorithms have had profound negative effects on people's lives, making it an essential read for understanding the societal impacts of data science.
   {cite}`oneil2017weapons`
 
-- **Ethics and Data Science (Mike Loukides, Hilary Mason, and DJ Patil on O'Reilly Media)**
+- **Ethics and Data Science (Mike Loukides, Hilary Mason, and DJ Patil on O'Reilly Media)**  
   A concise, practical handbook that offers a framework for ethical decision making in data science projects. It includes case studies, guidelines, and exercises to help practitioners incorporate ethical considerations into their workflows.
   {cite}`loukides_ethics_2018`
 

diff --git a/_sources/book/04_data_and_types.md b/_sources/book/04_data_and_types.md
@@ -62,7 +62,7 @@ A deeper exploration into data reveals various scales on which it can be measure
 
 #### Big Data
 
-Working in data science there is really no way to avoid dealing with the challenges, the promises, or even the (many) definitions of **Big Data**. Since this is not our core concern in this book, I will simply stick to the very simple definition of saying 
+Working in data science there is really no way to avoid dealing with the challenges, the promises, or even the (many) definitions of **Big Data**. Since this is not our core concern in this book, I will simply stick to the very simple definition, roughly following {cite}`russom2011big`, and say:
 
 > **Big Data** ≈ Data that is too large, too complex, or too volatile to be evaluated using manual and traditional data processing methods.
 

diff --git a/_sources/book/07_data_acquisition_and_preparation.md b/_sources/book/07_data_acquisition_and_preparation.md
@@ -15,10 +15,10 @@ Before plunging into data collection, it's imperative to ask: What's the problem
 
 ### Common Data Sources:
 
-- **The Internet I**: A vast ocean of data. Public datasets are easily accessible and can be sourced from governmental agencies, NGOs, or platforms like Kaggle, Zenodo, and UCI.
+- **The Internet I**: A vast ocean of data. Public datasets are easily accessible and can be sourced from governmental agencies, NGOs, or platforms like [Kaggle](https://www.kaggle.com/), [Zenodo](https://zenodo.org/), and [UCI](https://archive.ics.uci.edu/).
 - **The Internet II**: Web scraping is akin to a treasure hunt, extracting valuable data directly from web pages, given that you respect the legal and ethical boundaries.
 - **Internal Corporate Data**: Picture a gold mine that a company sits upon; these are years' worth of data that can unearth invaluable insights if analyzed correctly.
-- **Academic Data**: Scientifically collected and often meticulously maintained, these datasets can be found accompanying research papers.
+- **Academic Data**: Scientifically collected and often meticulously maintained, these datasets can be found accompanying research papers. In many scientific disciplines it is becoming increasingly common for authors not only to present and discuss their findings, but also to provide the data and/or the code used to derive those findings from the data (often termed Open Science).
 - **Data Upon Request**: Sometimes, just asking can open doors. Organizations might share data if approached correctly and for a worthy cause.
 - **Commercial Datasets**: There are instances when investing in a dataset can prove to be more cost-effective than collecting data from scratch.
 
@@ -66,7 +66,23 @@ Missing data is a common issue many data scientists face. While the gaps can man
 - **Data Types**: Ensuring that numeric values aren't masquerading as strings can prevent potential analytical blunders (e.g., "12.5" instead of 12.5).
 - **Decimal Delimiters**: Confusion between comma and dot can change data meaning, e.g., 12,010 becoming 12.01.
 
-### Further Cleaning Steps:
+## Combining datasets
+
+Unlike in most tutorials or courses, real data science work rarely starts with simply importing a single data file. Often, we receive multiple files with different features and/or datapoints. In such cases, we usually want to combine the required parts of the data. This is a common operation in data science that is sometimes referred to as `merging`, in line with the corresponding SQL operations.
+
+At first, this seems to be a rather simple operation. In practice, however, it is often surprisingly complicated and critical. If merging is not done correctly, we might either lose data or create incorrect entries.
+
+```{figure} ../images/fig_data_merging_types.png
+:name: fig_data_merging01
+
+There are different types of merges. Which one to use is best decided based on the data at hand and the operations we plan to run on the resulting data. Shown here are three of the most common types: inner, left, and outer merges.
+```
+
+{numref}`fig_data_merging01` shows some common merging types. More information on different ways to combine data using pandas can be found in the [pandas documentation on merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).
+
+
+
+## Further Cleaning Steps:
 
 - **Unit Conversion**: Ensuring data is in consistent units.
 - **Data Standardization**: This can be done via Min-Max scaling (often termed "normalization") or, frequently more effective, by ensuring data has a mean of 0 and a standard deviation of 1.
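
The inner, left, and outer merges introduced in the new "Combining datasets" section can be sketched with pandas roughly as follows. This is only a minimal illustration: the two tables, their column names, and their values are made up and do not come from the book. The `indicator=True` option adds a `_merge` column that shows where each row came from, which can help spot lost or unmatched entries.

```python
import pandas as pd

# Hypothetical example tables (names and values are invented for illustration).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Ben", "Cleo"],
})
orders = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "amount": [25.0, 40.0, 15.0],
})

# Inner merge: keep only keys that appear in BOTH tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left merge: keep every row of the left table; unmatched rows get NaN.
left = customers.merge(orders, on="customer_id", how="left")

# Outer merge: keep all keys from both tables.
# indicator=True adds a "_merge" column (left_only / right_only / both).
outer = customers.merge(orders, on="customer_id", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

Comparing the row counts of the three results with those of the input tables is a quick sanity check against silently losing data or creating unexpected entries.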

diff --git a/_sources/book/cover.md b/_sources/book/cover.md
@@ -7,11 +7,11 @@
 Düsseldorf University of Applied Sciences (HSD)  
 & Centre for Digitalization and Digitality (ZDD)
 
-**v0.8** 2024-03-04
+**v0.9** 2024-03-11
 
 **About me:**
 I work as a professor for Data Science and Visual Analytics at the [Düsseldorf University of Applied Sciences](https://www.hs-duesseldorf.de/). This is also where I teach students the basics of data science, Python programming, machine learning, or where I give unsolicited advice on coffee, chocolate, and all other things that really matter in life.
 
-Until I manage to either find or build a more suitable platform, you can also find me on Mastodon: [mastodon.online/@me_datapoint](https://mastodon.online/@me_datapoint) or Twitter/X: [@me_datapoint](https://twitter.com/me_datapoint).
+Until I manage to either find or build a more suitable platform, you can also find me on Mastodon: [mastodon.online/@me_datapoint](https://mastodon.online/@me_datapoint) or (less and less likely...) on Twitter/X: [@me_datapoint](https://twitter.com/me_datapoint).
 
 This book is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).
diff --git a/book/02_data_science_ethics_society.html b/book/02_data_science_ethics_society.html
@@ -514,7 +514,7 @@ <h2><span class="section-number">3.3. </span>Beyond the Job Market: High-Stakes
 <section id="data-science-is-tied-to-ethical-considerations">
 <h2><span class="section-number">3.4. </span>Data Science is tied to ethical considerations<a class="headerlink" href="#data-science-is-tied-to-ethical-considerations" title="Link to this heading">#</a></h2>
 <p>As we journey from the relatively straightforward domain of targeted  advertising to the complex and consequential realms of employment  discrimination and judicial decision-making, it becomes evident that the ethical implications of data science are profound and pervasive. Each  application area brings its own set of ethical challenges, demanding a  nuanced understanding and thoughtful consideration from data scientists.</p>
-<p>It is crucial to understand that, from an ethical standpoint, <strong>data is not neutral</strong> <span id="id7">[<a class="reference internal" href="bibliography.html#id34" title="Catherine D'Ignazio and Lauren F. Klein. Data feminism. &lt;Strong&gt; ideas series. The MIT Press, Cambridge, Massachusetts ; London, England, 2020. ISBN 978-0-262-04400-4.">D'Ignazio and Klein, 2020</a>]</span>, and likewise, <strong>algorithms are not neutral</strong> <span id="id8">[<a class="reference internal" href="bibliography.html#id17" title="Kirsten Martin. Ethical Implications and Accountability of Algorithms. Journal of Business Ethics, 160(4):835–850, December 2019. URL: https://doi.org/10.1007/s10551-018-3921-3 (visited on 2023-06-09), doi:10.1007/s10551-018-3921-3.">Martin, 2019</a>]</span><span id="id9">[<a class="reference internal" href="bibliography.html#id35" title="C. Stinson. Algorithms are not neutral. AI Ethics, 2:763–770, 2022. doi:10.1007/s43681-022-00136-w.">Stinson, 2022</a>]</span><span id="id10">[<a class="reference internal" href="bibliography.html#id10" title="Sina Fazelpour and David Danks. Algorithmic bias: Senses, sources, solutions. Philosophy Compass, 16(8):e12760, 2021. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/phc3.12760. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/phc3.12760 (visited on 2023-06-09), doi:10.1111/phc3.12760.">Fazelpour and Danks, 2021</a>]</span>. As (future) data scientists, we must exercise the utmost care when working with data and algorithms. The examples provided should illustrate the importance of critically reflecting on the intentions behind our actions as data scientists. This reflection should encompass various perspectives to comprehend the “bigger picture”. For instance, Shoshana Zuboff’s “The Age of Surveillance Capitalism” <span id="id11">[<a class="reference internal" href="bibliography.html#id36" title="Shoshana Zuboff. 
The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. Profile Books, 1st edition, 2019. ISBN 9781781256848.">Zuboff, 2019</a>]</span> delves into how corporations like Google leverage users’ personal data, or “behavioral surplus”, to customize advertisements, essentially transforming users into products. Zuboff criticizes this practice, asserting that it undermines individuals’ autonomy and capacity to influence their future. She proposes regulations and corporate accountability to ensure that technology benefits users rather than merely serving large corporations. Hence, the fact that we, as data scientists, <em>can</em> potentially extract maximum information or yield the “best” predictions from data does not mean we <em>should</em> do so.</p>
+<p>It is crucial to understand that, from an ethical standpoint, <strong>data is not neutral</strong> <span id="id7">[<a class="reference internal" href="bibliography.html#id35" title="Catherine D'Ignazio and Lauren F. Klein. Data feminism. &lt;Strong&gt; ideas series. The MIT Press, Cambridge, Massachusetts ; London, England, 2020. ISBN 978-0-262-04400-4.">D'Ignazio and Klein, 2020</a>]</span>, and likewise, <strong>algorithms are not neutral</strong> <span id="id8">[<a class="reference internal" href="bibliography.html#id17" title="Kirsten Martin. Ethical Implications and Accountability of Algorithms. Journal of Business Ethics, 160(4):835–850, December 2019. URL: https://doi.org/10.1007/s10551-018-3921-3 (visited on 2023-06-09), doi:10.1007/s10551-018-3921-3.">Martin, 2019</a>]</span><span id="id9">[<a class="reference internal" href="bibliography.html#id36" title="C. Stinson. Algorithms are not neutral. AI Ethics, 2:763–770, 2022. doi:10.1007/s43681-022-00136-w.">Stinson, 2022</a>]</span><span id="id10">[<a class="reference internal" href="bibliography.html#id10" title="Sina Fazelpour and David Danks. Algorithmic bias: Senses, sources, solutions. Philosophy Compass, 16(8):e12760, 2021. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/phc3.12760. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/phc3.12760 (visited on 2023-06-09), doi:10.1111/phc3.12760.">Fazelpour and Danks, 2021</a>]</span>. As (future) data scientists, we must exercise the utmost care when working with data and algorithms. The examples provided should illustrate the importance of critically reflecting on the intentions behind our actions as data scientists. This reflection should encompass various perspectives to comprehend the “bigger picture”. For instance, Shoshana Zuboff’s “The Age of Surveillance Capitalism” <span id="id11">[<a class="reference internal" href="bibliography.html#id37" title="Shoshana Zuboff. 
The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. Profile Books, 1st edition, 2019. ISBN 9781781256848.">Zuboff, 2019</a>]</span> delves into how corporations like Google leverage users’ personal data, or “behavioral surplus”, to customize advertisements, essentially transforming users into products. Zuboff criticizes this practice, asserting that it undermines individuals’ autonomy and capacity to influence their future. She proposes regulations and corporate accountability to ensure that technology benefits users rather than merely serving large corporations. Hence, the fact that we, as data scientists, <em>can</em> potentially extract maximum information or yield the “best” predictions from data does not mean we <em>should</em> do so.</p>
 <p>A very important aspect in this context is the question of accountability: who is ultimately responsible for decisions taken by (or based on) an algorithm? See, for instance, <span id="id12">[<a class="reference internal" href="bibliography.html#id17" title="Kirsten Martin. Ethical Implications and Accountability of Algorithms. Journal of Business Ethics, 160(4):835–850, December 2019. URL: https://doi.org/10.1007/s10551-018-3921-3 (visited on 2023-06-09), doi:10.1007/s10551-018-3921-3.">Martin, 2019</a>]</span>. Most data scientists are not trained ethicists or lawyers, and they don’t have to be. But I hope that the ethical aspects we briefly touch upon in this chapter make clear that our role as data scientists is not finished when we have working code and a pretty-looking plot. It is not finished when we have trained a machine-learning model that predicts with 99.9% accuracy (what does that mean anyway…). It is not finished when we have successfully published our results or received a pat on the back from upper management. Our core job <strong>includes</strong> the ethical component: we have the duty to reflect on the potential consequences of our work. This means that we should not only learn how to apply many different data science methods, but also learn enough about those methods to judge why we use a certain method and which common pitfalls we need to test for and take care of. Luckily, making this extra effort will also make us technically better data scientists.</p>
 <p>How do you feel now? Sounds like a heavy burden? Sure. But remember the wisdom of Spider-Man:</p>
 <blockquote>
@@ -525,13 +525,13 @@ <h2><span class="section-number">3.4. </span>Data Science is tied to ethical con
 <h2><span class="section-number">3.5. </span>Continue reading!<a class="headerlink" href="#continue-reading" title="Link to this heading">#</a></h2>
 <p>There are many great resources to dig deeper into the topic of ethics and data science. Here are just a few suggestions.</p>
 <ul class="simple">
-<li><p><strong>“Fairness and Machine Learning: Limitations and Opportunities” by Solon Barocas, Moritz Hardt, and Arvind Narayanan</strong>
+<li><p><strong>“Fairness and Machine Learning: Limitations and Opportunities” by Solon Barocas, Moritz Hardt, and Arvind Narayanan</strong><br />
 This accessible online resource (freely available book, but also videos and materials, see <a class="reference external" href="https://fairmlbook.org/">here</a>) provides an in-depth look at the challenges and opportunities for ensuring fairness in machine learning systems. It’s an excellent primer for data scientists interested in developing algorithms that avoid perpetuating biases.
 <span id="id13">[<a class="reference internal" href="bibliography.html#id4" title="Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.">Barocas <em>et al.</em>, 2023</a>]</span></p></li>
-<li><p><strong>“Weapons of Math Destruction” by Cathy O’Neil</strong>
+<li><p><strong>“Weapons of Math Destruction” by Cathy O’Neil</strong><br />
 This book provides a critical look at how big data algorithms can increase inequality and threaten democracy. O’Neil explores a variety of case studies where algorithms have had profound negative effects on people’s lives, making it an essential read for understanding the societal impacts of data science.
 <span id="id14">[<a class="reference internal" href="bibliography.html#id26" title="Cathy O'neil. Weapons of math destruction: How big data increases inequality and threatens democracy. Crown, 2017.">O'neil, 2017</a>]</span></p></li>
-<li><p><strong>Ethics and Data Science (Mike Loukides, Hilary Mason, and DJ Patil on O’Reilly Media)</strong>
+<li><p><strong>Ethics and Data Science (Mike Loukides, Hilary Mason, and DJ Patil on O’Reilly Media)</strong><br />
 A concise, practical handbook that offers a framework for ethical decision making in data science projects. It includes case studies, guidelines, and exercises to help practitioners incorporate ethical considerations into their workflows.
 <span id="id15">[<a class="reference internal" href="bibliography.html#id16" title="Mike Loukides, Hilary Mason, and D. J. Patil. Ethics and Data Science. O'Reilly Media, 1st edition edition, July 2018.">Loukides <em>et al.</em>, 2018</a>]</span></p></li>
 </ul>

diff --git a/book/04_data_and_types.html b/book/04_data_and_types.html
@@ -518,7 +518,7 @@ <h2><span class="section-number">5.2. </span>Data Types<a class="headerlink" hre
 </section>
 <section id="big-data">
 <h2><span class="section-number">5.3. </span>Big Data<a class="headerlink" href="#big-data" title="Link to this heading">#</a></h2>
-<p>Working in data science there is really no way to avoid dealing with the challenges, the promises, or even the (many) definitions of <strong>Big Data</strong>. Since this is not our core concern in this book, I will simply stick to the very simple definition of saying</p>
+<p>Working in data science there is really no way to avoid dealing with the challenges, the promises, or even the (many) definitions of <strong>Big Data</strong>. Since this is not our core concern in this book, I will simply stick to the very simple definition, roughly following <span id="id2">[<a class="reference internal" href="bibliography.html#id32" title="Philip Russom and others. Big data analytics. TDWI best practices report, fourth quarter, 19(4):1–34, 2011.">Russom and others, 2011</a>]</span>, and say:</p>
 <blockquote>
 <div><p><strong>Big Data</strong> ≈ Data that is too large, too complex, or too volatile to be evaluated using manual and traditional data processing methods.</p>
 </div></blockquote>