Changes from all commits (50 commits)
aeaa80f
fix(box_recursion): fix expected code (#3027)
sadiqui Aug 28, 2025
2582638
docs(rust-piscine): fix small typos (#3018)
sadiqui Aug 28, 2025
543cba7
refactor(rust-piscine): fix highlights in notions links (#3015)
sadiqui Aug 28, 2025
441fb88
[EXTERNAL] fix(numpy): clarify CSV header exclusion in instructions (…
vpollo11 Aug 28, 2025
281b62b
docs(printifnot): less ambiguous instructions (#2996)
sadiqui Aug 28, 2025
b51859b
fix(ci/cd): bump `checkout`, `login-action` & `markdown-link-checker`…
HarryVasanth Aug 29, 2025
25232a8
fix(sh/tests): bump `debian:stable-slim`
HarryVasanth Aug 29, 2025
d69f6cb
fix(js/tests): bump `alpine:3`
HarryVasanth Aug 29, 2025
5742b16
[EXTERNAL] fix(keras-2): correct typo in README (no → not and go → go…
vpollo11 Aug 31, 2025
977e708
[EXTERNAL] docs(keras2): use categorical_crossentropy for single-labe…
vpollo11 Aug 31, 2025
53cb846
fix(rust-piscine): correct `banner` and `drawing` subjects (#3016)
madaghaxx Sep 2, 2025
00b9342
fix(`rust-piscine`): fix checkpoint subjects (#3024)
sadiqui Sep 2, 2025
14a2f26
fix(markdown-link-checker): run on latest
HarryVasanth Sep 2, 2025
ebfceba
fix(markdown-link-checker): run on main
HarryVasanth Sep 2, 2025
a71ed85
remove allow dead_code (#3032)
lfarssi Sep 3, 2025
fe131e5
docs(drawing/audit): fix small typo, correct verb form (#3014)
sadiqui Sep 3, 2025
0bf16c5
fix(`chaikin`): update broken link + grammar correction (#3021)
sadiqui Sep 3, 2025
cab37c4
docs: sync java checkpoint with module repo
Sep 19, 2025
1a1390c
docs: sync the java checkpoint with the module
Sep 22, 2025
b487e4d
fix(rust): reflect changes on module (#3063)
pedrodesu Sep 22, 2025
4930a26
CON-3602 Fix `quest-08` (#3009)
pedrodesu Sep 23, 2025
e249dc8
docs(day-of-week): fix typo in the ExerciseRunner
Sep 25, 2025
2857bb0
fix(rust-checkpoints): improved remaining checkpoints (#3068)
pedrodesu Oct 2, 2025
e4cdefa
CON-3676 Fix `quest-09` (#3070)
pedrodesu Oct 3, 2025
d3ea787
[EXTERNAL] feat(backtesting-sp500): Add resources part for the subject
vpollo11 Oct 4, 2025
2633928
chore: fix prettier fmt
vpollo11 Oct 5, 2025
7af796c
[EXTERNAL] docs: fix and update resource links in documentation
vpollo11 Oct 4, 2025
47d5d41
chore: fix prettier fmt
vpollo11 Oct 5, 2025
c457661
[EXTERNAL] docs (classification): fix and update resource links in do…
vpollo11 Oct 4, 2025
42ca8ea
Updates README.md
vpollo11 Oct 4, 2025
f7c8afb
[EXTERNAL] docs (data-wrangling): fix and update resource links in do…
vpollo11 Oct 4, 2025
7548fdc
Updates README.md
vpollo11 Oct 4, 2025
4da70b1
[EXTERNAL] docs (document-categorization): replace timeline part with…
vpollo11 Oct 4, 2025
a641a0e
chore: fix prettier fmt
vpollo11 Oct 5, 2025
439c5ca
[EXTERNAL] docs (emotions-detector): fix and update resource links in…
vpollo11 Oct 4, 2025
811e00d
Updates README.md
vpollo11 Oct 4, 2025
489b15a
chore: fix prettier fmt
vpollo11 Oct 5, 2025
ad89747
[EXTERNAL] docs (matrix-factorization): fix and update resource links…
vpollo11 Oct 5, 2025
b0d72d9
Update README.md
vpollo11 Oct 5, 2025
c90dbff
[EXTERNAL] docs (nlp-spicy): fix and update resource links in documen…
vpollo11 Oct 5, 2025
c78cc4f
[EXTERNAL] docs (pipeline): fix and update resource links in document…
vpollo11 Oct 6, 2025
3d7f9ea
[EXTERNAL] docs (pipeline): fix and update resource links in document…
vpollo11 Oct 6, 2025
4df2a4c
Update README.md
vpollo11 Oct 6, 2025
9c81543
[EXTERNAL] docs (sp500): fix and update resource links in documentation
vpollo11 Oct 6, 2025
7e513dd
[EXTERNAL] docs (time-series): fix and update resource links in docum…
vpollo11 Oct 6, 2025
68d07be
chore: fix prettier fmt
vpollo11 Oct 6, 2025
cdece5b
docs(vision-track): remove timeline part from the subject
vpollo11 Oct 6, 2025
a262e5f
chore(visualizations): fix prettier fmt
vpollo11 Oct 6, 2025
8ed3b06
[EXTERNAL] fix(pandas): replace broken "Ultimate Pandas Resource" lin…
vpollo11 Sep 22, 2025
4565a56
Update README.md
vpollo11 Oct 6, 2025
6 changes: 3 additions & 3 deletions .github/workflows/ga-image-build-branch.yml
@@ -14,17 +14,17 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v3
+uses: actions/checkout@v5

- name: 📦 Login to GitHub Container Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: 🐳 Login to docker.01-edu.org Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: docker.01-edu.org
username: ${{ secrets.USER_DOCKER_01EDU_ORG }}
6 changes: 3 additions & 3 deletions .github/workflows/ga-image-build-master.yml
@@ -11,17 +11,17 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v3
+uses: actions/checkout@v5

- name: 📦 Login to GitHub Container Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: 🐳 Login to docker.01-edu.org Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: docker.01-edu.org
username: ${{ secrets.USER_DOCKER_01EDU_ORG }}
2 changes: 1 addition & 1 deletion .github/workflows/ga-misc-check-compliance.yml
@@ -21,7 +21,7 @@ jobs:

steps:
- name: 🐧 Check out repository code
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

4 changes: 2 additions & 2 deletions .github/workflows/ga-misc-check-links.yml
@@ -16,7 +16,7 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

@@ -27,6 +27,6 @@ jobs:

- name: 🔗 Run Check Links
if: steps.changed-md.outputs.changed_files != ''
-uses: harryvasanth/markdown-link-checker@v1.2
+uses: harryvasanth/markdown-link-checker@main
with:
files: ${{ steps.changed-md.outputs.changed_files }}
2 changes: 1 addition & 1 deletion .github/workflows/ga-misc-check-prettier.yml
@@ -13,7 +13,7 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

2 changes: 1 addition & 1 deletion .github/workflows/ga-misc-check-shellcheck.yml
@@ -14,7 +14,7 @@ jobs:

steps:
- name: 🐧 Check out repository code
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

2 changes: 1 addition & 1 deletion js/tests/Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.01-edu.org/alpine:3.17.0
+FROM alpine:3

# Installs latest Chromium package.
RUN apk add --no-cache \
2 changes: 1 addition & 1 deletion sh/tests/Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.01-edu.org/debian:10-slim
+FROM debian:stable-slim

RUN apt-get update
RUN apt-get -y install jq curl tree apt-utils
33 changes: 19 additions & 14 deletions subjects/ai/backtesting-sp500/README.md
@@ -36,7 +36,6 @@ The input files are:
data.

The adjusted close price may be unavailable for three main reasons:

- The company doesn't exist at date `d`
- The company is not publicly traded
- Its close price hasn't been reported
@@ -68,7 +67,6 @@ There are four parts:
#### 2. Data wrangling and preprocessing

- Create a Jupyter Notebook to analyze the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:

- Missing values analysis
- Outliers analysis (there are a lot of outliers)
- Visualize and analyze the average price for companies over time or compare the price consistency across different companies within the dataset. Save the plot as an image.
@@ -77,11 +75,9 @@ There are four parts:
_Note: create functions that generate the plots and save them in the `images` directory. Add a `plot` parameter with a default value of `False` so the plot isn't returned by default. This will be useful for the correction, letting people run your code without overwriting your plots._

- Here is how the `prices` data should be preprocessed:

- Resample data on month and keep the last value
- Filter price outliers: remove prices outside the range [$0.1, $10k]
- Compute monthly returns:

- Historical returns. **returns(current month) = (price(current month) - price(previous month)) / price(previous month)**
- Future returns. **returns(current month) = (price(next month) - price(current month)) / price(current month)**
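
Putting these preprocessing steps together, here is a minimal pandas sketch (the toy data and variable names are illustrative, not prescribed by the subject):

```python
import numpy as np
import pandas as pd

# Toy daily prices: a DatetimeIndex with one column per company.
dates = pd.date_range("2020-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {
        "AAA": 100 + rng.normal(0, 1, len(dates)).cumsum(),
        "BBB": 50 + rng.normal(0, 1, len(dates)).cumsum(),
    },
    index=dates,
).astype(np.float32)  # a smaller float type reduces memory usage

# Resample on month and keep the last value ("ME" on recent pandas versions).
monthly = prices.resample("M").last()

# Filter price outliers: mask values outside the [$0.1, $10k] range.
monthly = monthly.where((monthly > 0.1) & (monthly < 10_000))

# Historical returns: (price(t) - price(t-1)) / price(t-1)
historical = monthly.pct_change()

# Future returns: (price(t+1) - price(t)) / price(t)
future = monthly.pct_change().shift(-1)

# The same resample / pct_change steps apply to the sp500 adjusted close.
```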

@@ -102,7 +98,6 @@ At this stage the DataFrame should look like this:
- Print `prices.isna().sum()`

- Here is how the `sp500.csv` data should be preprocessed:

- Resample data on month and keep the last value
- Compute historical monthly returns on the adjusted close

@@ -183,47 +178,38 @@ project
### Tips:

1. Data Quality Management:

- Be prepared to encounter messy data. Financial datasets often contain errors, outliers, and missing values.
- Develop a systematic approach to identify and handle data quality issues.

2. Memory Optimization:

- When working with large datasets, optimize memory usage by selecting appropriate data types for each column.
- Consider using smaller data types like np.float32 for floating-point numbers when precision allows.

3. Exploratory Data Analysis:

- Spend time understanding the data through visualization and statistical analysis before diving into strategy development.
- Pay special attention to outliers and their potential impact on your strategy.

4. Preprocessing Financial Data:

- When resampling time series data, be mindful of which value to keep (e.g., last value for month-end prices).
- Calculate both historical and future returns to avoid look-ahead bias in your strategy.

5. Handling Outliers:

- Develop a method to identify and handle outliers that is specific to each company's historical data.
- Be cautious about removing outliers during periods of high market volatility (e.g., 2008-2009 financial crisis).

6. Signal Creation:

- Start with a simple signal (like past 12-month average returns) before exploring more complex strategies.
- Ensure your signal doesn't use future information that wouldn't have been available at the time of decision.

7. Backtesting:

- Implement your backtesting logic without loops, relying on vectorized pandas operations for better performance (a sketch follows these tips).
- Compare your strategy's performance against a relevant benchmark (in this case, the S&P 500).

8. Visualization:

- Create clear, informative visualizations to communicate your strategy's performance.
- Include cumulative return plots to show how your strategy performs over time compared to the benchmark.

9. Code Structure:

- Organize your code into modular functions for better readability and reusability.
- Use a main script to orchestrate the entire process from data loading to results visualization.

@@ -232,3 +218,22 @@ project
- Be prepared to explain any anomalies or unexpected results in your strategy's performance.

Remember, the goal is not just to create a strategy that looks good on paper, but to develop a robust process for analyzing financial data and testing investment ideas.
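
As mentioned in tip 7, the backtest itself can be written without loops. A minimal vectorized sketch follows; the toy data, the 12-month-average signal, and the 20-stock portfolio size are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy monthly returns: rows are month-ends, columns are companies.
rng = np.random.default_rng(1)
dates = pd.date_range("2015-01-31", periods=60, freq="M")
tickers = [f"C{i:02d}" for i in range(50)]
historical = pd.DataFrame(
    rng.normal(0.01, 0.05, (len(dates), len(tickers))),
    index=dates, columns=tickers,
)
future = historical.shift(-1)  # next month's return, avoids look-ahead bias

# Signal: past 12-month average return, computed for all companies at once.
signal = historical.rolling(12).mean()

# Each month, hold the 20 best-ranked companies with equal weights; no loops.
top20 = signal.rank(axis=1, ascending=False) <= 20
strategy = future.where(top20).mean(axis=1)

# Cumulative return series, ready to plot against the S&P 500 benchmark.
cumulative = (1 + strategy.fillna(0)).cumprod()
print(cumulative.tail())
```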

+### Resources

+- **Python & Data Analysis**
+- [pandas Documentation](https://pandas.pydata.org/docs/) – handling time series, resampling, returns.
+- [NumPy Documentation](https://numpy.org/doc/) – vectorized operations and memory optimization.
+- [Matplotlib Documentation](https://matplotlib.org/stable/index.html) – plotting cumulative returns and EDA visuals.

+- **Finance & Backtesting**
+- [Investopedia – Backtesting](https://www.investopedia.com/terms/b/backtesting.asp) – introduction to strategy testing.
+- [Corporate Finance Institute – What is Backtesting?](https://corporatefinanceinstitute.com/resources/data-science/backtesting/#:~:text=Backtesting%20involves%20applying%20a%20strategy,employ%20and%20tweak%20successful%20strategies.) – practical overview of backtesting logic.
+- [S&P 500 Index (Wikipedia)](https://en.wikipedia.org/wiki/S%26P_500) – background on the index and its historical changes.

+- **Data Cleaning & Outliers**
+- [Handling Missing Data in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

+- **Quantitative Strategies**
+- [Momentum Investing (Investopedia)](https://www.investopedia.com/terms/m/momentum_investing.asp) – theory behind using past returns as a signal.
+- [Risk-Adjusted Return (Investopedia)](https://www.investopedia.com/terms/r/riskadjustedreturn.asp) – understanding risk-adjusted performance.
8 changes: 3 additions & 5 deletions subjects/ai/classification/README.md
@@ -351,12 +351,10 @@ def predict_one_vs_all(X, clf0, clf1, clf2):

### Resources

-- [Logistic regression](https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102)
+- [Logistic regression](https://www.ibm.com/think/topics/logistic-regression)

-- [Logloss](https://www.datacamp.com/tutorial/the-cross-entropy-loss-function-in-machine-learning)
+- [Logloss](https://www.geeksforgeeks.org/machine-learning/what-is-cross-entropy-loss-function/)

-- [More on logistic regression](https://medium.com/swlh/what-is-logistic-regression-62807de62efa)
+- [More on logistic regression](https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf)

- [Logistic regression 1](https://www.kaggle.com/code/rahulrajpandey31/logistic-regression-from-scratch-iris-data-set)

- [Logistic regression 2](https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a)
8 changes: 3 additions & 5 deletions subjects/ai/credit-scoring/README.md
@@ -24,21 +24,19 @@ There are 3 expected deliverables associated with the scoring model:

- An exploratory data analysis notebook that describes the insights you find out in the data set.
- The trained machine learning model with the features engineering pipeline:

- Do not forget: **Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.**
- The model is validated if the **AUC on the test set is at least 55%, ideally up to 62% (or, in the best cases, above 62%)**.
- The labelled test data is not publicly available. However, a Kaggle competition uses the same data. The procedure to evaluate test set submissions is the same as the one used for project 1.
- Here are the [DataSets](https://assets.01-edu.org/ai-branch/project5/home-credit-default-risk.zip).

- A report on model training and evaluation:

- Include learning curves (training and validation scores vs. training set size or epochs) to demonstrate that the model is not overfitting.
- Explain the measures taken to prevent overfitting, such as early stopping or regularization techniques.
- Justify your choice of when to stop training based on the learning curves (a minimal sketch follows this list).
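
A hedged sketch of producing such learning curves with scikit-learn; the toy data and estimator stand in for your own features and model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Toy stand-ins for the real features, labels, and scoring model.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(random_state=0),
    X, y,
    scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
)

# Converging train/validation AUC as the training set grows is evidence
# that the model is not overfitting; plot these curves for the report.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```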

#### Kaggle submission

-The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations.
+The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://www.kaggle.com/datasets/parisrohan/credit-score-classification) that gives detailed explanations.

- Create a username following this structure: `username_01EDU_location_MM_YYYY`. Submit the profile description and push it to the Git platform on the first day of the week. Do not touch this file anymore.

@@ -55,7 +53,7 @@ There are different levels of transparency:
- **Global**: understand the important variables in a model. This answers the question: "What are the key variables for the model?". In that case it will tell, for example, whether revenue is more important to the model than age. This allows checking that the model relies on meaningful variables. No one wants their credit refused because of the weather in Lisbon!
- **Local**: each observation gets its own set of interpretability factors. This greatly increases transparency. We can explain why a case receives its prediction and the contributions of the predictors. Traditional variable importance algorithms only show results across the entire population, not for each individual case. Local interpretability enables us to pinpoint and contrast the impacts of the factors.

-There are 2 tools you can use to analyse your model and its predictions: - Features importance (available if you use a Scikit Learn model) - [SHAP library](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d)
+There are 2 tools you can use to analyse your model and its predictions: - Features importance (available if you use a Scikit Learn model) - [SHAP library](https://shap.readthedocs.io/en/latest/)
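
A minimal sketch of both levels with the SHAP library, using a toy model as a stand-in for your trained scoring model:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-ins for the real training data and credit-scoring model.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one row of contributions per sample

# Local: the per-feature contributions behind customer 0's prediction.
print(shap_values[0])

# Global: mean absolute contribution of each feature across the population.
print(np.abs(shap_values).mean(axis=0))
```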

Implement a program that takes as input the trained model, the customer id ... and returns:

@@ -121,4 +119,4 @@ Remember, creating a great credit scoring model is like baking a perfect cake -

### Resources

-- [Interpreting machine learning models](https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f)
+- [Interpreting machine learning models](https://neptune.ai/blog/ml-model-interpretation-tools)
2 changes: 1 addition & 1 deletion subjects/ai/data-wrangling/README.md
@@ -309,4 +309,4 @@ The first 3 rows of the DataFrame should look like this:

- [Pandas tutorial](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)

-- [Pandas iteration](https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe)
+- [Pandas iteration](https://www.geeksforgeeks.org/pandas/different-ways-to-iterate-over-rows-in-pandas-dataframe/)
52 changes: 36 additions & 16 deletions subjects/ai/document-categorization/README.md
@@ -22,7 +22,6 @@ The project aims to develop skills in:
#### Data Loading and Preprocessing

1. **Dataset Preparation**:

- Load a dataset containing various document types across multiple categories and languages.
- Preprocess the data, including text normalization, tokenization, and handling multi-language support.

@@ -33,12 +32,10 @@ The project aims to develop skills in:
#### Model Development

1. **Text Classification Model**:

- Implement a **text classification model** using **TensorFlow** or **Keras**, starting with a baseline architecture.
- Use **transfer learning** to enhance the model’s domain adaptability, incorporating pre-trained language models such as **BERT** or **DistilBERT** (a baseline sketch follows the next item).

2. **Tagging with NLP Libraries**:

- Leverage **SpaCy** to develop an intelligent tagging system that can assign tags based on the document's content and context.
- Ensure the tagging system supports multi-language functionality, utilizing language models for effective tagging in different languages.
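
A minimal zero-shot baseline is sketched below. The checkpoint name and the `pipeline` call are standard Hugging Face usage, but treat the whole block as an illustration to be replaced by a DistilBERT model fine-tuned on your labelled categories:

```python
from transformers import pipeline

# Zero-shot baseline: an NLI model scores a document against candidate
# category names without any task-specific training.
classifier = pipeline(
    "zero-shot-classification", model="facebook/bart-large-mnli"
)

doc = "The quarterly report shows a 12% increase in cloud revenue."
labels = ["finance", "legal", "technology", "healthcare"]
print(classifier(doc, candidate_labels=labels))
```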

@@ -49,7 +46,6 @@ The project aims to develop skills in:
#### Real-Time Document Categorization and Tagging

1. **Real-Time Processing Pipeline**:

- Develop a pipeline to handle real-time document classification and tagging, ensuring minimal latency.
- Set up batching or streaming mechanisms to manage high-volume document input and optimize throughput.

@@ -60,7 +56,6 @@ The project aims to develop skills in:
#### Transfer Learning and Model Optimization

1. **Transfer Learning for Domain-Specific Contexts**:

- Fine-tune the pre-trained language models to specialize in specific document types or industry contexts.
- Implement training routines to adapt the model to new domains without extensive retraining on each dataset.

@@ -71,7 +66,6 @@ The project aims to develop skills in:
#### Visualization and Monitoring

1. **Real-Time Dashboard**:

- Develop a **Streamlit** or **Flask** app to display real-time categorization and tagging results.
- Include visualizations of category distributions, tag counts, and language breakdowns.

@@ -107,22 +101,48 @@ document-categorization-tagging/
└── requirements.txt
```

-### Timeline (2-3 weeks)
+### Tips

+1. **Data Quality & Preprocessing**
+- Pay attention to encoding, text cleaning, and normalization, especially with multi-language data.
+- Always remove unwanted characters, duplicated text, or formatting artifacts before training.

+2. **Multi-Language Handling**
+- Use automatic language detection to route documents to the right SpaCy or Hugging Face model (see the sketch after these tips).
+- Keep tokenization language-specific to avoid poor segmentation.

+3. **Model Training**
+- Start with a small pre-trained model (e.g., DistilBERT) before moving to larger models like BERT.
+- Regularly save checkpoints during fine-tuning to avoid losing progress.

+4. **Context-Aware Tagging**
+- Use **Named Entity Recognition (NER)** results to enrich tag generation.
+- Combine rule-based and machine learning approaches for higher tagging precision.

+5. **Real-Time Performance**
+- Batch incoming documents to improve processing speed.
+- Consider using asynchronous calls if you implement real-time tagging with Flask or Streamlit.

-**Week 1**:
+6. **Evaluation**
+- Evaluate your model using precision, recall, and F1-score.
+- Test the tagging accuracy separately from classification accuracy.

-- **Days 1-3**: Dataset loading, EDA, and project structure setup.
-- **Days 4-7**: Implement baseline text classification and tagging models with transfer learning.
+7. **Visualization**
+- Display model performance metrics in the dashboard (accuracy, latency, language stats).
+- Visualize the frequency of categories and tags over time.

-**Week 2**:
+8. **Code Quality**
+- Keep your scripts modular and well-documented.
+- Use functions for data loading, preprocessing, and inference to simplify debugging and reusability.

-- **Days 1-3**: Develop context-aware tagging and real-time processing pipeline.
-- **Days 4-7**: Add multi-language support and optimize for high-volume document processing.
+9. **Scalability**
+- Plan for deployment — ensure the pipeline can handle large volumes of documents.
+- Optimize models with pruning or quantization to reduce latency.

-**Week 3**:
+10. **Interpretability**

-- **Days 1-4**: Develop the Streamlit/Flask app and integrate visualization and monitoring tools.
-- **Days 5-7**: Document the project and prepare the README with usage instructions.
+- Log top keywords or entities that influence categorization decisions.
+- Make your dashboard explain how and why each document was categorized.
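
To make tips 2 and 4 concrete, here is a sketch that routes a document to a language-specific SpaCy pipeline and derives candidate tags from named entities; the `langdetect` dependency and the model names are assumptions, not requirements of the subject:

```python
import spacy
from langdetect import detect  # assumed helper: pip install langdetect

# Language-specific pipelines; download the models first, e.g.
# `python -m spacy download en_core_web_sm`.
PIPELINES = {
    "en": spacy.load("en_core_web_sm"),
    "fr": spacy.load("fr_core_news_sm"),
}

def tag_document(text: str) -> list[str]:
    """Detect the language, route to the right pipeline, tag via NER."""
    lang = detect(text)                         # e.g. "en" or "fr"
    nlp = PIPELINES.get(lang, PIPELINES["en"])  # fall back to English
    doc = nlp(text)
    # Use entity texts as candidate tags; combine with rules as needed.
    return sorted({ent.text for ent in doc.ents})

print(tag_document("Apple is opening a new office in Paris."))
```

And for tip 6, a toy evaluation that scores classification quality on its own (tagging accuracy should be measured separately):

```python
from sklearn.metrics import classification_report

# Illustrative labels only; use your held-out test set in practice.
y_true = ["finance", "legal", "tech", "finance", "tech"]
y_pred = ["finance", "tech", "tech", "finance", "legal"]
print(classification_report(y_true, y_pred, zero_division=0))
```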

### Resources
