Changes from all commits (50 commits)
aeaa80f
fix(box_recursion): fix expected code (#3027)
sadiqui Aug 28, 2025
2582638
docs(rust-piscine): fix small typos (#3018)
sadiqui Aug 28, 2025
543cba7
refactor(rust-piscine): fix highlights in notions links (#3015)
sadiqui Aug 28, 2025
441fb88
[EXTERNAL] fix(numpy): clarify CSV header exclusion in instructions (…
vpollo11 Aug 28, 2025
281b62b
docs(printifnot): less ambiguous instructions (#2996)
sadiqui Aug 28, 2025
b51859b
fix(ci/cd): bump `checkout`, `login-action` & `markdown-link-checker`…
HarryVasanth Aug 29, 2025
25232a8
fix(sh/tests): bump `debian:stable-slim`
HarryVasanth Aug 29, 2025
d69f6cb
fix(js/tests): bump `alpine:3`
HarryVasanth Aug 29, 2025
5742b16
[EXTERNAL] fix(keras-2): correct typo in README (no → not and go → go…
vpollo11 Aug 31, 2025
977e708
[EXTERNAL] docs(keras2): use categorical_crossentropy for single-labe…
vpollo11 Aug 31, 2025
53cb846
fix(rust-piscine): correct `banner` and `drawing` subjects (#3016)
madaghaxx Sep 2, 2025
00b9342
fix(`rust-piscine`): fix checkpoint subjects (#3024)
sadiqui Sep 2, 2025
14a2f26
fix(markdown-link-checker): run on latest
HarryVasanth Sep 2, 2025
ebfceba
fix(markdown-link-checker): run on main
HarryVasanth Sep 2, 2025
a71ed85
remove allow dead_code (#3032)
lfarssi Sep 3, 2025
fe131e5
docs(drawing/audit): fix small typo, correct verb form (#3014)
sadiqui Sep 3, 2025
0bf16c5
fix(`chaikin`): update broken link + grammar correction (#3021)
sadiqui Sep 3, 2025
cab37c4
docs: sync java checkpoint with module repo
Sep 19, 2025
1a1390c
docs: sync the java checkpoint with the module
Sep 22, 2025
b487e4d
fix(rust): reflect changes on module (#3063)
pedrodesu Sep 22, 2025
4930a26
CON-3602 Fix `quest-08` (#3009)
pedrodesu Sep 23, 2025
e249dc8
docs(day-of-week): fix typo in the ExerciseRunner
Sep 25, 2025
2857bb0
fix(rust-checkpoints): improved remaining checkpoints (#3068)
pedrodesu Oct 2, 2025
e4cdefa
CON-3676 Fix `quest-09` (#3070)
pedrodesu Oct 3, 2025
d3ea787
[EXTERNAL] feat(backtesting-sp500): Add resources part for the subject
vpollo11 Oct 4, 2025
2633928
chore: fix prettier fmt
vpollo11 Oct 5, 2025
7af796c
[EXTERNAL] docs: fix and update resource links in documentation
vpollo11 Oct 4, 2025
47d5d41
chore: fix prettier fmt
vpollo11 Oct 5, 2025
c457661
[EXTERNAL] docs (classification): fix and update resource links in do…
vpollo11 Oct 4, 2025
42ca8ea
Updates README.md
vpollo11 Oct 4, 2025
f7c8afb
[EXTERNAL] docs (data-wrangling): fix and update resource links in do…
vpollo11 Oct 4, 2025
7548fdc
Updates README.md
vpollo11 Oct 4, 2025
4da70b1
[EXTERNAL] docs (document-categorization): replace timeline part with…
vpollo11 Oct 4, 2025
a641a0e
chore: fix prettier fmt
vpollo11 Oct 5, 2025
439c5ca
[EXTERNAL] docs (emotions-detector): fix and update resource links in…
vpollo11 Oct 4, 2025
811e00d
Updates README.md
vpollo11 Oct 4, 2025
489b15a
chore: fix prettier fmt
vpollo11 Oct 5, 2025
ad89747
[EXTERNAL] docs (matrix-factorization): fix and update resource links…
vpollo11 Oct 5, 2025
b0d72d9
Update README.md
vpollo11 Oct 5, 2025
c90dbff
[EXTERNAL] docs (nlp-spicy): fix and update resource links in documen…
vpollo11 Oct 5, 2025
c78cc4f
[EXTERNAL] docs (pipeline): fix and update resource links in document…
vpollo11 Oct 6, 2025
3d7f9ea
[EXTERNAL] docs (pipeline): fix and update resource links in document…
vpollo11 Oct 6, 2025
4df2a4c
Update README.md
vpollo11 Oct 6, 2025
9c81543
[EXTERNAL] docs (sp500): fix and update resource links in documentation
vpollo11 Oct 6, 2025
7e513dd
[EXTERNAL] docs (time-series): fix and update resource links in docum…
vpollo11 Oct 6, 2025
68d07be
chore: fix prettier fmt
vpollo11 Oct 6, 2025
cdece5b
docs(vision-track): remove timeline part from the subject
vpollo11 Oct 6, 2025
a262e5f
chore(visualizations): fix prettier fmt
vpollo11 Oct 6, 2025
8ed3b06
[EXTERNAL] fix(pandas): replace broken "Ultimate Pandas Resource" lin…
vpollo11 Sep 22, 2025
4565a56
Update README.md
vpollo11 Oct 6, 2025
6 changes: 3 additions & 3 deletions .github/workflows/ga-image-build-branch.yml
@@ -14,17 +14,17 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v3
+uses: actions/checkout@v5

- name: 📦 Login to GitHub Container Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: 🐳 Login to docker.01-edu.org Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: docker.01-edu.org
username: ${{ secrets.USER_DOCKER_01EDU_ORG }}
6 changes: 3 additions & 3 deletions .github/workflows/ga-image-build-master.yml
@@ -11,17 +11,17 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v3
+uses: actions/checkout@v5

- name: 📦 Login to GitHub Container Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: 🐳 Login to docker.01-edu.org Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
with:
registry: docker.01-edu.org
username: ${{ secrets.USER_DOCKER_01EDU_ORG }}
2 changes: 1 addition & 1 deletion .github/workflows/ga-misc-check-compliance.yml
@@ -21,7 +21,7 @@ jobs:

steps:
- name: 🐧 Check out repository code
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

4 changes: 2 additions & 2 deletions .github/workflows/ga-misc-check-links.yml
@@ -16,7 +16,7 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

@@ -27,6 +27,6 @@ jobs:

- name: 🔗 Run Check Links
if: steps.changed-md.outputs.changed_files != ''
-uses: harryvasanth/markdown-link-checker@v1.2
+uses: harryvasanth/markdown-link-checker@main
with:
files: ${{ steps.changed-md.outputs.changed_files }}
2 changes: 1 addition & 1 deletion .github/workflows/ga-misc-check-prettier.yml
@@ -13,7 +13,7 @@ jobs:

steps:
- name: 🐧 Checkout
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

2 changes: 1 addition & 1 deletion .github/workflows/ga-misc-check-shellcheck.yml
@@ -14,7 +14,7 @@ jobs:

steps:
- name: 🐧 Check out repository code
-uses: actions/checkout@v4
+uses: actions/checkout@v5
with:
fetch-depth: 0

2 changes: 1 addition & 1 deletion js/tests/Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.01-edu.org/alpine:3.17.0
+FROM alpine:3

# Installs latest Chromium package.
RUN apk add --no-cache \
2 changes: 1 addition & 1 deletion sh/tests/Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.01-edu.org/debian:10-slim
+FROM debian:stable-slim

RUN apt-get update
RUN apt-get -y install jq curl tree apt-utils
33 changes: 19 additions & 14 deletions subjects/ai/backtesting-sp500/README.md
@@ -36,7 +36,6 @@ The input files are:
data.

The adjusted close price may be unavailable for three main reasons:

- The company doesn't exist at date `d`
- The company is not publicly traded
- Its close price hasn't been reported
@@ -68,7 +67,6 @@ There are four parts:
#### 2. Data wrangling and preprocessing

- Create a Jupyter Notebook to analyze the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:

- Missing values analysis
- Outliers analysis (there are a lot of outliers)
- Visualize and analyze the average price for companies over time or compare the price consistency across different companies within the dataset. Save the plot as an image.
@@ -77,11 +75,9 @@ There are four parts:
_Note: create functions that generate the plots and save them in the `images` directory. Add a `plot` parameter with a default value of `False` so the plot isn't returned by default. This will be useful for the correction, letting people run your code without overwriting your plots._

- Here is how the `prices` data should be preprocessed:

- Resample data on month and keep the last value
- Filter price outliers: remove prices outside the range [$0.1, $10k]
- Compute monthly returns:

- Historical returns. **returns(current month) = (price(current month) - price(previous month)) / price(previous month)**
- Future returns. **returns(current month) = (price(next month) - price(current month)) / price(current month)**
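
Putting these preprocessing steps together, here is a minimal pandas sketch (the toy data and variable names are illustrative, not prescribed by the subject):

```python
import numpy as np
import pandas as pd

# Toy daily prices: a DatetimeIndex with one column per company.
dates = pd.date_range("2020-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {
        "AAA": 100 + rng.normal(0, 1, len(dates)).cumsum(),
        "BBB": 50 + rng.normal(0, 1, len(dates)).cumsum(),
    },
    index=dates,
).astype(np.float32)  # a smaller float type reduces memory usage

# Resample on month and keep the last value ("ME" on recent pandas versions).
monthly = prices.resample("M").last()

# Filter price outliers: mask values outside the [$0.1, $10k] range.
monthly = monthly.where((monthly > 0.1) & (monthly < 10_000))

# Historical returns: (price(t) - price(t-1)) / price(t-1)
historical = monthly.pct_change()

# Future returns: (price(t+1) - price(t)) / price(t)
future = monthly.pct_change().shift(-1)

# The same resample / pct_change steps apply to the sp500 adjusted close.
```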

@@ -102,7 +98,6 @@ At this stage the DataFrame should look like this:
- Print `prices.isna().sum()`

- Here is how the `sp500.csv` data should be preprocessed:

- Resample data on month and keep the last value
- Compute historical monthly returns on the adjusted close

@@ -183,47 +178,38 @@ project
### Tips:

1. Data Quality Management:

- Be prepared to encounter messy data. Financial datasets often contain errors, outliers, and missing values.
- Develop a systematic approach to identify and handle data quality issues.

2. Memory Optimization:

- When working with large datasets, optimize memory usage by selecting appropriate data types for each column.
- Consider using smaller data types like np.float32 for floating-point numbers when precision allows.

3. Exploratory Data Analysis:

- Spend time understanding the data through visualization and statistical analysis before diving into strategy development.
- Pay special attention to outliers and their potential impact on your strategy.

4. Preprocessing Financial Data:

- When resampling time series data, be mindful of which value to keep (e.g., last value for month-end prices).
- Calculate both historical and future returns to avoid look-ahead bias in your strategy.

5. Handling Outliers:

- Develop a method to identify and handle outliers that is specific to each company's historical data.
- Be cautious about removing outliers during periods of high market volatility (e.g., 2008-2009 financial crisis).

6. Signal Creation:

- Start with a simple signal (like past 12-month average returns) before exploring more complex strategies.
- Ensure your signal doesn't use future information that wouldn't have been available at the time of decision.

7. Backtesting:

- Implement your backtesting logic without loops, relying on vectorized pandas operations for better performance (a sketch follows these tips).
- Compare your strategy's performance against a relevant benchmark (in this case, the S&P 500).

8. Visualization:

- Create clear, informative visualizations to communicate your strategy's performance.
- Include cumulative return plots to show how your strategy performs over time compared to the benchmark.

9. Code Structure:

- Organize your code into modular functions for better readability and reusability.
- Use a main script to orchestrate the entire process from data loading to results visualization.

@@ -232,3 +218,22 @@ project
- Be prepared to explain any anomalies or unexpected results in your strategy's performance.

Remember, the goal is not just to create a strategy that looks good on paper, but to develop a robust process for analyzing financial data and testing investment ideas.
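
As mentioned in tip 7, the backtest itself can be written without loops. A minimal vectorized sketch follows; the toy data, the 12-month-average signal, and the 20-stock portfolio size are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy monthly returns: rows are month-ends, columns are companies.
rng = np.random.default_rng(1)
dates = pd.date_range("2015-01-31", periods=60, freq="M")
tickers = [f"C{i:02d}" for i in range(50)]
historical = pd.DataFrame(
    rng.normal(0.01, 0.05, (len(dates), len(tickers))),
    index=dates, columns=tickers,
)
future = historical.shift(-1)  # next month's return, avoids look-ahead bias

# Signal: past 12-month average return, computed for all companies at once.
signal = historical.rolling(12).mean()

# Each month, hold the 20 best-ranked companies with equal weights; no loops.
top20 = signal.rank(axis=1, ascending=False) <= 20
strategy = future.where(top20).mean(axis=1)

# Cumulative return series, ready to plot against the S&P 500 benchmark.
cumulative = (1 + strategy.fillna(0)).cumprod()
print(cumulative.tail())
```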

+### Resources

+- **Python & Data Analysis**
+- [pandas Documentation](https://pandas.pydata.org/docs/) – handling time series, resampling, returns.
+- [NumPy Documentation](https://numpy.org/doc/) – vectorized operations and memory optimization.
+- [Matplotlib Documentation](https://matplotlib.org/stable/index.html) – plotting cumulative returns and EDA visuals.

+- **Finance & Backtesting**
+- [Investopedia – Backtesting](https://www.investopedia.com/terms/b/backtesting.asp) – introduction to strategy testing.
+- [Corporate Finance Institute – What is Backtesting?](https://corporatefinanceinstitute.com/resources/data-science/backtesting/#:~:text=Backtesting%20involves%20applying%20a%20strategy,employ%20and%20tweak%20successful%20strategies.) – practical overview of backtesting logic.
+- [S&P 500 Index (Wikipedia)](https://en.wikipedia.org/wiki/S%26P_500) – background on the index and its historical changes.

+- **Data Cleaning & Outliers**
+- [Handling Missing Data in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

+- **Quantitative Strategies**
+- [Momentum Investing (Investopedia)](https://www.investopedia.com/terms/m/momentum_investing.asp) – theory behind using past returns as a signal.
+- [Risk-Adjusted Return (Investopedia)](https://www.investopedia.com/terms/r/riskadjustedreturn.asp) – understanding risk-adjusted performance.
8 changes: 3 additions & 5 deletions subjects/ai/classification/README.md
@@ -351,12 +351,10 @@ def predict_one_vs_all(X, clf0, clf1, clf2):

### Resources

-- [Logistic regression](https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102)
+- [Logistic regression](https://www.ibm.com/think/topics/logistic-regression)

-- [Logloss](https://www.datacamp.com/tutorial/the-cross-entropy-loss-function-in-machine-learning)
+- [Logloss](https://www.geeksforgeeks.org/machine-learning/what-is-cross-entropy-loss-function/)

-- [More on logistic regression](https://medium.com/swlh/what-is-logistic-regression-62807de62efa)
+- [More on logistic regression](https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf)

- [Logistic regression 1](https://www.kaggle.com/code/rahulrajpandey31/logistic-regression-from-scratch-iris-data-set)

- [Logistic regression 2](https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a)
8 changes: 3 additions & 5 deletions subjects/ai/credit-scoring/README.md
@@ -24,21 +24,19 @@ There are 3 expected deliverables associated with the scoring model:

- An exploratory data analysis notebook that describes the insights you find out in the data set.
- The trained machine learning model with the features engineering pipeline:

- Do not forget: **Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.**
- The model is validated if the **AUC on the test set is at least 55%, ideally up to 62% (or, in the best cases, above 62%)**.
- The labelled test data is not publicly available. However, a Kaggle competition uses the same data. The procedure to evaluate test set submissions is the same as the one used for project 1.
- Here are the [DataSets](https://assets.01-edu.org/ai-branch/project5/home-credit-default-risk.zip).

- A report on model training and evaluation:

- Include learning curves (training and validation scores vs. training set size or epochs) to demonstrate that the model is not overfitting.
- Explain the measures taken to prevent overfitting, such as early stopping or regularization techniques.
- Justify your choice of when to stop training based on the learning curves (a minimal sketch follows this list).
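
A hedged sketch of producing such learning curves with scikit-learn; the toy data and estimator stand in for your own features and model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Toy stand-ins for the real features, labels, and scoring model.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(random_state=0),
    X, y,
    scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
)

# Converging train/validation AUC as the training set grows is evidence
# that the model is not overfitting; plot these curves for the report.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```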

#### Kaggle submission

-The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations.
+The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://www.kaggle.com/datasets/parisrohan/credit-score-classification) that gives detailed explanations.

- Create a username following this structure: `username_01EDU_location_MM_YYYY`. Submit the profile description and push it to the Git platform on the first day of the week. Do not touch this file anymore.

@@ -55,7 +53,7 @@ There are different levels of transparency:
- **Global**: understand the important variables in a model. This answers the question: "What are the key variables for the model?". In that case it will tell, for example, whether revenue is more important to the model than age. This allows checking that the model relies on meaningful variables. No one wants their credit refused because of the weather in Lisbon!
- **Local**: each observation gets its own set of interpretability factors. This greatly increases transparency. We can explain why a case receives its prediction and the contributions of the predictors. Traditional variable importance algorithms only show results across the entire population, not for each individual case. Local interpretability enables us to pinpoint and contrast the impacts of the factors.

-There are 2 tools you can use to analyse your model and its predictions: - Features importance (available if you use a Scikit Learn model) - [SHAP library](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d)
+There are 2 tools you can use to analyse your model and its predictions: - Features importance (available if you use a Scikit Learn model) - [SHAP library](https://shap.readthedocs.io/en/latest/)
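
A minimal sketch of both levels with the SHAP library, using a toy model as a stand-in for your trained scoring model:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-ins for the real training data and credit-scoring model.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one row of contributions per sample

# Local: the per-feature contributions behind customer 0's prediction.
print(shap_values[0])

# Global: mean absolute contribution of each feature across the population.
print(np.abs(shap_values).mean(axis=0))
```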

Implement a program that takes as input the trained model, the customer id ... and returns:

@@ -121,4 +119,4 @@ Remember, creating a great credit scoring model is like baking a perfect cake -

### Resources

-- [Interpreting machine learning models](https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f)
+- [Interpreting machine learning models](https://neptune.ai/blog/ml-model-interpretation-tools)
2 changes: 1 addition & 1 deletion subjects/ai/data-wrangling/README.md
@@ -309,4 +309,4 @@ The first 3 rows of the DataFrame should look like this:

- [Pandas tutorial](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)

-- [Pandas iteration](https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe)
+- [Pandas iteration](https://www.geeksforgeeks.org/pandas/different-ways-to-iterate-over-rows-in-pandas-dataframe/)
52 changes: 36 additions & 16 deletions subjects/ai/document-categorization/README.md
@@ -22,7 +22,6 @@ The project aims to develop skills in:
#### Data Loading and Preprocessing

1. **Dataset Preparation**:

- Load a dataset containing various document types across multiple categories and languages.
- Preprocess the data, including text normalization, tokenization, and handling multi-language support.

@@ -33,12 +32,10 @@ The project aims to develop skills in:
#### Model Development

1. **Text Classification Model**:

- Implement a **text classification model** using **TensorFlow** or **Keras**, starting with a baseline architecture.
- Use **transfer learning** to enhance the model’s domain adaptability, incorporating pre-trained language models such as **BERT** or **DistilBERT** (a baseline sketch follows the next item).

2. **Tagging with NLP Libraries**:

- Leverage **SpaCy** to develop an intelligent tagging system that can assign tags based on the document's content and context.
- Ensure the tagging system supports multi-language functionality, utilizing language models for effective tagging in different languages.
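
A minimal zero-shot baseline is sketched below. The checkpoint name and the `pipeline` call are standard Hugging Face usage, but treat the whole block as an illustration to be replaced by a DistilBERT model fine-tuned on your labelled categories:

```python
from transformers import pipeline

# Zero-shot baseline: an NLI model scores a document against candidate
# category names without any task-specific training.
classifier = pipeline(
    "zero-shot-classification", model="facebook/bart-large-mnli"
)

doc = "The quarterly report shows a 12% increase in cloud revenue."
labels = ["finance", "legal", "technology", "healthcare"]
print(classifier(doc, candidate_labels=labels))
```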

@@ -49,7 +46,6 @@ The project aims to develop skills in:
#### Real-Time Document Categorization and Tagging

1. **Real-Time Processing Pipeline**:

- Develop a pipeline to handle real-time document classification and tagging, ensuring minimal latency.
- Set up batching or streaming mechanisms to manage high-volume document input and optimize throughput.

@@ -60,7 +56,6 @@ The project aims to develop skills in:
#### Transfer Learning and Model Optimization

1. **Transfer Learning for Domain-Specific Contexts**:

- Fine-tune the pre-trained language models to specialize in specific document types or industry contexts.
- Implement training routines to adapt the model to new domains without extensive retraining on each dataset.

@@ -71,7 +66,6 @@ The project aims to develop skills in:
#### Visualization and Monitoring

1. **Real-Time Dashboard**:

- Develop a **Streamlit** or **Flask** app to display real-time categorization and tagging results.
- Include visualizations of category distributions, tag counts, and language breakdowns.

@@ -107,22 +101,48 @@ document-categorization-tagging/
└── requirements.txt
```

-### Timeline (2-3 weeks)
+### Tips

+1. **Data Quality & Preprocessing**
+- Pay attention to encoding, text cleaning, and normalization, especially with multi-language data.
+- Always remove unwanted characters, duplicated text, or formatting artifacts before training.

+2. **Multi-Language Handling**
+- Use automatic language detection to route documents to the right SpaCy or Hugging Face model (see the sketch after these tips).
+- Keep tokenization language-specific to avoid poor segmentation.

+3. **Model Training**
+- Start with a small pre-trained model (e.g., DistilBERT) before moving to larger models like BERT.
+- Regularly save checkpoints during fine-tuning to avoid losing progress.

+4. **Context-Aware Tagging**
+- Use **Named Entity Recognition (NER)** results to enrich tag generation.
+- Combine rule-based and machine learning approaches for higher tagging precision.

+5. **Real-Time Performance**
+- Batch incoming documents to improve processing speed.
+- Consider using asynchronous calls if you implement real-time tagging with Flask or Streamlit.

-**Week 1**:
+6. **Evaluation**
+- Evaluate your model using precision, recall, and F1-score.
+- Test the tagging accuracy separately from classification accuracy.

-- **Days 1-3**: Dataset loading, EDA, and project structure setup.
-- **Days 4-7**: Implement baseline text classification and tagging models with transfer learning.
+7. **Visualization**
+- Display model performance metrics in the dashboard (accuracy, latency, language stats).
+- Visualize the frequency of categories and tags over time.

-**Week 2**:
+8. **Code Quality**
+- Keep your scripts modular and well-documented.
+- Use functions for data loading, preprocessing, and inference to simplify debugging and reusability.

-- **Days 1-3**: Develop context-aware tagging and real-time processing pipeline.
-- **Days 4-7**: Add multi-language support and optimize for high-volume document processing.
+9. **Scalability**
+- Plan for deployment — ensure the pipeline can handle large volumes of documents.
+- Optimize models with pruning or quantization to reduce latency.

-**Week 3**:
+10. **Interpretability**

-- **Days 1-4**: Develop the Streamlit/Flask app and integrate visualization and monitoring tools.
-- **Days 5-7**: Document the project and prepare the README with usage instructions.
+- Log top keywords or entities that influence categorization decisions.
+- Make your dashboard explain how and why each document was categorized.
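
To make tips 2 and 4 concrete, here is a sketch that routes a document to a language-specific SpaCy pipeline and derives candidate tags from named entities; the `langdetect` dependency and the model names are assumptions, not requirements of the subject:

```python
import spacy
from langdetect import detect  # assumed helper: pip install langdetect

# Language-specific pipelines; download the models first, e.g.
# `python -m spacy download en_core_web_sm`.
PIPELINES = {
    "en": spacy.load("en_core_web_sm"),
    "fr": spacy.load("fr_core_news_sm"),
}

def tag_document(text: str) -> list[str]:
    """Detect the language, route to the right pipeline, tag via NER."""
    lang = detect(text)                         # e.g. "en" or "fr"
    nlp = PIPELINES.get(lang, PIPELINES["en"])  # fall back to English
    doc = nlp(text)
    # Use entity texts as candidate tags; combine with rules as needed.
    return sorted({ent.text for ent in doc.ents})

print(tag_document("Apple is opening a new office in Paris."))
```

And for tip 6, a toy evaluation that scores classification quality on its own (tagging accuracy should be measured separately):

```python
from sklearn.metrics import classification_report

# Illustrative labels only; use your held-out test set in practice.
y_true = ["finance", "legal", "tech", "finance", "tech"]
y_pred = ["finance", "tech", "tech", "finance", "legal"]
print(classification_report(y_true, y_pred, zero_division=0))
```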

### Resources
