Stress Detection: High-Performance Users Analysis

Overview

Analysis of 13 high-performance users selected from validation/test metrics ≥80% in at least one metric.

Selected Users

P124, P135, P045, P050, P052, P071, P133, P046, P056, P048, P024, P094, P105

Directory Structure

`/selected_users_dataset/`

Main Datasets:

full_216features.pkl - Full dataset (13 users, 216 features)
reduced_49features.pkl - Reduced dataset (13 users, 49 features)
feature_mapping.pkl - Feature mapping info

Results Analysis:

results/ - Feature importance analysis results
- feature_summary.csv - All users feature importance summary
- performance_features.csv - Performance + top features per user
- P{ID}_feature_importance.csv - Individual user feature importance

Key Findings

Feature Reduction

77% reduction: 216 → 49 features
98% performance retention on average
Top users maintain >100% performance with reduced features

Performance (Test AUROC)

Mean: 0.818 ± 0.075 (full), 0.799 ± 0.080 (reduced)
Best users: P135 (0.927), P124 (0.922), P045 (0.917)
Feature efficiency: 49 features capture 89% of importance

Most Important Features (All Users)

feature_63, feature_212, feature_64 (top 3)
17 features important for all 13 users
49 features cover 99% of user-specific needs

Usage

Environment: Always use conda activate navsim

Load datasets:

# Full dataset
with open('selected_users_dataset/full_216features.pkl', 'rb') as f:
    X, y, users, timestamps, feature_names = pickle.load(f)

# Reduced dataset
with open('selected_users_dataset/reduced_49features.pkl', 'rb') as f:
    X_reduced, y, users, timestamps, feature_names = pickle.load(f)

Workflow

Follow these steps whenever you refresh datasets, models, feature importances, or cross-user analysis assets:

0. Rebuild the selected-users dataset (optional)

If you want to choose the user subset dynamically—e.g., drop users whose class imbalance exceeds 3× or restrict to an explicit user list—run:

python scripts/preprocessing/create_dataset.py \
  --source-pkl ~/Archived/stress_binary_personal-current.pkl \
  --selected-users-file selected_ids.txt \
  --max-label-ratio 3.0

(omit --selected-users-file to keep every user in the source dataset)

create_dataset.py loads the raw dataset, optionally filters by user ID and label imbalance, and rewrites both selected_users_dataset/full_216features.pkl and full_216features_normalized.pkl with the surviving users.

1. Train per-user models on the full feature set

python scripts/training/train_models.py \
  --dataset-path selected_users_dataset/full_216features_normalized.pkl \
  --raw-dataset-path selected_users_dataset/full_216features.pkl \
  --selection-threshold 0.8 \
  --save-models \
  --user-jobs 4

Outputs
- selected_users_dataset/results/*.csv (per-user & summary importances)
- selected_users_dataset/models/full_216features_normalized/<USER>/ (fold LightGBM models, written in parallel)
- Datasets rewritten to include only users achieving val_auroc ≥ selection-threshold
- Use --threads-per-model if you prefer to pin LightGBM thread count manually (default derives from CPU/user-jobs)

2. Reduce features and retrain on the 49-feature subset

python scripts/preprocessing/reduce_features.py

Builds reduced_49features.pkl and reuses the pre-normalized full dataset to emit reduced_49features_normalized.pkl
Retrains per-user LightGBM models on the reduced data (saved under selected_users_dataset/models/reduced_49features_normalized/<USER>/)
Stores updated CSVs in selected_users_dataset/results/reduced/

(Optional) 2b. Generate leaf-index embeddings for OTDD

python scripts/preprocessing/generate_leaf_embeddings.py \
  --dataset-path selected_users_dataset/reduced_49features_normalized.pkl \
  --output-path selected_users_dataset/reduced_49features_leaf.pkl \
  --num-boost-round 500 \
  --num-leaves 64 \
  --num-threads 8

Trains (or loads) a global LightGBM model and converts each sample into a one-hot leaf embedding.
The resulting *_leaf.pkl files can be loaded via load_selected_dataset('reduced_leaf') when computing OTDD.

3. OTDD distances & utilities

utility.py exposes helpers that consume the saved models/importances:

from utility import load_selected_dataset, calculate_user_similarity_ranking

X, y, users, _, feature_names = load_selected_dataset('reduced')
features_df = pd.DataFrame(X, columns=feature_names)

ranking = calculate_user_similarity_ranking(
    train_user='P124',
    features=features_df,
    labels=y,
    group_indices=users,
    feature_list=feature_names,
    results_dir='selected_users_dataset/results',
    results_subdir='reduced',
    cache_path='otdd_distances_P124.npy'
)

4. Cross-user evaluation using saved models

python cross_user_evaluation.py P124 Ubicomp/otdd_distances_P124.npy reduced_49features_normalized

train_user (arg 1): source user to transfer from
distance_matrix_path (arg 2): OTDD matrix produced via utility.py
model_dataset_tag (arg 3): dataset tag matching the saved model bundle (e.g., reduced_49features_normalized)

Cross-user evaluation loads the persisted boosters + scaler, reports metrics against all other users, and writes P124_cross_user_evaluation.csv.

실행 요약 (한글)

선택 사용자 재구성(선택) – create_dataset.py를 실행해 원본 데이터를 정규화 버전까지 생성하고, 필요하다면 사용자 ID나 라벨 비율로만 필터링한다.
풀 피처 학습 – train_models.py --selection-threshold <값> --user-jobs <병렬수>를 실행하면 Optuna 학습 → AUROC 기준 사용자 선별 → 필터링된 데이터셋 재생성이 한 번에 진행되고, 최종 모델/결과가 results/와 models/full_216features_normalized/에 저장된다 (--threads-per-model로 LightGBM 스레드 수 조절 가능).
피처 축소/재학습 – python scripts/preprocessing/reduce_features.py 실행 → 49개 피처 데이터(정규화 포함)를 생성하고 재학습 결과가 results/reduced/, models/reduced_49features_normalized/에 저장. (선택) scripts/preprocessing/generate_leaf_embeddings.py로 Leaf 임베딩(*_leaf.pkl)을 생성하면 OTDD 계산 시 모델 기반 표현을 사용할 수 있음.
OTDD 계산 – utility.py의 calculate_user_similarity_ranking 등을 사용하여 OTDD 거리 행렬(*.npy) 생성. 이때 필요한 사용자별 중요도/모델은 위에서 저장한 리소스를 사용.
교차 사용자 평가 – python cross_user_evaluation.py <훈련유저> <OTDD행렬경로> reduced_49features_normalized 형태로 실행하여 저장된 모델을 불러와 transfer 성능을 평가.

OTDD Distance Calculation

IMPORTANT: The OTDD distance calculation in utility.py has NO FALLBACK mechanisms. If OTDD computation fails for any reason (memory issues, PyTorch version incompatibility, etc.), the function will RAISE AN EXCEPTION and terminate. This is by design.

No Wasserstein fallback
No default distance values
Fail fast approach: If OTDD doesn't work, fix the underlying issue rather than masking it
OTDD 라이브러리는 navsim 환경에 설치된 microsoft/otdd 레포지토리 버전을 사용하며, 코드에서는 from otdd.pytorch.distance import DatasetDistance 형태로 임포트한다.

Supported configurations:

49-feature datasets with 49-feature importance weights
216-feature datasets with 216-feature importance weights
NO cross-dimensional mapping (e.g., no 216→49 feature mapping)

Domain Adaptation Pretraining Pipeline

The new domain_adaptation package bundles utilities to pretrain on two datasets, fine-tune on a target dataset, and sweep fine-tuning ratios. Key pieces:

domain_adaptation/data_utils.py — dataset loading and feature intersection alignment (cached under .cache/).
domain_adaptation/models/ — LightGBM and transformer training pipelines with staged pretrain → fine-tune → adapt routines.
domain_adaptation/pipeline.py — experiment orchestration and result aggregation.
scripts/experiments/pretrain_transfer.py — CLI entrypoint.

Requirements

Activate the navsim conda 환경 before running any scripts, and ensure the following packages are available in that environment:

lightgbm
torch (CUDA optional)

Default scenarios

The CLI evaluates the requested combinations, mapping D-1/2/3 to ~/minseo/Archived/ pickles by default:

D-1 + D-3 → D-2
D-1 + D-2 → D-3
D-2 + D-3 → D-1

모든 데이터셋은 공통 피처 교집합으로 정렬되며, 그 공간에서 학습 및 평가가 진행된다. 결과는 지정된 미세조정 비율(0–60%)과 네 개의 시드에 대해 LightGBM과 트랜스포머 모델을 모두 기록한다.

결과 CSV에는 AUROC뿐 아니라 Accuracy, AUPRC, 학습 단계별 소요 시간 및 LightGBM best iteration 메타데이터가 포함된다.

Running the sweep

python scripts/experiments/pretrain_transfer.py \
  --cache-dir .cache/domain_adaptation \
  --output-csv results/domain_adaptation_results.csv \
  --fine-tune-ratios 0.0 0.2 0.4 0.6 \
  --fine-tune-val-ratio 0.2 \
  --seeds 42 43 44 45 \
  --model-types tree transformer \
  --pretrain-val-ratio 0.1 \
  --split-strategy stratified_group_kfold \
  --group-folds 5

Override dataset locations if needed with --dataset-override (D-1=/custom/path.pkl). Every run records both the pretrain+fine-tune workflow and the target-only baseline for comparison.

--split-strategy는 stratified_shuffle, loso, stratified_group_kfold 중 하나를 선택해 타깃 분할 방식을 제어한다.
--group-folds는 stratified_group_kfold 사용 시 폴드 개수를 지정하며, --val-size, --test-size는 stratified_shuffle 전략에서만 사용된다.
--pretrain-val-ratio로 소스(Pretrain) 데이터 검증 비율을 직접 지정할 수 있고(기본 10%), --fine-tune-val-ratio는 타깃 적응 데이터 내부에서 validation으로 떼어둘 비율을 제어한다. 필요하면 --target-train-ratio / --target-val-ratio / --target-test-ratio로 타깃 전체의 분할 비율을 명시적으로 고정할 수 있다(명시되지 않으면 --val-size, --test-size 기반 기본 분할 사용).

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
EDA		EDA
docs		docs
domain_adaptation		domain_adaptation
legacy		legacy
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
PLANNING.md		PLANNING.md
PROGRESS.md		PROGRESS.md
README.md		README.md
UbiComp_Doc_Personalization.docx		UbiComp_Doc_Personalization.docx
cross_user_evaluation.py		cross_user_evaluation.py
requirements.txt		requirements.txt
utility.py		utility.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Stress Detection: High-Performance Users Analysis

Overview

Selected Users

Directory Structure

`/selected_users_dataset/`

Key Findings

Feature Reduction

Performance (Test AUROC)

Most Important Features (All Users)

Usage

Workflow

0. Rebuild the selected-users dataset (optional)

1. Train per-user models on the full feature set

2. Reduce features and retrain on the 49-feature subset

(Optional) 2b. Generate leaf-index embeddings for OTDD

3. OTDD distances & utilities

4. Cross-user evaluation using saved models

실행 요약 (한글)

OTDD Distance Calculation

Domain Adaptation Pretraining Pipeline

Requirements

Default scenarios

Running the sweep

About

Uh oh!

Releases

Packages

Contributors 4

Uh oh!

Languages

License

Kaist-ICLab/CrossUserDataset

Folders and files

Latest commit

History

Repository files navigation

Stress Detection: High-Performance Users Analysis

Overview

Selected Users

Directory Structure

/selected_users_dataset/

Key Findings

Feature Reduction

Performance (Test AUROC)

Most Important Features (All Users)

Usage

Workflow

0. Rebuild the selected-users dataset (optional)

1. Train per-user models on the full feature set

2. Reduce features and retrain on the 49-feature subset

(Optional) 2b. Generate leaf-index embeddings for OTDD

3. OTDD distances & utilities

4. Cross-user evaluation using saved models

실행 요약 (한글)

OTDD Distance Calculation

Domain Adaptation Pretraining Pipeline

Requirements

Default scenarios

Running the sweep

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Uh oh!

Languages

`/selected_users_dataset/`

Packages