Videos are preprocessed and converted into prompts such as the following:
<frame> <loc_x1>364<loc_x1> <loc_y1>24<loc_y1> <loc_x2>649<loc_x2> <loc_y2>697<loc_y2> <action>demonstrating to the patient<action> <|endoftext|>
<frame> <frame> <frame> <frame> <frame> <frame> <frame> <frame> <frame> <frame> <action>shoulder examination<action> <|endoftext|>
<action>demonstrating to the patient<action> 1245 times <action>knee examination<action> 1139 times <action>foot examination<action> 271 times <action>pelvis check<action> 414 times <action>patient standing up<action> 48 times <action>shoulder examination<action> 699 times <action>look at the patient<action> 202 times <label>Good<label><|endoftext|>
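The tag layout above covers three cases: a single annotated frame with a bounding box and an action, a clip of frames sharing one action, and a whole-session summary of per-action frame counts plus a quality label. The sketch below shows one way such prompts could be assembled from annotations; the helper names and the annotation layout are illustrative, not the repository's actual preprocessing API.

```python
# Illustrative sketch of how the training prompts above could be assembled.
# The helper names and annotation layout are assumptions, not the repo's API.

EOS = "<|endoftext|>"

def build_frame_prompt(box, action):
    """Single annotated frame: bounding box plus action label."""
    x1, y1, x2, y2 = box
    return (
        "<frame> "
        f"<loc_x1>{x1}<loc_x1> <loc_y1>{y1}<loc_y1> "
        f"<loc_x2>{x2}<loc_x2> <loc_y2>{y2}<loc_y2> "
        f"<action>{action}<action> {EOS}"
    )

def build_clip_prompt(num_frames, action):
    """A clip of frames that share one action label."""
    frames = " ".join(["<frame>"] * num_frames)
    return f"{frames} <action>{action}<action> {EOS}"

def build_session_prompt(action_counts, label):
    """Whole-session summary: per-action frame counts plus a quality label."""
    parts = [f"<action>{a}<action> {n} times" for a, n in action_counts.items()]
    return " ".join(parts) + f" <label>{label}<label>{EOS}"

print(build_frame_prompt((364, 24, 649, 697), "demonstrating to the patient"))
print(build_clip_prompt(10, "shoulder examination"))
print(build_session_prompt({"knee examination": 1139, "foot examination": 271}, "Good"))
```

The trainer wraps a causal language model and a ViT-based multimodal preprocessor, as the module structure printed below shows: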
MultimodalTrainer(
  (model): PhysiotherapyModel(
    (model): TrasformerForCausalLM(
      (model): TrasformerModel(
        (embed_tokens): Embedding(50264, 2048)
        (layers): ModuleList(
          (0-31): 32 x TrasformerDecoderLayer(
            (self_attn): TrasformerAttention(
              (q_proj): Linear(in_features=2048, out_features=8192, bias=False)
              (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
              (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
              (o_proj): Linear(in_features=8192, out_features=2048, bias=False)
              (rotary_emb): TrasformerRotaryEmbedding()
            )
            (mlp): TrasformerMLP(
              (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
              (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
              (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
            )
            (input_layernorm): TrasformerRMSNorm()
            (post_attention_layernorm): TrasformerRMSNorm()
          )
        )
        (norm): TrasformerRMSNorm()
      )
      (lm_head): Linear(in_features=2048, out_features=50264, bias=False)
    )
    (processor): MultimodalPreprocessor(
      (vit): ViTModel(
        (embeddings): ViTEmbeddings(
          (patch_embeddings): ViTPatchEmbeddings(
            (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
          )
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (encoder): ViTEncoder(
          (layer): ModuleList(
            (0-11): 12 x ViTLayer(
              (attention): ViTSdpaAttention(
                (attention): ViTSdpaSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.0, inplace=False)
                )
                (output): ViTSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.0, inplace=False)
                )
              )
              (intermediate): ViTIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
                (intermediate_act_fn): GELUActivation()
              )
              (output): ViTOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (dropout): Dropout(p=0.0, inplace=False)
              )
              (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            )
          )
        )
        (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (pooler): ViTPooler(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (activation): Tanh()
        )
      )
      (model): TrasformerForCausalLM(
        (model): TrasformerModel(
          (embed_tokens): Embedding(50264, 2048)
          (layers): ModuleList(
            (0-31): 32 x TrasformerDecoderLayer(
              (self_attn): TrasformerAttention(
                (q_proj): Linear(in_features=2048, out_features=8192, bias=False)
                (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
                (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
                (o_proj): Linear(in_features=8192, out_features=2048, bias=False)
                (rotary_emb): TrasformerRotaryEmbedding()
              )
              (mlp): TrasformerMLP(
                (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
                (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
                (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
              )
              (input_layernorm): TrasformerRMSNorm()
              (post_attention_layernorm): TrasformerRMSNorm()
            )
          )
          (norm): TrasformerRMSNorm()
        )
        (lm_head): Linear(in_features=2048, out_features=50264, bias=False)
      )
      (image_projection): Linear(in_features=768, out_features=2048, bias=True)
    )
  )
  (loss): MultimodalLoss()
)
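In this structure, frames are encoded by the ViT (768-dimensional outputs), mapped into the language model's 2048-dimensional embedding space by `image_projection`, and consumed alongside the text tokens by the causal LM. Below is a minimal sketch of that bridging step, assuming the projected frame embeddings are simply prepended to the token embeddings; the actual fusion logic in the repository may differ.

```python
import torch
import torch.nn as nn

# Minimal sketch of the vision-to-language bridge visible in the module dump:
# ViT patch features (768-dim) are projected to the LM hidden size (2048) and
# prepended to the token embeddings. The real fusion code may differ.

class ImageToTextBridge(nn.Module):
    def __init__(self, vit_dim=768, lm_dim=2048):
        super().__init__()
        # Matches (image_projection): Linear(in_features=768, out_features=2048) above
        self.image_projection = nn.Linear(vit_dim, lm_dim)

    def forward(self, vit_features, token_embeddings):
        # vit_features:     (batch, num_patches, 768)  from the ViT encoder
        # token_embeddings: (batch, seq_len, 2048)     from embed_tokens
        image_embeds = self.image_projection(vit_features)         # (batch, num_patches, 2048)
        return torch.cat([image_embeds, token_embeddings], dim=1)  # prefix the frame tokens

bridge = ImageToTextBridge()
fused = bridge(torch.randn(1, 197, 768), torch.randn(1, 32, 2048))
print(fused.shape)  # torch.Size([1, 229, 2048])
```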
- Visit Supervisely to download the videos and annotations, or create and download your own
- Place the downloaded videos and annotations in the `data` folder and adjust the train and test CSV files as it suits you
- Run `pip install -r requirements.txt`
- Run `python setup.py`
- Make sure you have access to the PaliGemma model on Kaggle Hub
- Copy the model path, open `config/physionet.yaml`, and replace the value of `model_id` with that path (see the sketch after this list)
- Then `cd ai` and run `pip install -r requirements.txt`
- Also run `sudo apt update && sudo apt install ffmpeg`
- Run `python main.py`
- See the logs here: WandB Logs
- Run `python eval.py`
- Run `mkdir -p outputs`
- Run `python inference.py`
- Visit the localhost URL that is printed. In our case `http://127.0.0.1:7860`
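For the `model_id` step referenced in the list above, here is a rough sketch of fetching the PaliGemma weights through the `kagglehub` package and writing the resulting path into `config/physionet.yaml`. The model handle and the assumption that `model_id` is a top-level key in the config are illustrative; copy the exact handle from the model's Kaggle page.

```python
# Sketch of downloading the PaliGemma weights via kagglehub and updating
# config/physionet.yaml. Requires Kaggle credentials (e.g. ~/.kaggle/kaggle.json
# or KAGGLE_USERNAME / KAGGLE_KEY environment variables).
import kagglehub
import yaml

# The handle below is illustrative -- use the one shown on the model's Kaggle page.
path = kagglehub.model_download("google/paligemma/transformers/paligemma-3b-pt-224")

with open("config/physionet.yaml") as f:
    config = yaml.safe_load(f)

config["model_id"] = path  # assumes model_id is a top-level key in the config

with open("config/physionet.yaml", "w") as f:
    yaml.safe_dump(config, f)

print(f"model_id set to {path}")
```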