diff --git a/README.md b/README.md
index e79b81efe0..6d0fcb2134 100644
--- a/README.md
+++ b/README.md
@@ -98,21 +98,18 @@ https://user-images.githubusercontent.com/15977946/124654387-0fd3c500-ded1-11eb-
## What's New
-- We have added support for two new datasets:
+- Release [RTMO](/projects/rtmo), a state-of-the-art real-time method for multi-person pose estimation.
- - (CVPR 2023) [UBody](https://mmpose.readthedocs.io/zh_CN/latest/model_zoo_papers/datasets.html#ubody-cvpr-2023)
- - [300W-LP](https://github.com/open-mmlab/mmpose/tree/main/configs/face_2d_keypoint/topdown_heatmap/300wlp)
+ ![rtmo](https://github.com/open-mmlab/mmpose/assets/26127467/54d5555a-23e5-4308-89d1-f0c82a6734c2)
-- Support for four new algorithms:
+- Release [RTMW](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.md) models in various sizes ranging from RTMW-m to RTMW-x. The input sizes include `256x192` and `384x288`. This provides flexibility to select the right model for different speed and accuracy requirements.
- - (ICCV 2023) [MotionBERT](https://github.com/open-mmlab/mmpose/tree/main/configs/body_3d_keypoint/motionbert)
- - (ICCVW 2023) [DWPose](https://github.com/open-mmlab/mmpose/tree/main/configs/wholebody_2d_keypoint/dwpose)
- - (ICLR 2023) [EDPose](https://mmpose.readthedocs.io/zh_CN/latest/model_zoo/body_2d_keypoint.html#edpose-edpose-on-coco)
- - (ICLR 2022) [Uniformer](https://github.com/open-mmlab/mmpose/tree/main/projects/uniformer)
+- Support inference of [PoseAnything](/projects/pose_anything). Web demo is available [here](https://openxlab.org.cn/apps/detail/orhir/Pose-Anything).
-- Released the first whole-body pose estimation model, RTMW, with accuracy exceeding 70 AP on COCO-Wholebody. For details, refer to [RTMPose](/projects/rtmpose/). [Try it now!](https://openxlab.org.cn/apps/detail/mmpose/RTMPose)
+- Support for two new datasets:
-![rtmw](https://github.com/open-mmlab/mmpose/assets/13503330/635c4618-c459-45e8-84a5-eb68cf338d00)
+ - (CVPR 2023) [ExLPose](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#exlpose-dataset)
+ - (ICCV 2023) [H3WB](/docs/en/dataset_zoo/3d_wholebody_keypoint.md)
- Welcome to use the [*MMPose project*](/projects/README.md). Here, you can discover the latest features and algorithms in MMPose and quickly share your ideas and code implementations with the community. Adding new features to MMPose has become smoother:
@@ -121,6 +118,8 @@ https://user-images.githubusercontent.com/15977946/124654387-0fd3c500-ded1-11eb-
- Utilize the powerful capabilities of MMPose in the form of independent projects without being constrained by the code framework.
- Newly added projects include:
- [RTMPose](/projects/rtmpose/)
+ - [RTMO](/projects/rtmo/)
+ - [PoseAnything](/projects/pose_anything/)
- [YOLOX-Pose](/projects/yolox_pose/)
- [MMPose4AIGC](/projects/mmpose4aigc/)
- [Simple Keypoints](/projects/skps/)
@@ -130,15 +129,14 @@ https://user-images.githubusercontent.com/15977946/124654387-0fd3c500-ded1-11eb-
-- October 12, 2023: MMPose [v1.2.0](https://github.com/open-mmlab/mmpose/releases/tag/v1.2.0) has been officially released, with major updates including:
+- January 4, 2024: MMPose [v1.3.0](https://github.com/open-mmlab/mmpose/releases/tag/v1.3.0) has been officially released, with major updates including:
- - Support for new datasets: UBody, 300W-LP.
- - Support for new algorithms: MotionBERT, DWPose, EDPose, Uniformer.
- - Migration of Associate Embedding, InterNet, YOLOX-Pose algorithms.
- - Migration of the DeepFashion2 dataset.
- - Support for Badcase visualization analysis, multi-dataset evaluation, and keypoint visibility prediction features.
+  - Support for new datasets: ExLPose and H3WB
+  - Release of new RTMPose series models: RTMO and RTMW
+  - Support for the new algorithm PoseAnything
+  - Enhanced Inferencer with an optional progress bar and improved compatibility with one-stage models
- Please check the complete [release notes](https://github.com/open-mmlab/mmpose/releases/tag/v1.2.0) for more details on the updates brought by MMPose v1.2.0!
+ Please check the complete [release notes](https://github.com/open-mmlab/mmpose/releases/tag/v1.3.0) for more details on the updates brought by MMPose v1.3.0!
## 0.x / 1.x Migration
diff --git a/README_CN.md b/README_CN.md
index 1fe1a50a43..3acb01abca 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -96,21 +96,18 @@ https://user-images.githubusercontent.com/15977946/124654387-0fd3c500-ded1-11eb-
## 最新进展
-- 我们支持了两个新的数据集:
+- 发布了单阶段实时多人姿态估计模型 [RTMO](/projects/rtmo)。相比 RTMPose 在多人场景下性能更优
- - (CVPR 2023) [UBody](https://mmpose.readthedocs.io/zh_CN/latest/model_zoo_papers/datasets.html#ubody-cvpr-2023)
- - [300W-LP](https://github.com/open-mmlab/mmpose/tree/main/configs/face_2d_keypoint/topdown_heatmap/300wlp)
+ ![rtmo](https://github.com/open-mmlab/mmpose/assets/26127467/54d5555a-23e5-4308-89d1-f0c82a6734c2)
-- 支持四个新算法:
+- 发布了不同尺寸的 [RTMW](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.md) 模型,满足不同的使用场景。模型尺寸覆盖从 RTMW-m 到 RTMW-x 的模型,输入图像尺寸包含 256x192 和 384x288
- - (ICCV 2023) [MotionBERT](https://github.com/open-mmlab/mmpose/tree/main/configs/body_3d_keypoint/motionbert)
- - (ICCVW 2023) [DWPose](https://github.com/open-mmlab/mmpose/tree/main/configs/wholebody_2d_keypoint/dwpose)
- - (ICLR 2023) [EDPose](https://mmpose.readthedocs.io/zh_CN/latest/model_zoo/body_2d_keypoint.html#edpose-edpose-on-coco)
- - (ICLR 2022) [Uniformer](https://github.com/open-mmlab/mmpose/tree/main/projects/uniformer)
+- 支持了 [PoseAnything](/projects/pose_anything) 的推理。[在线试玩](https://openxlab.org.cn/apps/detail/orhir/Pose-Anything)
-- 发布首个在 COCO-Wholebody 上精度超过 70 AP 的全身姿态估计模型 RTMW,具体请参考 [RTMPose](/projects/rtmpose/)。[在线试玩](https://openxlab.org.cn/apps/detail/mmpose/RTMPose)
+- 我们支持了两个新的数据集:
-![rtmw](https://github.com/open-mmlab/mmpose/assets/13503330/635c4618-c459-45e8-84a5-eb68cf338d00)
+ - (CVPR 2023) [ExLPose](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#exlpose-dataset)
+ - (ICCV 2023) [H3WB](/docs/en/dataset_zoo/3d_wholebody_keypoint.md)
- 欢迎使用 [*MMPose 项目*](/projects/README.md)。在这里,您可以发现 MMPose 中的最新功能和算法,并且可以通过最快的方式与社区分享自己的创意和代码实现。向 MMPose 中添加新功能从此变得简单丝滑:
@@ -119,6 +116,8 @@ https://user-images.githubusercontent.com/15977946/124654387-0fd3c500-ded1-11eb-
- 通过独立项目的形式,利用 MMPose 的强大功能,同时不被代码框架所束缚
- 最新添加的项目包括:
- [RTMPose](/projects/rtmpose/)
+ - [RTMO](/projects/rtmo/)
+ - [PoseAnything](/projects/pose_anything/)
- [YOLOX-Pose](/projects/yolox_pose/)
- [MMPose4AIGC](/projects/mmpose4aigc/)
- [Simple Keypoints](/projects/skps/)
@@ -128,15 +127,14 @@ https://user-images.githubusercontent.com/15977946/124654387-0fd3c500-ded1-11eb-
-- 2023-10-12:MMPose [v1.2.0](https://github.com/open-mmlab/mmpose/releases/tag/v1.2.0) 正式发布了,主要更新包括:
+- 2024-01-04:MMPose [v1.3.0](https://github.com/open-mmlab/mmpose/releases/tag/v1.3.0) 正式发布了,主要更新包括:
- - 支持新数据集:UBody、300W-LP。
- - 支持新算法:MotionBERT、DWPose、EDPose、Uniformer
- - 迁移 Associate Embedding、InterNet、YOLOX-Pose 算法。
- - 迁移 DeepFashion2 数据集。
- - 支持 Badcase 可视化分析、多数据集评测、关键点可见性预测功能。
+ - 支持新数据集:ExLPose、H3WB
+ - 发布 RTMPose 系列新模型:RTMO、RTMW
+ - 支持新算法 PoseAnything
+ - 推理器 Inferencer 支持可选的进度条、提升与单阶段模型的适配性
- 请查看完整的 [版本说明](https://github.com/open-mmlab/mmpose/releases/tag/v1.2.0) 以了解更多 MMPose v1.2.0 带来的更新!
+ 请查看完整的 [版本说明](https://github.com/open-mmlab/mmpose/releases/tag/v1.3.0) 以了解更多 MMPose v1.3.0 带来的更新!
## 0.x / 1.x 迁移
diff --git a/configs/_base_/datasets/exlpose.py b/configs/_base_/datasets/exlpose.py
new file mode 100644
index 0000000000..29b758aa21
--- /dev/null
+++ b/configs/_base_/datasets/exlpose.py
@@ -0,0 +1,125 @@
+dataset_info = dict(
+ dataset_name='exlpose',
+ paper_info=dict(
+ author='Sohyun Lee, Jaesung Rim, Boseung Jeong, Geonu Kim,'
+ 'ByungJu Woo, Haechan Lee, Sunghyun Cho, Suha Kwak',
+ title='Human Pose Estimation in Extremely Low-Light Conditions',
+ container='arXiv',
+ year='2023',
+ homepage='https://arxiv.org/abs/2303.15410',
+ ),
+ keypoint_info={
+ 0:
+ dict(
+ name='left_shoulder',
+ id=0,
+ color=[0, 255, 0],
+ type='upper',
+ swap='right_shoulder'),
+ 1:
+ dict(
+ name='right_shoulder',
+ id=1,
+ color=[255, 128, 0],
+ type='upper',
+ swap='left_shoulder'),
+ 2:
+ dict(
+ name='left_elbow',
+ id=2,
+ color=[0, 255, 0],
+ type='upper',
+ swap='right_elbow'),
+ 3:
+ dict(
+ name='right_elbow',
+ id=3,
+ color=[255, 128, 0],
+ type='upper',
+ swap='left_elbow'),
+ 4:
+ dict(
+ name='left_wrist',
+ id=4,
+ color=[0, 255, 0],
+ type='upper',
+ swap='right_wrist'),
+ 5:
+ dict(
+ name='right_wrist',
+ id=5,
+ color=[255, 128, 0],
+ type='upper',
+ swap='left_wrist'),
+ 6:
+ dict(
+ name='left_hip',
+ id=6,
+ color=[0, 255, 0],
+ type='lower',
+ swap='right_hip'),
+ 7:
+ dict(
+ name='right_hip',
+ id=7,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_hip'),
+ 8:
+ dict(
+ name='left_knee',
+ id=8,
+ color=[0, 255, 0],
+ type='lower',
+ swap='right_knee'),
+ 9:
+ dict(
+ name='right_knee',
+ id=9,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_knee'),
+ 10:
+ dict(
+ name='left_ankle',
+ id=10,
+ color=[0, 255, 0],
+ type='lower',
+ swap='right_ankle'),
+ 11:
+ dict(
+ name='right_ankle',
+ id=11,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_ankle'),
+ 12:
+ dict(name='head', id=12, color=[51, 153, 255], type='upper', swap=''),
+ 13:
+ dict(name='neck', id=13, color=[51, 153, 255], type='upper', swap='')
+ },
+ skeleton_info={
+ 0: dict(link=('head', 'neck'), id=0, color=[51, 153, 255]),
+ 1: dict(link=('neck', 'left_shoulder'), id=1, color=[51, 153, 255]),
+ 2: dict(link=('neck', 'right_shoulder'), id=2, color=[51, 153, 255]),
+ 3: dict(link=('left_shoulder', 'left_elbow'), id=3, color=[0, 255, 0]),
+ 4: dict(link=('left_elbow', 'left_wrist'), id=4, color=[0, 255, 0]),
+ 5: dict(
+ link=('right_shoulder', 'right_elbow'), id=5, color=[255, 128, 0]),
+ 6:
+ dict(link=('right_elbow', 'right_wrist'), id=6, color=[255, 128, 0]),
+ 7: dict(link=('neck', 'right_hip'), id=7, color=[51, 153, 255]),
+ 8: dict(link=('neck', 'left_hip'), id=8, color=[51, 153, 255]),
+ 9: dict(link=('right_hip', 'right_knee'), id=9, color=[255, 128, 0]),
+ 10:
+ dict(link=('right_knee', 'right_ankle'), id=10, color=[255, 128, 0]),
+ 11: dict(link=('left_hip', 'left_knee'), id=11, color=[0, 255, 0]),
+ 12: dict(link=('left_knee', 'left_ankle'), id=12, color=[0, 255, 0]),
+ },
+ joint_weights=[
+ 0.2, 0.2, 0.2, 1.3, 1.5, 0.2, 1.3, 1.5, 0.2, 0.2, 0.5, 0.2, 0.2, 0.5
+ ],
+ sigmas=[
+ 0.079, 0.079, 0.072, 0.072, 0.062, 0.062, 0.107, 0.107, 0.087, 0.087,
+ 0.089, 0.089, 0.079, 0.079
+ ])
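+
+# Illustrative note (not part of the released meta file): dataset configs
+# reuse this metainfo through `from_file`, e.g.
+#   metainfo=dict(from_file='configs/_base_/datasets/exlpose.py')
+# inside the corresponding dataset definition.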
diff --git a/configs/_base_/datasets/h3wb.py b/configs/_base_/datasets/h3wb.py
new file mode 100644
index 0000000000..bb47a1b3f5
--- /dev/null
+++ b/configs/_base_/datasets/h3wb.py
@@ -0,0 +1,1151 @@
+dataset_info = dict(
+ dataset_name='h3wb',
+ paper_info=dict(
+ author='Yue Zhu, Nermin Samet, David Picard',
+ title='H3WB: Human3.6M 3D WholeBody Dataset and Benchmark',
+ container='International Conf. on Computer Vision (ICCV)',
+ year='2023',
+ homepage='https://github.com/wholebody3d/wholebody3d',
+ ),
+ keypoint_info={
+ 0:
+ dict(name='nose', id=0, color=[51, 153, 255], type='upper', swap=''),
+ 1:
+ dict(
+ name='left_eye',
+ id=1,
+ color=[51, 153, 255],
+ type='upper',
+ swap='right_eye'),
+ 2:
+ dict(
+ name='right_eye',
+ id=2,
+ color=[51, 153, 255],
+ type='upper',
+ swap='left_eye'),
+ 3:
+ dict(
+ name='left_ear',
+ id=3,
+ color=[51, 153, 255],
+ type='upper',
+ swap='right_ear'),
+ 4:
+ dict(
+ name='right_ear',
+ id=4,
+ color=[51, 153, 255],
+ type='upper',
+ swap='left_ear'),
+ 5:
+ dict(
+ name='left_shoulder',
+ id=5,
+ color=[0, 255, 0],
+ type='upper',
+ swap='right_shoulder'),
+ 6:
+ dict(
+ name='right_shoulder',
+ id=6,
+ color=[255, 128, 0],
+ type='upper',
+ swap='left_shoulder'),
+ 7:
+ dict(
+ name='left_elbow',
+ id=7,
+ color=[0, 255, 0],
+ type='upper',
+ swap='right_elbow'),
+ 8:
+ dict(
+ name='right_elbow',
+ id=8,
+ color=[255, 128, 0],
+ type='upper',
+ swap='left_elbow'),
+ 9:
+ dict(
+ name='left_wrist',
+ id=9,
+ color=[0, 255, 0],
+ type='upper',
+ swap='right_wrist'),
+ 10:
+ dict(
+ name='right_wrist',
+ id=10,
+ color=[255, 128, 0],
+ type='upper',
+ swap='left_wrist'),
+ 11:
+ dict(
+ name='left_hip',
+ id=11,
+ color=[0, 255, 0],
+ type='lower',
+ swap='right_hip'),
+ 12:
+ dict(
+ name='right_hip',
+ id=12,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_hip'),
+ 13:
+ dict(
+ name='left_knee',
+ id=13,
+ color=[0, 255, 0],
+ type='lower',
+ swap='right_knee'),
+ 14:
+ dict(
+ name='right_knee',
+ id=14,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_knee'),
+ 15:
+ dict(
+ name='left_ankle',
+ id=15,
+ color=[0, 255, 0],
+ type='lower',
+ swap='right_ankle'),
+ 16:
+ dict(
+ name='right_ankle',
+ id=16,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_ankle'),
+ 17:
+ dict(
+ name='left_big_toe',
+ id=17,
+ color=[255, 128, 0],
+ type='lower',
+ swap='right_big_toe'),
+ 18:
+ dict(
+ name='left_small_toe',
+ id=18,
+ color=[255, 128, 0],
+ type='lower',
+ swap='right_small_toe'),
+ 19:
+ dict(
+ name='left_heel',
+ id=19,
+ color=[255, 128, 0],
+ type='lower',
+ swap='right_heel'),
+ 20:
+ dict(
+ name='right_big_toe',
+ id=20,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_big_toe'),
+ 21:
+ dict(
+ name='right_small_toe',
+ id=21,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_small_toe'),
+ 22:
+ dict(
+ name='right_heel',
+ id=22,
+ color=[255, 128, 0],
+ type='lower',
+ swap='left_heel'),
+ 23:
+ dict(
+ name='face-0',
+ id=23,
+ color=[255, 255, 255],
+ type='',
+ swap='face-16'),
+ 24:
+ dict(
+ name='face-1',
+ id=24,
+ color=[255, 255, 255],
+ type='',
+ swap='face-15'),
+ 25:
+ dict(
+ name='face-2',
+ id=25,
+ color=[255, 255, 255],
+ type='',
+ swap='face-14'),
+ 26:
+ dict(
+ name='face-3',
+ id=26,
+ color=[255, 255, 255],
+ type='',
+ swap='face-13'),
+ 27:
+ dict(
+ name='face-4',
+ id=27,
+ color=[255, 255, 255],
+ type='',
+ swap='face-12'),
+ 28:
+ dict(
+ name='face-5',
+ id=28,
+ color=[255, 255, 255],
+ type='',
+ swap='face-11'),
+ 29:
+ dict(
+ name='face-6',
+ id=29,
+ color=[255, 255, 255],
+ type='',
+ swap='face-10'),
+ 30:
+ dict(
+ name='face-7',
+ id=30,
+ color=[255, 255, 255],
+ type='',
+ swap='face-9'),
+ 31:
+ dict(name='face-8', id=31, color=[255, 255, 255], type='', swap=''),
+ 32:
+ dict(
+ name='face-9',
+ id=32,
+ color=[255, 255, 255],
+ type='',
+ swap='face-7'),
+ 33:
+ dict(
+ name='face-10',
+ id=33,
+ color=[255, 255, 255],
+ type='',
+ swap='face-6'),
+ 34:
+ dict(
+ name='face-11',
+ id=34,
+ color=[255, 255, 255],
+ type='',
+ swap='face-5'),
+ 35:
+ dict(
+ name='face-12',
+ id=35,
+ color=[255, 255, 255],
+ type='',
+ swap='face-4'),
+ 36:
+ dict(
+ name='face-13',
+ id=36,
+ color=[255, 255, 255],
+ type='',
+ swap='face-3'),
+ 37:
+ dict(
+ name='face-14',
+ id=37,
+ color=[255, 255, 255],
+ type='',
+ swap='face-2'),
+ 38:
+ dict(
+ name='face-15',
+ id=38,
+ color=[255, 255, 255],
+ type='',
+ swap='face-1'),
+ 39:
+ dict(
+ name='face-16',
+ id=39,
+ color=[255, 255, 255],
+ type='',
+ swap='face-0'),
+ 40:
+ dict(
+ name='face-17',
+ id=40,
+ color=[255, 255, 255],
+ type='',
+ swap='face-26'),
+ 41:
+ dict(
+ name='face-18',
+ id=41,
+ color=[255, 255, 255],
+ type='',
+ swap='face-25'),
+ 42:
+ dict(
+ name='face-19',
+ id=42,
+ color=[255, 255, 255],
+ type='',
+ swap='face-24'),
+ 43:
+ dict(
+ name='face-20',
+ id=43,
+ color=[255, 255, 255],
+ type='',
+ swap='face-23'),
+ 44:
+ dict(
+ name='face-21',
+ id=44,
+ color=[255, 255, 255],
+ type='',
+ swap='face-22'),
+ 45:
+ dict(
+ name='face-22',
+ id=45,
+ color=[255, 255, 255],
+ type='',
+ swap='face-21'),
+ 46:
+ dict(
+ name='face-23',
+ id=46,
+ color=[255, 255, 255],
+ type='',
+ swap='face-20'),
+ 47:
+ dict(
+ name='face-24',
+ id=47,
+ color=[255, 255, 255],
+ type='',
+ swap='face-19'),
+ 48:
+ dict(
+ name='face-25',
+ id=48,
+ color=[255, 255, 255],
+ type='',
+ swap='face-18'),
+ 49:
+ dict(
+ name='face-26',
+ id=49,
+ color=[255, 255, 255],
+ type='',
+ swap='face-17'),
+ 50:
+ dict(name='face-27', id=50, color=[255, 255, 255], type='', swap=''),
+ 51:
+ dict(name='face-28', id=51, color=[255, 255, 255], type='', swap=''),
+ 52:
+ dict(name='face-29', id=52, color=[255, 255, 255], type='', swap=''),
+ 53:
+ dict(name='face-30', id=53, color=[255, 255, 255], type='', swap=''),
+ 54:
+ dict(
+ name='face-31',
+ id=54,
+ color=[255, 255, 255],
+ type='',
+ swap='face-35'),
+ 55:
+ dict(
+ name='face-32',
+ id=55,
+ color=[255, 255, 255],
+ type='',
+ swap='face-34'),
+ 56:
+ dict(name='face-33', id=56, color=[255, 255, 255], type='', swap=''),
+ 57:
+ dict(
+ name='face-34',
+ id=57,
+ color=[255, 255, 255],
+ type='',
+ swap='face-32'),
+ 58:
+ dict(
+ name='face-35',
+ id=58,
+ color=[255, 255, 255],
+ type='',
+ swap='face-31'),
+ 59:
+ dict(
+ name='face-36',
+ id=59,
+ color=[255, 255, 255],
+ type='',
+ swap='face-45'),
+ 60:
+ dict(
+ name='face-37',
+ id=60,
+ color=[255, 255, 255],
+ type='',
+ swap='face-44'),
+ 61:
+ dict(
+ name='face-38',
+ id=61,
+ color=[255, 255, 255],
+ type='',
+ swap='face-43'),
+ 62:
+ dict(
+ name='face-39',
+ id=62,
+ color=[255, 255, 255],
+ type='',
+ swap='face-42'),
+ 63:
+ dict(
+ name='face-40',
+ id=63,
+ color=[255, 255, 255],
+ type='',
+ swap='face-47'),
+ 64:
+ dict(
+ name='face-41',
+ id=64,
+ color=[255, 255, 255],
+ type='',
+ swap='face-46'),
+ 65:
+ dict(
+ name='face-42',
+ id=65,
+ color=[255, 255, 255],
+ type='',
+ swap='face-39'),
+ 66:
+ dict(
+ name='face-43',
+ id=66,
+ color=[255, 255, 255],
+ type='',
+ swap='face-38'),
+ 67:
+ dict(
+ name='face-44',
+ id=67,
+ color=[255, 255, 255],
+ type='',
+ swap='face-37'),
+ 68:
+ dict(
+ name='face-45',
+ id=68,
+ color=[255, 255, 255],
+ type='',
+ swap='face-36'),
+ 69:
+ dict(
+ name='face-46',
+ id=69,
+ color=[255, 255, 255],
+ type='',
+ swap='face-41'),
+ 70:
+ dict(
+ name='face-47',
+ id=70,
+ color=[255, 255, 255],
+ type='',
+ swap='face-40'),
+ 71:
+ dict(
+ name='face-48',
+ id=71,
+ color=[255, 255, 255],
+ type='',
+ swap='face-54'),
+ 72:
+ dict(
+ name='face-49',
+ id=72,
+ color=[255, 255, 255],
+ type='',
+ swap='face-53'),
+ 73:
+ dict(
+ name='face-50',
+ id=73,
+ color=[255, 255, 255],
+ type='',
+ swap='face-52'),
+ 74:
+ dict(name='face-51', id=74, color=[255, 255, 255], type='', swap=''),
+ 75:
+ dict(
+ name='face-52',
+ id=75,
+ color=[255, 255, 255],
+ type='',
+ swap='face-50'),
+ 76:
+ dict(
+ name='face-53',
+ id=76,
+ color=[255, 255, 255],
+ type='',
+ swap='face-49'),
+ 77:
+ dict(
+ name='face-54',
+ id=77,
+ color=[255, 255, 255],
+ type='',
+ swap='face-48'),
+ 78:
+ dict(
+ name='face-55',
+ id=78,
+ color=[255, 255, 255],
+ type='',
+ swap='face-59'),
+ 79:
+ dict(
+ name='face-56',
+ id=79,
+ color=[255, 255, 255],
+ type='',
+ swap='face-58'),
+ 80:
+ dict(name='face-57', id=80, color=[255, 255, 255], type='', swap=''),
+ 81:
+ dict(
+ name='face-58',
+ id=81,
+ color=[255, 255, 255],
+ type='',
+ swap='face-56'),
+ 82:
+ dict(
+ name='face-59',
+ id=82,
+ color=[255, 255, 255],
+ type='',
+ swap='face-55'),
+ 83:
+ dict(
+ name='face-60',
+ id=83,
+ color=[255, 255, 255],
+ type='',
+ swap='face-64'),
+ 84:
+ dict(
+ name='face-61',
+ id=84,
+ color=[255, 255, 255],
+ type='',
+ swap='face-63'),
+ 85:
+ dict(name='face-62', id=85, color=[255, 255, 255], type='', swap=''),
+ 86:
+ dict(
+ name='face-63',
+ id=86,
+ color=[255, 255, 255],
+ type='',
+ swap='face-61'),
+ 87:
+ dict(
+ name='face-64',
+ id=87,
+ color=[255, 255, 255],
+ type='',
+ swap='face-60'),
+ 88:
+ dict(
+ name='face-65',
+ id=88,
+ color=[255, 255, 255],
+ type='',
+ swap='face-67'),
+ 89:
+ dict(name='face-66', id=89, color=[255, 255, 255], type='', swap=''),
+ 90:
+ dict(
+ name='face-67',
+ id=90,
+ color=[255, 255, 255],
+ type='',
+ swap='face-65'),
+ 91:
+ dict(
+ name='left_hand_root',
+ id=91,
+ color=[255, 255, 255],
+ type='',
+ swap='right_hand_root'),
+ 92:
+ dict(
+ name='left_thumb1',
+ id=92,
+ color=[255, 128, 0],
+ type='',
+ swap='right_thumb1'),
+ 93:
+ dict(
+ name='left_thumb2',
+ id=93,
+ color=[255, 128, 0],
+ type='',
+ swap='right_thumb2'),
+ 94:
+ dict(
+ name='left_thumb3',
+ id=94,
+ color=[255, 128, 0],
+ type='',
+ swap='right_thumb3'),
+ 95:
+ dict(
+ name='left_thumb4',
+ id=95,
+ color=[255, 128, 0],
+ type='',
+ swap='right_thumb4'),
+ 96:
+ dict(
+ name='left_forefinger1',
+ id=96,
+ color=[255, 153, 255],
+ type='',
+ swap='right_forefinger1'),
+ 97:
+ dict(
+ name='left_forefinger2',
+ id=97,
+ color=[255, 153, 255],
+ type='',
+ swap='right_forefinger2'),
+ 98:
+ dict(
+ name='left_forefinger3',
+ id=98,
+ color=[255, 153, 255],
+ type='',
+ swap='right_forefinger3'),
+ 99:
+ dict(
+ name='left_forefinger4',
+ id=99,
+ color=[255, 153, 255],
+ type='',
+ swap='right_forefinger4'),
+ 100:
+ dict(
+ name='left_middle_finger1',
+ id=100,
+ color=[102, 178, 255],
+ type='',
+ swap='right_middle_finger1'),
+ 101:
+ dict(
+ name='left_middle_finger2',
+ id=101,
+ color=[102, 178, 255],
+ type='',
+ swap='right_middle_finger2'),
+ 102:
+ dict(
+ name='left_middle_finger3',
+ id=102,
+ color=[102, 178, 255],
+ type='',
+ swap='right_middle_finger3'),
+ 103:
+ dict(
+ name='left_middle_finger4',
+ id=103,
+ color=[102, 178, 255],
+ type='',
+ swap='right_middle_finger4'),
+ 104:
+ dict(
+ name='left_ring_finger1',
+ id=104,
+ color=[255, 51, 51],
+ type='',
+ swap='right_ring_finger1'),
+ 105:
+ dict(
+ name='left_ring_finger2',
+ id=105,
+ color=[255, 51, 51],
+ type='',
+ swap='right_ring_finger2'),
+ 106:
+ dict(
+ name='left_ring_finger3',
+ id=106,
+ color=[255, 51, 51],
+ type='',
+ swap='right_ring_finger3'),
+ 107:
+ dict(
+ name='left_ring_finger4',
+ id=107,
+ color=[255, 51, 51],
+ type='',
+ swap='right_ring_finger4'),
+ 108:
+ dict(
+ name='left_pinky_finger1',
+ id=108,
+ color=[0, 255, 0],
+ type='',
+ swap='right_pinky_finger1'),
+ 109:
+ dict(
+ name='left_pinky_finger2',
+ id=109,
+ color=[0, 255, 0],
+ type='',
+ swap='right_pinky_finger2'),
+ 110:
+ dict(
+ name='left_pinky_finger3',
+ id=110,
+ color=[0, 255, 0],
+ type='',
+ swap='right_pinky_finger3'),
+ 111:
+ dict(
+ name='left_pinky_finger4',
+ id=111,
+ color=[0, 255, 0],
+ type='',
+ swap='right_pinky_finger4'),
+ 112:
+ dict(
+ name='right_hand_root',
+ id=112,
+ color=[255, 255, 255],
+ type='',
+ swap='left_hand_root'),
+ 113:
+ dict(
+ name='right_thumb1',
+ id=113,
+ color=[255, 128, 0],
+ type='',
+ swap='left_thumb1'),
+ 114:
+ dict(
+ name='right_thumb2',
+ id=114,
+ color=[255, 128, 0],
+ type='',
+ swap='left_thumb2'),
+ 115:
+ dict(
+ name='right_thumb3',
+ id=115,
+ color=[255, 128, 0],
+ type='',
+ swap='left_thumb3'),
+ 116:
+ dict(
+ name='right_thumb4',
+ id=116,
+ color=[255, 128, 0],
+ type='',
+ swap='left_thumb4'),
+ 117:
+ dict(
+ name='right_forefinger1',
+ id=117,
+ color=[255, 153, 255],
+ type='',
+ swap='left_forefinger1'),
+ 118:
+ dict(
+ name='right_forefinger2',
+ id=118,
+ color=[255, 153, 255],
+ type='',
+ swap='left_forefinger2'),
+ 119:
+ dict(
+ name='right_forefinger3',
+ id=119,
+ color=[255, 153, 255],
+ type='',
+ swap='left_forefinger3'),
+ 120:
+ dict(
+ name='right_forefinger4',
+ id=120,
+ color=[255, 153, 255],
+ type='',
+ swap='left_forefinger4'),
+ 121:
+ dict(
+ name='right_middle_finger1',
+ id=121,
+ color=[102, 178, 255],
+ type='',
+ swap='left_middle_finger1'),
+ 122:
+ dict(
+ name='right_middle_finger2',
+ id=122,
+ color=[102, 178, 255],
+ type='',
+ swap='left_middle_finger2'),
+ 123:
+ dict(
+ name='right_middle_finger3',
+ id=123,
+ color=[102, 178, 255],
+ type='',
+ swap='left_middle_finger3'),
+ 124:
+ dict(
+ name='right_middle_finger4',
+ id=124,
+ color=[102, 178, 255],
+ type='',
+ swap='left_middle_finger4'),
+ 125:
+ dict(
+ name='right_ring_finger1',
+ id=125,
+ color=[255, 51, 51],
+ type='',
+ swap='left_ring_finger1'),
+ 126:
+ dict(
+ name='right_ring_finger2',
+ id=126,
+ color=[255, 51, 51],
+ type='',
+ swap='left_ring_finger2'),
+ 127:
+ dict(
+ name='right_ring_finger3',
+ id=127,
+ color=[255, 51, 51],
+ type='',
+ swap='left_ring_finger3'),
+ 128:
+ dict(
+ name='right_ring_finger4',
+ id=128,
+ color=[255, 51, 51],
+ type='',
+ swap='left_ring_finger4'),
+ 129:
+ dict(
+ name='right_pinky_finger1',
+ id=129,
+ color=[0, 255, 0],
+ type='',
+ swap='left_pinky_finger1'),
+ 130:
+ dict(
+ name='right_pinky_finger2',
+ id=130,
+ color=[0, 255, 0],
+ type='',
+ swap='left_pinky_finger2'),
+ 131:
+ dict(
+ name='right_pinky_finger3',
+ id=131,
+ color=[0, 255, 0],
+ type='',
+ swap='left_pinky_finger3'),
+ 132:
+ dict(
+ name='right_pinky_finger4',
+ id=132,
+ color=[0, 255, 0],
+ type='',
+ swap='left_pinky_finger4')
+ },
+ skeleton_info={
+ 0:
+ dict(link=('left_ankle', 'left_knee'), id=0, color=[0, 255, 0]),
+ 1:
+ dict(link=('left_knee', 'left_hip'), id=1, color=[0, 255, 0]),
+ 2:
+ dict(link=('right_ankle', 'right_knee'), id=2, color=[255, 128, 0]),
+ 3:
+ dict(link=('right_knee', 'right_hip'), id=3, color=[255, 128, 0]),
+ 4:
+ dict(link=('left_hip', 'right_hip'), id=4, color=[51, 153, 255]),
+ 5:
+ dict(link=('left_shoulder', 'left_hip'), id=5, color=[51, 153, 255]),
+ 6:
+ dict(link=('right_shoulder', 'right_hip'), id=6, color=[51, 153, 255]),
+ 7:
+ dict(
+ link=('left_shoulder', 'right_shoulder'),
+ id=7,
+ color=[51, 153, 255]),
+ 8:
+ dict(link=('left_shoulder', 'left_elbow'), id=8, color=[0, 255, 0]),
+ 9:
+ dict(
+ link=('right_shoulder', 'right_elbow'), id=9, color=[255, 128, 0]),
+ 10:
+ dict(link=('left_elbow', 'left_wrist'), id=10, color=[0, 255, 0]),
+ 11:
+ dict(link=('right_elbow', 'right_wrist'), id=11, color=[255, 128, 0]),
+ 12:
+ dict(link=('left_eye', 'right_eye'), id=12, color=[51, 153, 255]),
+ 13:
+ dict(link=('nose', 'left_eye'), id=13, color=[51, 153, 255]),
+ 14:
+ dict(link=('nose', 'right_eye'), id=14, color=[51, 153, 255]),
+ 15:
+ dict(link=('left_eye', 'left_ear'), id=15, color=[51, 153, 255]),
+ 16:
+ dict(link=('right_eye', 'right_ear'), id=16, color=[51, 153, 255]),
+ 17:
+ dict(link=('left_ear', 'left_shoulder'), id=17, color=[51, 153, 255]),
+ 18:
+ dict(
+ link=('right_ear', 'right_shoulder'), id=18, color=[51, 153, 255]),
+ 19:
+ dict(link=('left_ankle', 'left_big_toe'), id=19, color=[0, 255, 0]),
+ 20:
+ dict(link=('left_ankle', 'left_small_toe'), id=20, color=[0, 255, 0]),
+ 21:
+ dict(link=('left_ankle', 'left_heel'), id=21, color=[0, 255, 0]),
+ 22:
+ dict(
+ link=('right_ankle', 'right_big_toe'), id=22, color=[255, 128, 0]),
+ 23:
+ dict(
+ link=('right_ankle', 'right_small_toe'),
+ id=23,
+ color=[255, 128, 0]),
+ 24:
+ dict(link=('right_ankle', 'right_heel'), id=24, color=[255, 128, 0]),
+ 25:
+ dict(
+ link=('left_hand_root', 'left_thumb1'), id=25, color=[255, 128,
+ 0]),
+ 26:
+ dict(link=('left_thumb1', 'left_thumb2'), id=26, color=[255, 128, 0]),
+ 27:
+ dict(link=('left_thumb2', 'left_thumb3'), id=27, color=[255, 128, 0]),
+ 28:
+ dict(link=('left_thumb3', 'left_thumb4'), id=28, color=[255, 128, 0]),
+ 29:
+ dict(
+ link=('left_hand_root', 'left_forefinger1'),
+ id=29,
+ color=[255, 153, 255]),
+ 30:
+ dict(
+ link=('left_forefinger1', 'left_forefinger2'),
+ id=30,
+ color=[255, 153, 255]),
+ 31:
+ dict(
+ link=('left_forefinger2', 'left_forefinger3'),
+ id=31,
+ color=[255, 153, 255]),
+ 32:
+ dict(
+ link=('left_forefinger3', 'left_forefinger4'),
+ id=32,
+ color=[255, 153, 255]),
+ 33:
+ dict(
+ link=('left_hand_root', 'left_middle_finger1'),
+ id=33,
+ color=[102, 178, 255]),
+ 34:
+ dict(
+ link=('left_middle_finger1', 'left_middle_finger2'),
+ id=34,
+ color=[102, 178, 255]),
+ 35:
+ dict(
+ link=('left_middle_finger2', 'left_middle_finger3'),
+ id=35,
+ color=[102, 178, 255]),
+ 36:
+ dict(
+ link=('left_middle_finger3', 'left_middle_finger4'),
+ id=36,
+ color=[102, 178, 255]),
+ 37:
+ dict(
+ link=('left_hand_root', 'left_ring_finger1'),
+ id=37,
+ color=[255, 51, 51]),
+ 38:
+ dict(
+ link=('left_ring_finger1', 'left_ring_finger2'),
+ id=38,
+ color=[255, 51, 51]),
+ 39:
+ dict(
+ link=('left_ring_finger2', 'left_ring_finger3'),
+ id=39,
+ color=[255, 51, 51]),
+ 40:
+ dict(
+ link=('left_ring_finger3', 'left_ring_finger4'),
+ id=40,
+ color=[255, 51, 51]),
+ 41:
+ dict(
+ link=('left_hand_root', 'left_pinky_finger1'),
+ id=41,
+ color=[0, 255, 0]),
+ 42:
+ dict(
+ link=('left_pinky_finger1', 'left_pinky_finger2'),
+ id=42,
+ color=[0, 255, 0]),
+ 43:
+ dict(
+ link=('left_pinky_finger2', 'left_pinky_finger3'),
+ id=43,
+ color=[0, 255, 0]),
+ 44:
+ dict(
+ link=('left_pinky_finger3', 'left_pinky_finger4'),
+ id=44,
+ color=[0, 255, 0]),
+ 45:
+ dict(
+ link=('right_hand_root', 'right_thumb1'),
+ id=45,
+ color=[255, 128, 0]),
+ 46:
+ dict(
+ link=('right_thumb1', 'right_thumb2'), id=46, color=[255, 128, 0]),
+ 47:
+ dict(
+ link=('right_thumb2', 'right_thumb3'), id=47, color=[255, 128, 0]),
+ 48:
+ dict(
+ link=('right_thumb3', 'right_thumb4'), id=48, color=[255, 128, 0]),
+ 49:
+ dict(
+ link=('right_hand_root', 'right_forefinger1'),
+ id=49,
+ color=[255, 153, 255]),
+ 50:
+ dict(
+ link=('right_forefinger1', 'right_forefinger2'),
+ id=50,
+ color=[255, 153, 255]),
+ 51:
+ dict(
+ link=('right_forefinger2', 'right_forefinger3'),
+ id=51,
+ color=[255, 153, 255]),
+ 52:
+ dict(
+ link=('right_forefinger3', 'right_forefinger4'),
+ id=52,
+ color=[255, 153, 255]),
+ 53:
+ dict(
+ link=('right_hand_root', 'right_middle_finger1'),
+ id=53,
+ color=[102, 178, 255]),
+ 54:
+ dict(
+ link=('right_middle_finger1', 'right_middle_finger2'),
+ id=54,
+ color=[102, 178, 255]),
+ 55:
+ dict(
+ link=('right_middle_finger2', 'right_middle_finger3'),
+ id=55,
+ color=[102, 178, 255]),
+ 56:
+ dict(
+ link=('right_middle_finger3', 'right_middle_finger4'),
+ id=56,
+ color=[102, 178, 255]),
+ 57:
+ dict(
+ link=('right_hand_root', 'right_ring_finger1'),
+ id=57,
+ color=[255, 51, 51]),
+ 58:
+ dict(
+ link=('right_ring_finger1', 'right_ring_finger2'),
+ id=58,
+ color=[255, 51, 51]),
+ 59:
+ dict(
+ link=('right_ring_finger2', 'right_ring_finger3'),
+ id=59,
+ color=[255, 51, 51]),
+ 60:
+ dict(
+ link=('right_ring_finger3', 'right_ring_finger4'),
+ id=60,
+ color=[255, 51, 51]),
+ 61:
+ dict(
+ link=('right_hand_root', 'right_pinky_finger1'),
+ id=61,
+ color=[0, 255, 0]),
+ 62:
+ dict(
+ link=('right_pinky_finger1', 'right_pinky_finger2'),
+ id=62,
+ color=[0, 255, 0]),
+ 63:
+ dict(
+ link=('right_pinky_finger2', 'right_pinky_finger3'),
+ id=63,
+ color=[0, 255, 0]),
+ 64:
+ dict(
+ link=('right_pinky_finger3', 'right_pinky_finger4'),
+ id=64,
+ color=[0, 255, 0])
+ },
+ joint_weights=[1.] * 133,
+ # 'https://github.com/jin-s13/COCO-WholeBody/blob/master/'
+ # 'evaluation/myeval_wholebody.py#L175'
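+    # H3WB follows the 133-keypoint COCO-WholeBody layout
+    # (17 body + 6 foot + 68 face + 42 hand keypoints), so the
+    # COCO-WholeBody sigmas are reused here.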
+ sigmas=[
+ 0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072, 0.062,
+ 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089, 0.068, 0.066, 0.066,
+ 0.092, 0.094, 0.094, 0.042, 0.043, 0.044, 0.043, 0.040, 0.035, 0.031,
+ 0.025, 0.020, 0.023, 0.029, 0.032, 0.037, 0.038, 0.043, 0.041, 0.045,
+ 0.013, 0.012, 0.011, 0.011, 0.012, 0.012, 0.011, 0.011, 0.013, 0.015,
+ 0.009, 0.007, 0.007, 0.007, 0.012, 0.009, 0.008, 0.016, 0.010, 0.017,
+ 0.011, 0.009, 0.011, 0.009, 0.007, 0.013, 0.008, 0.011, 0.012, 0.010,
+ 0.034, 0.008, 0.008, 0.009, 0.008, 0.008, 0.007, 0.010, 0.008, 0.009,
+ 0.009, 0.009, 0.007, 0.007, 0.008, 0.011, 0.008, 0.008, 0.008, 0.01,
+ 0.008, 0.029, 0.022, 0.035, 0.037, 0.047, 0.026, 0.025, 0.024, 0.035,
+ 0.018, 0.024, 0.022, 0.026, 0.017, 0.021, 0.021, 0.032, 0.02, 0.019,
+ 0.022, 0.031, 0.029, 0.022, 0.035, 0.037, 0.047, 0.026, 0.025, 0.024,
+ 0.035, 0.018, 0.024, 0.022, 0.026, 0.017, 0.021, 0.021, 0.032, 0.02,
+ 0.019, 0.022, 0.031
+ ])
diff --git a/configs/body_2d_keypoint/rtmo/README.md b/configs/body_2d_keypoint/rtmo/README.md
new file mode 100644
index 0000000000..7480e92ee7
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/README.md
@@ -0,0 +1,27 @@
+# RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
+
+<!-- [ALGORITHM] -->
+
+<details>
+<summary><a href="https://arxiv.org/abs/2312.07526">RTMO</a></summary>
+
+```bibtex
+@misc{lu2023rtmo,
+ title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
+ author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
+ year={2023},
+ eprint={2312.07526},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+</details>
+
+RTMO is a one-stage pose estimation model that seamlessly integrates coordinate classification into the YOLO architecture. It introduces a Dynamic Coordinate Classifier (DCC) module that handles keypoint localization through dual 1D heatmaps. The DCC employs dynamic bin allocation, localizing the coordinate bins to each predicted bounding box to improve efficiency. It also uses learnable bin representations based on positional encodings, enabling computation of bin-keypoint similarity for precise localization.
+
+RTMO is trained end-to-end using a multi-task loss, with losses for bounding box regression, keypoint heatmap classification via a novel MLE loss, keypoint coordinate proxy regression, and keypoint visibility classification. The MLE loss models annotation uncertainty and balances optimization between easy and hard samples.
+
+During inference, RTMO employs grid-based dense predictions to simultaneously output human detection boxes and poses in a single pass. It selectively decodes heatmaps only for high-scoring grids after NMS, minimizing computational cost.
+
+Compared to prior one-stage methods that regress keypoint coordinates directly, RTMO achieves higher accuracy through coordinate classification while retaining real-time speeds. It also outperforms lightweight top-down approaches for images with many people, as the latter have inference times that scale linearly with the number of human instances.
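+
+Below is a minimal, self-contained sketch of the coordinate-classification idea described above: decoding one keypoint from a pair of 1D heatmaps whose bins are laid out inside the predicted person box. It is illustrative only; the function and variable names are ours, and the actual `RTMOHead` uses learnable bin representations, expanded bins and further post-processing that are omitted here.
+
+```python
+import numpy as np
+
+
+def decode_keypoint(x_scores, y_scores, bbox):
+    """Decode one keypoint from dual 1D heatmaps (conceptual sketch).
+
+    x_scores, y_scores: 1D arrays of bin scores along the x and y axes,
+        e.g. bin-keypoint similarities produced by a DCC-like module.
+    bbox: (x1, y1, x2, y2) predicted person box; the bins are placed
+        inside this box, which is the "dynamic" part of the DCC.
+    """
+    x1, y1, x2, y2 = bbox
+    # Place bin centers uniformly inside the predicted box.
+    x_bins = np.linspace(x1, x2, num=len(x_scores))
+    y_bins = np.linspace(y1, y2, num=len(y_scores))
+    # Soft-argmax: expectation of bin locations under a softmax distribution.
+    px = np.exp(x_scores - x_scores.max())
+    px /= px.sum()
+    py = np.exp(y_scores - y_scores.max())
+    py /= py.sum()
+    return float(px @ x_bins), float(py @ y_bins)
+
+
+# Example: the x-heatmap peaks at the middle bin; the y-heatmap peaks
+# toward the lower part of the box.
+print(decode_keypoint(np.array([0., 1., 4., 1., 0.]),
+                      np.array([0., 0., 1., 3., 1.]),
+                      bbox=(100, 50, 200, 300)))
+```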
diff --git a/configs/body_2d_keypoint/rtmo/body7/rtmo-l_16xb16-600e_body7-640x640.py b/configs/body_2d_keypoint/rtmo/body7/rtmo-l_16xb16-600e_body7-640x640.py
new file mode 100644
index 0000000000..45e4295c6c
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/body7/rtmo-l_16xb16-600e_body7-640x640.py
@@ -0,0 +1,533 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
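+# '16xb16' in the file name means 16 GPUs x 16 samples per GPU, i.e. an
+# effective batch size of 256, matching the base_batch_size used below as
+# the reference for automatic learning-rate scaling.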
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+# data settings
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# mapping
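+# Each tuple below is (source dataset keypoint index, COCO keypoint index).
+# KeypointConverter uses these mappings to remap every auxiliary dataset
+# onto the 17-keypoint COCO layout so the datasets can be combined.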
+aic_coco = [
+ (0, 6),
+ (1, 8),
+ (2, 10),
+ (3, 5),
+ (4, 7),
+ (5, 9),
+ (6, 12),
+ (7, 14),
+ (8, 16),
+ (9, 11),
+ (10, 13),
+ (11, 15),
+]
+
+crowdpose_coco = [
+ (0, 5),
+ (1, 6),
+ (2, 7),
+ (3, 8),
+ (4, 9),
+ (5, 10),
+ (6, 11),
+ (7, 12),
+ (8, 13),
+ (9, 14),
+ (10, 15),
+ (11, 16),
+]
+
+mpii_coco = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+ochuman_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+posetrack_coco = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=17,
+ mapping=[(i, i) for i in range(17)])
+ ],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=aic_coco)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=crowdpose_coco)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=mpii_coco)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=jhmdb_coco)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=halpe_coco)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=posetrack_coco)
+ ],
+)
+
+train_dataset = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file=metafile),
+ datasets=[
+ dataset_coco,
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_halpe,
+ dataset_posetrack,
+ ],
+ sample_ratio_factor=[1, 0.3, 0.5, 0.3, 0.3, 0.4, 0.3],
+ test_mode=False,
+ pipeline=train_pipeline_stage1)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=train_dataset)
+
+# val datasets
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
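+# Training is staged: YOLOXPoseModeSwitchHook switches to COCO-only data and
+# the lighter stage-2 pipeline (no Mosaic/MixUp) for the final 20 epochs,
+# while RTMOModeSwitchHook adjusts the head's loss weights at epoch 280,
+# in step with the second cosine annealing stage of the LR schedule.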
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_dataset=dataset_coco,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 1.0
+deepen_factor = 1.0
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_l_8x8_300e_coco/yolox_l_8x8_300e_coco'
+ '_20211126_140236-d3bd2b23.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[256, 512, 1024],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=512,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=512,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=512,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-2,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py b/configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py
new file mode 100644
index 0000000000..6c1a005366
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py
@@ -0,0 +1,532 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+# data settings
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# mapping
+aic_coco = [
+ (0, 6),
+ (1, 8),
+ (2, 10),
+ (3, 5),
+ (4, 7),
+ (5, 9),
+ (6, 12),
+ (7, 14),
+ (8, 16),
+ (9, 11),
+ (10, 13),
+ (11, 15),
+]
+
+crowdpose_coco = [
+ (0, 5),
+ (1, 6),
+ (2, 7),
+ (3, 8),
+ (4, 9),
+ (5, 10),
+ (6, 11),
+ (7, 12),
+ (8, 13),
+ (9, 14),
+ (10, 15),
+ (11, 16),
+]
+
+mpii_coco = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+ochuman_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+posetrack_coco = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=17,
+ mapping=[(i, i) for i in range(17)])
+ ],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=aic_coco)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=crowdpose_coco)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=mpii_coco)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=jhmdb_coco)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=halpe_coco)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=posetrack_coco)
+ ],
+)
+
+train_dataset = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file=metafile),
+ datasets=[
+ dataset_coco,
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_halpe,
+ dataset_posetrack,
+ ],
+ sample_ratio_factor=[1, 0.3, 0.5, 0.3, 0.3, 0.4, 0.3],
+ test_mode=False,
+ pipeline=train_pipeline_stage1)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=train_dataset)
+
+# val datasets
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_dataset=dataset_coco,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.75
+deepen_factor = 0.67
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'pretrained_models/yolox_m_8x8_300e_coco_20230829.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[192, 384, 768],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=384,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=384,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=384,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-2,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py b/configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py
new file mode 100644
index 0000000000..83d7c21d8a
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py
@@ -0,0 +1,535 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
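+# the base lr of 0.004 below assumes a total batch size of 256 (8 GPUs x 32 samples);
+# when auto-scale-lr is enabled, MMEngine rescales it linearly for other batch sizes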
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
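+# Stage 1 applies strong augmentation (Mosaic, MixUp, large random affine);
+# YOLOXPoseModeSwitchHook below swaps in the lighter stage-2 pipeline for the
+# last 20 epochs of training.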
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+# data settings
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# mapping
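+# each (src_idx, coco_idx) pair tells KeypointConverter which keypoint index of
+# the source dataset corresponds to which COCO keypoint index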
+aic_coco = [
+ (0, 6),
+ (1, 8),
+ (2, 10),
+ (3, 5),
+ (4, 7),
+ (5, 9),
+ (6, 12),
+ (7, 14),
+ (8, 16),
+ (9, 11),
+ (10, 13),
+ (11, 15),
+]
+
+crowdpose_coco = [
+ (0, 5),
+ (1, 6),
+ (2, 7),
+ (3, 8),
+ (4, 9),
+ (5, 10),
+ (6, 11),
+ (7, 12),
+ (8, 13),
+ (9, 14),
+ (10, 15),
+ (11, 16),
+]
+
+mpii_coco = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+ochuman_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+posetrack_coco = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=17,
+ mapping=[(i, i) for i in range(17)])
+ ],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=aic_coco)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=crowdpose_coco)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=mpii_coco)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=jhmdb_coco)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=halpe_coco)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=posetrack_coco)
+ ],
+)
+
+train_dataset = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file=metafile),
+ datasets=[
+ dataset_coco,
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_halpe,
+ dataset_posetrack,
+ ],
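+    # per-dataset sampling ratios, in the same order as the `datasets` list above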
+ sample_ratio_factor=[1, 0.3, 0.5, 0.3, 0.3, 0.4, 0.3],
+ test_mode=False,
+ pipeline=train_pipeline_stage1)
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=train_dataset)
+
+# val datasets
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_dataset=dataset_coco,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.5
+deepen_factor = 0.33
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_s_8x8_300e_coco/yolox_s_8x8_300e_coco_'
+ '20211121_095711-4592a793.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[128, 256, 512],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=256,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=256,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile),
+ use_keypoints_for_center=True),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=256,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1.0,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/body7/rtmo-t_8xb32-600e_body7-416x416.py b/configs/body_2d_keypoint/rtmo/body7/rtmo-t_8xb32-600e_body7-416x416.py
new file mode 100644
index 0000000000..566fe34455
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/body7/rtmo-t_8xb32-600e_body7-416x416.py
@@ -0,0 +1,529 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (416, 416)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(416, 416),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(416, 416),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(416, 416),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+# data settings
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# mapping
+aic_coco = [
+ (0, 6),
+ (1, 8),
+ (2, 10),
+ (3, 5),
+ (4, 7),
+ (5, 9),
+ (6, 12),
+ (7, 14),
+ (8, 16),
+ (9, 11),
+ (10, 13),
+ (11, 15),
+]
+
+crowdpose_coco = [
+ (0, 5),
+ (1, 6),
+ (2, 7),
+ (3, 8),
+ (4, 9),
+ (5, 10),
+ (6, 11),
+ (7, 12),
+ (8, 13),
+ (9, 14),
+ (10, 15),
+ (11, 16),
+]
+
+mpii_coco = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+ochuman_coco = [
+ (0, 0),
+ (1, 1),
+ (2, 2),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+posetrack_coco = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=17,
+ mapping=[(i, i) for i in range(17)])
+ ],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=aic_coco)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=crowdpose_coco)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=mpii_coco)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=jhmdb_coco)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(type='KeypointConverter', num_keypoints=17, mapping=halpe_coco)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=17, mapping=posetrack_coco)
+ ],
+)
+
+train_dataset = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file=metafile),
+ datasets=[
+ dataset_coco,
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_halpe,
+ dataset_posetrack,
+ ],
+ sample_ratio_factor=[1, 0.3, 0.5, 0.3, 0.3, 0.4, 0.3],
+ test_mode=False,
+ pipeline=train_pipeline_stage1)
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=train_dataset)
+
+# val datasets
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_dataset=dataset_coco,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.375
+deepen_factor = 0.33
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(320, 640),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_tiny_8x8_300e_coco/yolox_tiny_8x8_300e_coco_'
+ '20211124_171234-b4047906.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[96, 192, 384],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=192,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=192,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile),
+ use_keypoints_for_center=True),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=192,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1.0,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/body7/rtmo_body7.md b/configs/body_2d_keypoint/rtmo/body7/rtmo_body7.md
new file mode 100644
index 0000000000..e6174b2942
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/body7/rtmo_body7.md
@@ -0,0 +1,132 @@
+
+
+
+RTMO
+
+```bibtex
+@misc{lu2023rtmo,
+ title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
+ author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
+ year={2023},
+ eprint={2312.07526},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+
+
+
+
+
+AI Challenger (ArXiv'2017)
+
+```bibtex
+@article{wu2017ai,
+ title={Ai challenger: A large-scale dataset for going deeper in image understanding},
+ author={Wu, Jiahong and Zheng, He and Zhao, Bo and Li, Yixin and Yan, Baoming and Liang, Rui and Wang, Wenjia and Zhou, Shipei and Lin, Guosen and Fu, Yanwei and others},
+ journal={arXiv preprint arXiv:1711.06475},
+ year={2017}
+}
+```
+
+
+
+
+COCO (ECCV'2014)
+
+```bibtex
+@inproceedings{lin2014microsoft,
+ title={Microsoft coco: Common objects in context},
+ author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
+ booktitle={European conference on computer vision},
+ pages={740--755},
+ year={2014},
+ organization={Springer}
+}
+```
+
+
+
+
+CrowdPose (CVPR'2019)
+
+```bibtex
+@article{li2018crowdpose,
+ title={CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark},
+ author={Li, Jiefeng and Wang, Can and Zhu, Hao and Mao, Yihuan and Fang, Hao-Shu and Lu, Cewu},
+ journal={arXiv preprint arXiv:1812.00324},
+ year={2018}
+}
+```
+
+
+
+
+JHMDB (ICCV'2013)
+
+```bibtex
+@inproceedings{Jhuang:ICCV:2013,
+ title = {Towards understanding action recognition},
+ author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
+ booktitle = {International Conf. on Computer Vision (ICCV)},
+ month = Dec,
+ pages = {3192-3199},
+ year = {2013}
+}
+```
+
+
+
+
+MPII (CVPR'2014)
+
+```bibtex
+@inproceedings{andriluka14cvpr,
+ author = {Mykhaylo Andriluka and Leonid Pishchulin and Peter Gehler and Schiele, Bernt},
+ title = {2D Human Pose Estimation: New Benchmark and State of the Art Analysis},
+ booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year = {2014},
+ month = {June}
+}
+```
+
+
+
+
+PoseTrack18 (CVPR'2018)
+
+```bibtex
+@inproceedings{andriluka2018posetrack,
+ title={Posetrack: A benchmark for human pose estimation and tracking},
+ author={Andriluka, Mykhaylo and Iqbal, Umar and Insafutdinov, Eldar and Pishchulin, Leonid and Milan, Anton and Gall, Juergen and Schiele, Bernt},
+ booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
+ pages={5167--5176},
+ year={2018}
+}
+```
+
+
+
+
+Halpe (CVPR'2020)
+
+```bibtex
+@inproceedings{li2020pastanet,
+ title={PaStaNet: Toward Human Activity Knowledge Engine},
+ author={Li, Yong-Lu and Xu, Liang and Liu, Xinpeng and Huang, Xijie and Xu, Yue and Wang, Shiyi and Fang, Hao-Shu and Ma, Ze and Chen, Mingyang and Lu, Cewu},
+ booktitle={CVPR},
+ year={2020}
+}
+```
+
+
+
+Results on COCO val2017
+
+| Arch | Input Size | AP | AP50 | AP75 | AR | AR50 | ckpt | log | onnx |
+| :--------------------------------- | :--------: | :---: | :-------------: | :-------------: | :---: | :-------------: | :--------------------------------: | :-------------------------------: | :--------------------------------: |
+| [RTMO-t](/configs/body_2d_keypoint/rtmo/body7/rtmo-t_8xb32-600e_body7-416x416.py) | 416x416 | 0.574 | 0.803 | 0.613 | 0.611 | 0.836 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-t_8xb32-600e_body7-416x416-f48f75cb_20231219.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-t_8xb32-600e_body7-416x416_20231219.json) | [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-t_8xb32-600e_body7-416x416-f48f75cb_20231219.zip) |
+| [RTMO-s](/configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py) | 640x640 | 0.686 | 0.879 | 0.744 | 0.723 | 0.908 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_body7-640x640_20231211.json) | [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.zip) |
+| [RTMO-m](/configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py) | 640x640 | 0.726 | 0.899 | 0.790 | 0.763 | 0.926 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_body7-640x640-39e78cc4_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_body7-640x640_20231211.json) | [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-m_16xb16-600e_body7-640x640-39e78cc4_20231211.zip) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/body7/rtmo-l_16xb16-600e_body7-640x640.py) | 640x640 | 0.748 | 0.911 | 0.813 | 0.786 | 0.939 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_body7-640x640-b37118ce_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_body7-640x640_20231211.json) | [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-l_16xb16-600e_body7-640x640-b37118ce_20231211.zip) |
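+
+The snippet below is a minimal inference sketch for these checkpoints, assuming an MMPose 1.x installation. It uses the high-level `MMPoseInferencer` API; the `'rtmo'` alias is expected to resolve to the RTMO-l body7 checkpoint above, and the image path is a placeholder.
+
+```python
+# Minimal RTMO inference sketch (assumes MMPose 1.x; 'rtmo' is the metafile alias
+# for the RTMO-l body7 model, and the image path is a placeholder).
+from mmpose.apis import MMPoseInferencer
+
+inferencer = MMPoseInferencer(pose2d='rtmo')  # or pass a config file plus checkpoint URL
+
+# calling the inferencer returns a generator; each item corresponds to one input image
+result = next(inferencer('path/to/image.jpg'))
+for person in result['predictions'][0]:
+    print(person['keypoints'])        # 17 (x, y) COCO keypoints
+    print(person['keypoint_scores'])  # per-keypoint confidence scores
+```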
diff --git a/configs/body_2d_keypoint/rtmo/body7/rtmo_body7.yml b/configs/body_2d_keypoint/rtmo/body7/rtmo_body7.yml
new file mode 100644
index 0000000000..6f802531bb
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/body7/rtmo_body7.yml
@@ -0,0 +1,74 @@
+Models:
+- Config: configs/body_2d_keypoint/rtmo/body7/rtmo-t_8xb32-600e_body7-416x416.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: &id001
+ - RTMO
+ Training Data: &id002
+ - AI Challenger
+ - COCO
+ - CrowdPose
+ - MPII
+ - sub-JHMDB
+ - Halpe
+ - PoseTrack18
+ Name: rtmo-t_8xb32-600e_body7-416x416
+ Results:
+ - Dataset: COCO
+ Metrics:
+ AP: 0.574
+ AP@0.5: 0.803
+ AP@0.75: 0.613
+ AR: 0.611
+ AR@0.5: 0.836
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-t_8xb32-600e_body7-416x416-f48f75cb_20231219.pth
+- Config: configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+ Training Data: *id002
+ Name: rtmo-s_8xb32-600e_body7-640x640
+ Results:
+ - Dataset: COCO
+ Metrics:
+ AP: 0.686
+ AP@0.5: 0.879
+ AP@0.75: 0.744
+ AR: 0.723
+ AR@0.5: 0.908
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+ Training Data: *id002
+ Name: rtmo-m_16xb16-600e_body7-640x640
+ Results:
+ - Dataset: COCO
+ Metrics:
+ AP: 0.726
+ AP@0.5: 0.899
+ AP@0.75: 0.790
+ AR: 0.763
+ AR@0.5: 0.926
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_body7-640x640-39e78cc4_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/body7/rtmo-l_16xb16-600e_body7-640x640.py
+ In Collection: RTMO
+ Alias: rtmo
+ Metadata:
+ Architecture: *id001
+ Training Data: *id002
+ Name: rtmo-l_16xb16-600e_body7-640x640
+ Results:
+ - Dataset: COCO
+ Metrics:
+ AP: 0.748
+ AP@0.5: 0.911
+ AP@0.75: 0.813
+ AR: 0.786
+ AR@0.5: 0.939
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_body7-640x640-b37118ce_20231211.pth
diff --git a/configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py b/configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py
new file mode 100644
index 0000000000..97bbd109ca
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py
@@ -0,0 +1,321 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=train_pipeline_stage1,
+)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dataset_coco)
+
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 1.0
+deepen_factor = 1.0
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_l_8x8_300e_coco/yolox_l_8x8_300e_coco'
+ '_20211126_140236-d3bd2b23.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[256, 512, 1024],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=512,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=512,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=512,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-2,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/coco/rtmo-m_16xb16-600e_coco-640x640.py b/configs/body_2d_keypoint/rtmo/coco/rtmo-m_16xb16-600e_coco-640x640.py
new file mode 100644
index 0000000000..de669ba604
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/coco/rtmo-m_16xb16-600e_coco-640x640.py
@@ -0,0 +1,320 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=train_pipeline_stage1,
+)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dataset_coco)
+
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.75
+deepen_factor = 0.67
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'pretrained_models/yolox_m_8x8_300e_coco_20230829.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[192, 384, 768],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=384,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=384,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=384,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-2,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/coco/rtmo-s_8xb32-600e_coco-640x640.py b/configs/body_2d_keypoint/rtmo/coco/rtmo-s_8xb32-600e_coco-640x640.py
new file mode 100644
index 0000000000..755c47bf82
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/coco/rtmo-s_8xb32-600e_coco-640x640.py
@@ -0,0 +1,323 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=600, val_interval=20, dynamic_intervals=[(580, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=40, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=280,
+ end=280,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=280, end=281),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=281,
+ T_max=300,
+ end=580,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=580, end=600),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/coco.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=train_pipeline_stage1,
+)
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dataset_coco)
+
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_val2017.json',
+ data_prefix=dict(img='coco/val2017/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'coco/annotations/person_keypoints_val2017.json',
+ score_mode='bbox',
+ nms_mode='none',
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=20,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 280: {
+ 'proxy_target_cc': True,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.5
+deepen_factor = 0.33
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_s_8x8_300e_coco/yolox_s_8x8_300e_coco_'
+ '20211121_095711-4592a793.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[128, 256, 512],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=256,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=17,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=256,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile),
+ use_keypoints_for_center=True),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=256,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1.0,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/coco/rtmo_coco.md b/configs/body_2d_keypoint/rtmo/coco/rtmo_coco.md
new file mode 100644
index 0000000000..23aac68f0d
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/coco/rtmo_coco.md
@@ -0,0 +1,43 @@
+
+
+
+RTMO
+
+```bibtex
+@misc{lu2023rtmo,
+ title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
+ author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
+ year={2023},
+ eprint={2312.07526},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+
+
+
+
+
+COCO (ECCV'2014)
+
+```bibtex
+@inproceedings{lin2014microsoft,
+ title={Microsoft coco: Common objects in context},
+ author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
+ booktitle={European conference on computer vision},
+ pages={740--755},
+ year={2014},
+ organization={Springer}
+}
+```
+
+
+
+Results on COCO val2017
+
+| Arch | Input Size | AP | AP50 | AP75 | AR | AR50 | ckpt | log |
+| :-------------------------------------------- | :--------: | :---: | :-------------: | :-------------: | :---: | :-------------: | :-------------------------------------------: | :-------------------------------------------: |
+| [RTMO-s](/configs/body_2d_keypoint/rtmo/coco/rtmo-s_8xb32-600e_coco-640x640.py) | 640x640 | 0.677 | 0.878 | 0.737 | 0.715 | 0.908 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_coco-640x640-8db55a59_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_coco-640x640_20231211.json) |
+| [RTMO-m](/configs/body_2d_keypoint/rtmo/coco/rtmo-m_16xb16-600e_coco-640x640.py) | 640x640 | 0.709 | 0.890 | 0.778 | 0.747 | 0.920 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_coco-640x640-6f4e0306_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_coco-640x640_20231211.json) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py) | 640x640 | 0.724 | 0.899 | 0.788 | 0.762 | 0.927 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_coco-640x640-516a421f_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_coco-640x640_20231211.json) |
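+
+As a quick sanity check of the checkpoints above, the sketch below runs bottom-up inference with the high-level `MMPoseInferencer` from MMPose 1.x. The config/checkpoint pair is copied from the RTMO-s row; the image path is a placeholder, and the exact keys of the returned dict may vary slightly across MMPose versions.
+
+```python
+# Minimal sketch: RTMO-s inference via the MMPose 1.x inferencer API.
+from mmpose.apis import MMPoseInferencer
+
+inferencer = MMPoseInferencer(
+    pose2d='configs/body_2d_keypoint/rtmo/coco/'
+    'rtmo-s_8xb32-600e_coco-640x640.py',
+    pose2d_weights='https://download.openmmlab.com/mmpose/v1/projects/rtmo/'
+    'rtmo-s_8xb32-600e_coco-640x640-8db55a59_20231211.pth')
+
+# The inferencer yields one result per image; each result bundles the
+# per-person keypoint predictions (and, optionally, a visualization).
+result = next(inferencer('path/to/image.jpg'))
+print(result['predictions'])
+```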
diff --git a/configs/body_2d_keypoint/rtmo/coco/rtmo_coco.yml b/configs/body_2d_keypoint/rtmo/coco/rtmo_coco.yml
new file mode 100644
index 0000000000..c3fc84429f
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/coco/rtmo_coco.yml
@@ -0,0 +1,56 @@
+Collections:
+- Name: RTMO
+ Paper:
+ Title: 'RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation'
+ URL: https://arxiv.org/abs/2312.07526
+ README: https://github.com/open-mmlab/mmpose/blob/main/docs/src/papers/algorithms/rtmo.md
+Models:
+- Config: configs/body_2d_keypoint/rtmo/coco/rtmo-s_8xb32-600e_coco-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: &id001
+ - RTMO
+    Training Data: COCO
+  Name: rtmo-s_8xb32-600e_coco-640x640
+  Results:
+  - Dataset: COCO
+    Metrics:
+      AP: 0.677
+ AP@0.5: 0.878
+ AP@0.75: 0.737
+ AR: 0.715
+ AR@0.5: 0.908
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_coco-640x640-8db55a59_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/coco/rtmo-m_16xb16-600e_coco-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+    Training Data: COCO
+  Name: rtmo-m_16xb16-600e_coco-640x640
+  Results:
+  - Dataset: COCO
+ Metrics:
+ AP: 0.709
+ AP@0.5: 0.890
+ AP@0.75: 0.778
+ AR: 0.747
+ AR@0.5: 0.920
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_coco-640x640-6f4e0306_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+    Training Data: COCO
+  Name: rtmo-l_16xb16-600e_coco-640x640
+  Results:
+  - Dataset: COCO
+ Metrics:
+ AP: 0.724
+ AP@0.5: 0.899
+ AP@0.75: 0.788
+ AR: 0.762
+ AR@0.5: 0.927
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_coco-640x640-516a421f_20231211.pth
diff --git a/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_body7-crowdpose-640x640.py b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_body7-crowdpose-640x640.py
new file mode 100644
index 0000000000..6ba9fbe04c
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_body7-crowdpose-640x640.py
@@ -0,0 +1,502 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=700, val_interval=50, dynamic_intervals=[(670, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=50, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=350,
+ end=349,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=349, end=350),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=350,
+ T_max=320,
+ end=670,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=670, end=700),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/crowdpose.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.1,
+ rotate_factor=10,
+ scale_factor=(0.75, 1.0),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.8, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+# data settings
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# mapping
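+# Each (source_idx, target_idx) pair below maps a keypoint index in the source
+# dataset to the corresponding index in the 14-keypoint CrowdPose skeleton.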
+aic_crowdpose = [(3, 0), (0, 1), (4, 2), (1, 3), (5, 4), (2, 5),
+ (9, 6), (6, 7), (10, 8), (7, 9), (11, 10), (8, 11), (12, 12),
+ (13, 13)]
+
+coco_crowdpose = [
+ (5, 0),
+ (6, 1),
+ (7, 2),
+ (8, 3),
+ (9, 4),
+ (10, 5),
+ (11, 6),
+ (12, 7),
+ (13, 8),
+ (14, 9),
+ (15, 10),
+ (16, 11),
+]
+
+mpii_crowdpose = [
+ (13, 0),
+ (12, 1),
+ (14, 2),
+ (11, 3),
+ (15, 4),
+ (10, 5),
+ (3, 6),
+ (2, 7),
+ (4, 8),
+ (1, 9),
+ (5, 10),
+ (0, 11),
+ (9, 12),
+ (7, 13),
+]
+
+jhmdb_crowdpose = [(4, 0), (3, 1), (8, 2), (7, 3), (12, 4), (11, 5), (6, 6),
+ (5, 7), (10, 8), (9, 9), (14, 10), (13, 11), (2, 12),
+ (0, 13)]
+
+halpe_crowdpose = [
+ (5, 0),
+ (6, 1),
+ (7, 2),
+ (8, 3),
+ (9, 4),
+ (10, 5),
+ (11, 6),
+ (12, 7),
+ (13, 8),
+ (14, 9),
+ (15, 10),
+ (16, 11),
+]
+
+posetrack_crowdpose = [
+ (5, 0),
+ (6, 1),
+ (7, 2),
+ (8, 3),
+ (9, 4),
+ (10, 5),
+ (11, 6),
+ (12, 7),
+ (13, 8),
+ (14, 9),
+ (15, 10),
+ (16, 11),
+ (2, 12),
+ (1, 13),
+]
+
+# train datasets
+dataset_coco = dict(
+ type='CocoDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/person_keypoints_train2017.json',
+ data_prefix=dict(img='coco/train2017/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=14, mapping=coco_crowdpose)
+ ],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=14, mapping=aic_crowdpose)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=14,
+ mapping=[(i, i) for i in range(14)])
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter', num_keypoints=14, mapping=mpii_crowdpose)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=14,
+ mapping=jhmdb_crowdpose)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=14,
+ mapping=halpe_crowdpose)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=14,
+ mapping=posetrack_crowdpose)
+ ],
+)
+
+train_dataset_stage1 = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file=metafile),
+ datasets=[
+ dataset_coco,
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_halpe,
+ dataset_posetrack,
+ ],
+ sample_ratio_factor=[1, 0.3, 1, 0.3, 0.3, 0.4, 0.3],
+ test_mode=False,
+ pipeline=train_pipeline_stage1)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=train_dataset_stage1)
+
+# val datasets
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_test.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ score_mode='bbox',
+ nms_mode='none',
+ iou_type='keypoints_crowd',
+ prefix='crowdpose',
+ use_area=False,
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=30,
+ new_train_dataset=dataset_crowdpose,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
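+    # At epoch 350, RTMOModeSwitchHook applies the head attribute updates below,
+    # enabling `proxy_target_cc` and re-balancing the cls / MLE / OKS loss weights.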
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 350: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 1.0
+deepen_factor = 1.0
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_l_8x8_300e_coco/yolox_l_8x8_300e_coco'
+ '_20211126_140236-d3bd2b23.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[256, 512, 1024],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=512,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=14,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=512,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=512,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-3,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_crowdpose-640x640.py b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_crowdpose-640x640.py
new file mode 100644
index 0000000000..6b2c78b5a3
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_crowdpose-640x640.py
@@ -0,0 +1,326 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=700, val_interval=50, dynamic_intervals=[(670, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=50, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=350,
+ end=349,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=349, end=350),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=350,
+ T_max=320,
+ end=670,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=670, end=700),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/crowdpose.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.2,
+ rotate_factor=30,
+ scale_factor=(0.5, 1.5),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.6, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# train datasets
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=train_pipeline_stage1,
+)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dataset_crowdpose)
+
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_test.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ score_mode='bbox',
+ nms_mode='none',
+ iou_type='keypoints_crowd',
+ prefix='crowdpose',
+ use_area=False,
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=30,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 350: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 1.0
+deepen_factor = 1.0
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_l_8x8_300e_coco/yolox_l_8x8_300e_coco'
+ '_20211126_140236-d3bd2b23.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[256, 512, 1024],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=512,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=14,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=512,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=512,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-3,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-m_16xb16-700e_crowdpose-640x640.py b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-m_16xb16-700e_crowdpose-640x640.py
new file mode 100644
index 0000000000..af8da87942
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-m_16xb16-700e_crowdpose-640x640.py
@@ -0,0 +1,325 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=700, val_interval=50, dynamic_intervals=[(670, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=50, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=350,
+ end=349,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=349, end=350),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=350,
+ T_max=320,
+ end=670,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=670, end=700),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/crowdpose.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.2,
+ rotate_factor=30,
+ scale_factor=(0.5, 1.5),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.6, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# train datasets
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=train_pipeline_stage1,
+)
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dataset_crowdpose)
+
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_test.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ score_mode='bbox',
+ nms_mode='none',
+ iou_type='keypoints_crowd',
+ prefix='crowdpose',
+ use_area=False,
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=30,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 350: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.75
+deepen_factor = 0.67
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'pretrained_models/yolox_m_8x8_300e_coco_20230829.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[192, 384, 768],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=384,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=14,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=384,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=384,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-3,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-s_8xb32-700e_crowdpose-640x640.py b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-s_8xb32-700e_crowdpose-640x640.py
new file mode 100644
index 0000000000..288da890e8
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-s_8xb32-700e_crowdpose-640x640.py
@@ -0,0 +1,326 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=700, val_interval=50, dynamic_intervals=[(670, 1)])
+
+auto_scale_lr = dict(base_batch_size=256)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=50, max_keep_ckpts=3))
+
+optim_wrapper = dict(
+ type='OptimWrapper',
+ constructor='ForceDefaultOptimWrapperConstructor',
+ optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0,
+ bias_decay_mult=0,
+ bypass_duplicate=True,
+ force_default_settings=True,
+ custom_keys=dict({'neck.encoder': dict(lr_mult=0.05)})),
+ clip_grad=dict(max_norm=0.1, norm_type=2))
+
+param_scheduler = [
+ dict(
+ type='QuadraticWarmupLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=5,
+ T_max=350,
+ end=349,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ # this scheduler is used to increase the lr from 2e-4 to 5e-4
+ dict(type='ConstantLR', by_epoch=True, factor=2.5, begin=349, end=350),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=0.0002,
+ begin=350,
+ T_max=320,
+ end=670,
+ by_epoch=True,
+ convert_to_iter_based=True),
+ dict(type='ConstantLR', by_epoch=True, factor=1, begin=670, end=700),
+]
+
+# data
+input_size = (640, 640)
+metafile = 'configs/_base_/datasets/crowdpose.py'
+codec = dict(type='YOLOXPoseAnnotationProcessor', input_size=input_size)
+
+train_pipeline_stage1 = [
+ dict(type='LoadImage', backend_args=None),
+ dict(
+ type='Mosaic',
+ img_scale=(640, 640),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_factor=0.2,
+ rotate_factor=30,
+ scale_factor=(0.5, 1.5),
+ pad_val=114,
+ distribution='uniform',
+ transform_mode='perspective',
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(
+ type='YOLOXMixUp',
+ img_scale=(640, 640),
+ ratio_range=(0.6, 1.6),
+ pad_val=114.0,
+ pre_transform=[dict(type='LoadImage', backend_args=None)]),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupRandomAffine',
+ input_size=(640, 640),
+ shift_prob=0,
+ rotate_prob=0,
+ scale_prob=0,
+ scale_type='long',
+ pad_val=(114, 114, 114),
+ bbox_keep_corner=False,
+ clip_border=True,
+ ),
+ dict(type='YOLOXHSVRandomAug'),
+ dict(type='RandomFlip'),
+ dict(type='BottomupGetHeatmapMask', get_invalid=True),
+ dict(type='FilterAnnotations', by_kpt=True, by_box=True, keep_empty=False),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs'),
+]
+
+data_mode = 'bottomup'
+data_root = 'data/'
+
+# train datasets
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=train_pipeline_stage1,
+)
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dataset_crowdpose)
+
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(
+ type='BottomupResize', input_size=input_size, pad_val=(114, 114, 114)),
+ dict(
+ type='PackPoseInputs',
+ meta_keys=('id', 'img_id', 'img_path', 'ori_shape', 'img_shape',
+ 'input_size', 'input_center', 'input_scale'))
+]
+
+val_dataloader = dict(
+ batch_size=1,
+ num_workers=2,
+ persistent_workers=True,
+ pin_memory=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_test.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ score_mode='bbox',
+ nms_mode='none',
+ iou_type='keypoints_crowd',
+ prefix='crowdpose',
+ use_area=False,
+)
+test_evaluator = val_evaluator
+
+# hooks
+custom_hooks = [
+ dict(
+ type='YOLOXPoseModeSwitchHook',
+ num_last_epochs=30,
+ new_train_pipeline=train_pipeline_stage2,
+ priority=48),
+ dict(
+ type='RTMOModeSwitchHook',
+ epoch_attributes={
+ 350: {
+ 'proxy_target_cc': True,
+ 'overlaps_power': 1.0,
+ 'loss_cls.loss_weight': 2.0,
+ 'loss_mle.loss_weight': 5.0,
+ 'loss_oks.loss_weight': 10.0
+ },
+ },
+ priority=48),
+ dict(type='SyncNormHook', priority=48),
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ strict_load=False,
+ priority=49),
+]
+
+# model
+widen_factor = 0.5
+deepen_factor = 0.33
+
+model = dict(
+ type='BottomupPoseEstimator',
+ init_cfg=dict(
+ type='Kaiming',
+ layer='Conv2d',
+ a=2.23606797749979,
+ distribution='uniform',
+ mode='fan_in',
+ nonlinearity='leaky_relu'),
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ pad_size_divisor=32,
+ mean=[0, 0, 0],
+ std=[1, 1, 1],
+ batch_augments=[
+ dict(
+ type='BatchSyncRandomResize',
+ random_size_range=(480, 800),
+ size_divisor=32,
+ interval=1),
+ ]),
+ backbone=dict(
+ type='CSPDarknet',
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ out_indices=(2, 3, 4),
+ spp_kernal_sizes=(5, 9, 13),
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish'),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmdetection/v2.0/'
+ 'yolox/yolox_s_8x8_300e_coco/yolox_s_8x8_300e_coco_'
+ '20211121_095711-4592a793.pth',
+ prefix='backbone.',
+ )),
+ neck=dict(
+ type='HybridEncoder',
+ in_channels=[128, 256, 512],
+ deepen_factor=deepen_factor,
+ widen_factor=widen_factor,
+ hidden_dim=256,
+ output_indices=[1, 2],
+ encoder_cfg=dict(
+ self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg=dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ ffn_drop=0.0,
+ act_cfg=dict(type='GELU'))),
+ projector=dict(
+ type='ChannelMapper',
+ in_channels=[256, 256],
+ kernel_size=1,
+ out_channels=256,
+ act_cfg=None,
+ norm_cfg=dict(type='BN'),
+ num_outs=2)),
+ head=dict(
+ type='RTMOHead',
+ num_keypoints=14,
+ featmap_strides=(16, 32),
+ head_module_cfg=dict(
+ num_classes=1,
+ in_channels=256,
+ cls_feat_channels=256,
+ channels_per_group=36,
+ pose_vec_channels=256,
+ widen_factor=widen_factor,
+ stacked_convs=2,
+ norm_cfg=dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg=dict(type='Swish')),
+ assigner=dict(
+ type='SimOTAAssigner',
+ dynamic_k_indicator='oks',
+ oks_calculator=dict(type='PoseOKS', metainfo=metafile)),
+ prior_generator=dict(
+ type='MlvlPointGenerator',
+ centralize_points=True,
+ strides=[16, 32]),
+ dcc_cfg=dict(
+ in_channels=256,
+ feat_channels=128,
+ num_bins=(192, 256),
+ spe_channels=128,
+ gau_cfg=dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ pos_enc='add')),
+ overlaps_power=0.5,
+ loss_cls=dict(
+ type='VariFocalLoss',
+ reduction='sum',
+ use_target_weight=True,
+ loss_weight=1.0),
+ loss_bbox=dict(
+ type='IoULoss',
+ mode='square',
+ eps=1e-16,
+ reduction='sum',
+ loss_weight=5.0),
+ loss_oks=dict(
+ type='OKSLoss',
+ reduction='none',
+ metainfo=metafile,
+ loss_weight=30.0),
+ loss_vis=dict(
+ type='BCELoss',
+ use_target_weight=True,
+ reduction='mean',
+ loss_weight=1.0),
+ loss_mle=dict(
+ type='MLECCLoss',
+ use_target_weight=True,
+ loss_weight=1e-3,
+ ),
+ loss_bbox_aux=dict(type='L1Loss', reduction='sum', loss_weight=1.0),
+ ),
+ test_cfg=dict(
+ input_size=input_size,
+ score_thr=0.1,
+ nms_thr=0.65,
+ ))
diff --git a/configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.md b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.md
new file mode 100644
index 0000000000..314afb40f8
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.md
@@ -0,0 +1,44 @@
+
+
+
+RTMO
+
+```bibtex
+@misc{lu2023rtmo,
+ title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
+ author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
+ year={2023},
+ eprint={2312.07526},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+
+
+
+
+
+CrowdPose (CVPR'2019)
+
+```bibtex
+@article{li2018crowdpose,
+ title={CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark},
+ author={Li, Jiefeng and Wang, Can and Zhu, Hao and Mao, Yihuan and Fang, Hao-Shu and Lu, Cewu},
+ journal={arXiv preprint arXiv:1812.00324},
+ year={2018}
+}
+```
+
+
+
+Results on CrowdPose test set
+
+| Arch | Input Size | AP | AP50 | AP75 | AP (E) | AP (M) | AP (H) | ckpt | log |
+| :--------------------------------------------- | :--------: | :---: | :-------------: | :-------------: | :----: | :----: | :----: | :--------------------------------------------: | :-------------------------------------------: |
+| [RTMO-s](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-s_8xb32-700e_crowdpose-640x640.py) | 640x640 | 0.673 | 0.882 | 0.729 | 0.737 | 0.682 | 0.591 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-700e_crowdpose-640x640-79f81c0d_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-700e_crowdpose-640x640_20231211.json) |
+| [RTMO-m](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-m_16xb16-700e_crowdpose-640x640.py) | 640x640 | 0.711 | 0.897 | 0.771 | 0.774 | 0.719 | 0.634 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rrtmo-m_16xb16-700e_crowdpose-640x640-0eaf670d_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-700e_crowdpose-640x640_20231211.json) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_crowdpose-640x640.py) | 640x640 | 0.732 | 0.907 | 0.793 | 0.792 | 0.741 | 0.653 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_crowdpose-640x640-1008211f_20231211.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_crowdpose-640x640_20231211.json) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_body7-crowdpose-640x640.py)\* | 640x640 | 0.838 | 0.947 | 0.893 | 0.888 | 0.847 | 0.772 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_body7-crowdpose-640x640-5bafdc11_20231219.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_body7-crowdpose-640x640_20231219.json) |
+
+\* indicates the model is trained using a combined dataset composed of AI Challenger, COCO, CrowdPose, Halpe, MPII, PoseTrack18 and sub-JHMDB.
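+
+For completeness, a lower-level usage sketch (assuming the `init_model` and `inference_bottomup` helpers from `mmpose.apis` in MMPose 1.x; the image path is a placeholder):
+
+```python
+# Sketch: load the body7-pretrained RTMO-l CrowdPose checkpoint and run
+# bottom-up inference on a single image.
+from mmpose.apis import inference_bottomup, init_model
+
+config = ('configs/body_2d_keypoint/rtmo/crowdpose/'
+          'rtmo-l_16xb16-700e_body7-crowdpose-640x640.py')
+checkpoint = ('https://download.openmmlab.com/mmpose/v1/projects/rtmo/'
+              'rtmo-l_16xb16-700e_body7-crowdpose-640x640-5bafdc11_20231219.pth')
+
+model = init_model(config, checkpoint, device='cpu')
+results = inference_bottomup(model, 'path/to/image.jpg')
+# One PoseDataSample per image; keypoints has shape (num_persons, 14, 2)
+# for the 14-keypoint CrowdPose skeleton.
+print(results[0].pred_instances.keypoints.shape)
+```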
diff --git a/configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.yml b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.yml
new file mode 100644
index 0000000000..d808f15e12
--- /dev/null
+++ b/configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.yml
@@ -0,0 +1,70 @@
+Models:
+- Config: configs/body_2d_keypoint/rtmo/crowdpose/rtmo-s_8xb32-700e_crowdpose-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: &id001
+ - RTMO
+ Training Data: CrowdPose
+ Name: rtmo-s_8xb32-700e_crowdpose-640x640
+ Results:
+ - Dataset: CrowdPose
+ Metrics:
+ AP: 0.673
+ AP@0.5: 0.882
+ AP@0.75: 0.729
+ AP (E): 0.737
+ AP (M): 0.682
+      AP (H): 0.591
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-700e_crowdpose-640x640-79f81c0d_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/crowdpose/rtmo-m_16xb16-700e_crowdpose-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+ Training Data: CrowdPose
+ Name: rtmo-m_16xb16-700e_crowdpose-640x640
+ Results:
+ - Dataset: CrowdPose
+ Metrics:
+ AP: 0.711
+ AP@0.5: 0.897
+ AP@0.75: 0.771
+ AP (E): 0.774
+ AP (M): 0.719
+      AP (H): 0.634
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rrtmo-m_16xb16-700e_crowdpose-640x640-0eaf670d_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_crowdpose-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+ Training Data: CrowdPose
+ Name: rtmo-l_16xb16-700e_crowdpose-640x640
+ Results:
+ - Dataset: CrowdPose
+ Metrics:
+ AP: 0.732
+ AP@0.5: 0.907
+ AP@0.75: 0.793
+ AP (E): 0.792
+ AP (M): 0.741
+      AP (H): 0.653
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_crowdpose-640x640-1008211f_20231211.pth
+- Config: configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_body7-crowdpose-640x640.py
+ In Collection: RTMO
+ Metadata:
+ Architecture: *id001
+ Training Data: CrowdPose
+ Name: rtmo-l_16xb16-700e_body7-crowdpose-640x640
+ Results:
+ - Dataset: CrowdPose
+ Metrics:
+ AP: 0.838
+ AP@0.5: 0.947
+ AP@0.75: 0.893
+ AP (E): 0.888
+ AP (M): 0.847
+      AP (H): 0.772
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_body7-crowdpose-640x640-5bafdc11_20231219.pth
diff --git a/configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-coco.yml b/configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-coco.yml
index 10a16c61d6..bd8231ccbe 100644
--- a/configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-coco.yml
+++ b/configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-coco.yml
@@ -84,6 +84,7 @@ Models:
Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmposev1/rtmpose-m_simcc-body7_pt-body7_420e-384x288-65e718c4_20230504.pth
- Config: configs/body_2d_keypoint/rtmpose/body8/rtmpose-l_8xb256-420e_body8-384x288.py
In Collection: RTMPose
+ Alias: rtmpose-l
Metadata:
Architecture: *id001
Training Data: *id002
diff --git a/configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.md b/configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.md
new file mode 100644
index 0000000000..3e387923d5
--- /dev/null
+++ b/configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.md
@@ -0,0 +1,38 @@
+
+
+
+HRNet (CVPR'2019)
+
+```bibtex
+@inproceedings{sun2019deep,
+ title={Deep high-resolution representation learning for human pose estimation},
+ author={Sun, Ke and Xiao, Bin and Liu, Dong and Wang, Jingdong},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={5693--5703},
+ year={2019}
+}
+```
+
+
+
+
+
+
+ExLPose (CVPR'2023)
+
+```bibtex
+@inproceedings{ExLPose_2023_CVPR,
+ title={Human Pose Estimation in Extremely Low-Light Conditions},
+  author={Sohyun Lee and Jaesung Rim and Boseung Jeong and Geonu Kim and ByungJu Woo and Haechan Lee and Sunghyun Cho and Suha Kwak},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2023}
+}
+```
+
+
+
+Results on ExLPose-LLA val set with ground-truth bounding boxes
+
+| Arch | Input Size | AP | AP50 | AP75 | AR | AR50 | ckpt | log |
+| :-------------------------------------------- | :--------: | :---: | :-------------: | :-------------: | :---: | :-------------: | :-------------------------------------------: | :-------------------------------------------: |
+| [pose_hrnet_w32](/configs/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-256x192.py) | 256x192 | 0.401 | 0.64 | 0.40 | 0.452 | 0.693 | [ckpt](https://download.openmmlab.com/mmpose/v1/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-ll-256x192.pth) | [log](https://download.openmmlab.com/mmpose/v1/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-ll-256x192.json) |
diff --git a/configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.yml b/configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.yml
new file mode 100644
index 0000000000..2b8637f528
--- /dev/null
+++ b/configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.yml
@@ -0,0 +1,18 @@
+Models:
+- Config: configs/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-256x192.py
+ In Collection: HRNet
+ Metadata:
+ Architecture:
+ - HRNet
+ Training Data: ExLPose-LL
+ Name: td-hm_hrnet-w32_8xb64-210e_exlpose-256x192
+ Results:
+ - Dataset: ExLPose
+ Metrics:
+ AP: 0.401
+ AP@0.5: 0.64
+ AP@0.75: 0.40
+ AR: 0.452
+ AR@0.5: 0.693
+ Task: Body 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-ll-256x192.pth
diff --git a/configs/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-256x192.py b/configs/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-256x192.py
new file mode 100644
index 0000000000..c1fea18a4a
--- /dev/null
+++ b/configs/body_2d_keypoint/topdown_heatmap/exlpose/td-hm_hrnet-w32_8xb64-210e_exlpose-256x192.py
@@ -0,0 +1,149 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# runtime
+train_cfg = dict(max_epochs=210, val_interval=10)
+
+# optimizer
+optim_wrapper = dict(optimizer=dict(
+ type='Adam',
+ lr=5e-4,
+))
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='LinearLR', begin=0, end=500, start_factor=0.001,
+ by_epoch=False), # warm-up
+ dict(
+ type='MultiStepLR',
+ begin=0,
+ end=210,
+ milestones=[170, 200],
+ gamma=0.1,
+ by_epoch=True)
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=512)
+
+# hooks
+default_hooks = dict(checkpoint=dict(save_best='coco/AP', rule='greater'))
+
+# codec settings
+codec = dict(
+ type='MSRAHeatmap', input_size=(192, 256), heatmap_size=(48, 64), sigma=2)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='HRNet',
+ in_channels=3,
+ extra=dict(
+ stage1=dict(
+ num_modules=1,
+ num_branches=1,
+ block='BOTTLENECK',
+ num_blocks=(4, ),
+ num_channels=(64, )),
+ stage2=dict(
+ num_modules=1,
+ num_branches=2,
+ block='BASIC',
+ num_blocks=(4, 4),
+ num_channels=(32, 64)),
+ stage3=dict(
+ num_modules=4,
+ num_branches=3,
+ block='BASIC',
+ num_blocks=(4, 4, 4),
+ num_channels=(32, 64, 128)),
+ stage4=dict(
+ num_modules=3,
+ num_branches=4,
+ block='BASIC',
+ num_blocks=(4, 4, 4, 4),
+ num_channels=(32, 64, 128, 256))),
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmpose/'
+ 'pretrain_models/hrnet_w32-36af842e.pth'),
+ ),
+ head=dict(
+ type='HeatmapHead',
+ in_channels=32,
+ out_channels=14,
+ deconv_out_channels=None,
+ loss=dict(type='KeypointMSELoss', use_target_weight=True),
+ decoder=codec),
+ test_cfg=dict(
+ flip_test=True,
+ flip_mode='heatmap',
+ shift_heatmap=True,
+ ))
+
+# base dataset settings
+dataset_type = 'ExlposeDataset'
+data_mode = 'topdown'
+data_root = 'data/ExLPose/'
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage'),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(type='RandomBBoxTransform'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='GenerateTarget', encoder=codec),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage'),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=2,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='annotations/ExLPose/ExLPose_train_LL.json',
+ data_prefix=dict(img=''),
+ pipeline=train_pipeline,
+ ))
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=2,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='annotations/ExLPose/ExLPose_test_LL-A.json',
+ data_prefix=dict(img=''),
+ test_mode=True,
+ pipeline=val_pipeline,
+ ))
+test_dataloader = val_dataloader
+
+# evaluators
+val_evaluator = dict(
+ type='CocoMetric',
+ ann_file=data_root + 'annotations/ExLPose/ExLPose_test_LL-A.json',
+ use_area=False)
+test_evaluator = val_evaluator
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb1024-270e_cocktail14-256x192.py b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb1024-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..59351d5f4a
--- /dev/null
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb1024-270e_cocktail14-256x192.py
@@ -0,0 +1,615 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 1024
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=8192)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.,
+ widen_factor=1.,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-256x192-4d6dfc62_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[256, 512, 1024],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1024,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
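+            # indices 23-90 are the 68 face keypoints in the 133-keypoint
+            # COCO-WholeBody layout; their loss is scaled by mask_weight below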
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
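+# Each (source_idx, target_idx) pair below maps a keypoint index in the source
+# dataset to the corresponding index in the 133-keypoint COCO-WholeBody layout.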
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
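+# the right-hand mapping mirrors interhand_left: source indices shift down by
+# 21 and target indices shift up by 21 (COCO-WholeBody left hand 91-111,
+# right hand 112-132)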
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
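+# the per-part groups above are merged again by the top-level CombinedDataset
+# in train_dataloader below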
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb320-270e_cocktail14-384x288.py b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb320-270e_cocktail14-384x288.py
new file mode 100644
index 0000000000..a687f89ef6
--- /dev/null
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb320-270e_cocktail14-384x288.py
@@ -0,0 +1,617 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (288, 384)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 320
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ # use a cosine lr schedule over the second half of training (epochs 135 to 270)
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scale the LR based on the actual training batch size
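+# base_batch_size 2560 = 8 GPUs x 320 samples per GPU (cf. the 8xb320 file name)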
+auto_scale_lr = dict(base_batch_size=2560)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(6., 6.93),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False,
+ decode_visibility=True)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.,
+ widen_factor=1.,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-256x192-4d6dfc62_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[256, 512, 1024],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1024,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
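+ # indices 23-90 are the 68 face keypoints in the 133-keypoint
+ # COCO-WholeBody order (cf. the face mappings below); mask_weight
+ # down-weights their loss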
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# keypoint index mappings (source index -> COCO-WholeBody-133 index) used by
+# the KeypointConverter transforms in the dataset pipelines below
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
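+# the right-hand mapping mirrors interhand_left: source indices shift down by
+# 21 and target indices shift up by 21 (COCO-WholeBody left hand 91-111,
+# right hand 112-132)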
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
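+# the per-part groups above are merged again by the top-level CombinedDataset
+# in train_dataloader below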
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-m_8xb1024-270e_cocktail14-256x192.py b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-m_8xb1024-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..fc9d90e5cd
--- /dev/null
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-m_8xb1024-270e_cocktail14-256x192.py
@@ -0,0 +1,615 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 1024
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.05),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scale the LR based on the actual training batch size
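+# base_batch_size 8192 = 8 GPUs x 1024 samples per GPU (cf. the 8xb1024 file name)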
+auto_scale_lr = dict(base_batch_size=8192)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=0.67,
+ widen_factor=0.75,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-m_simcc-ucoco_dw-ucoco_270e-256x192-c8b76419_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[192, 384, 768],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=768,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
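+ # indices 23-90 are the 68 face keypoints in the 133-keypoint
+ # COCO-WholeBody order (cf. the face mappings below); mask_weight
+ # down-weights their loss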
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# keypoint index mappings (source index -> COCO-WholeBody-133 index) used by
+# the KeypointConverter transforms in the dataset pipelines below
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
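+# the right-hand mapping mirrors interhand_left: source indices shift down by
+# 21 and target indices shift up by 21 (COCO-WholeBody left hand 91-111,
+# right hand 112-132)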
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
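+# the per-part groups above are merged again by the top-level CombinedDataset
+# in train_dataloader below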
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py
new file mode 100644
index 0000000000..115dc9408b
--- /dev/null
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py
@@ -0,0 +1,617 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (288, 384)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 320
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ # use a cosine lr schedule over the second half of training (epochs 135 to 270)
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scale the LR based on the actual training batch size
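+# base_batch_size 2560 = 8 GPUs x 320 samples per GPU (cf. the 8xb320 file name)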
+auto_scale_lr = dict(base_batch_size=2560)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(6., 6.93),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False,
+ decode_visibility=True)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.33,
+ widen_factor=1.25,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'wholebody_2d_keypoint/rtmpose/ubody/rtmpose-x_simcc-ucoco_pt-aic-coco_270e-384x288-f5b50679_20230822.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[320, 640, 1280],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1280,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
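+ # indices 23-90 are the 68 face keypoints in the 133-keypoint
+ # COCO-WholeBody order (cf. the face mappings below); mask_weight
+ # down-weights their loss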
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# keypoint index mappings (source index -> COCO-WholeBody-133 index) used by
+# the KeypointConverter transforms in the dataset pipelines below
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
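+# the right-hand mapping mirrors interhand_left: source indices shift down by
+# 21 and target indices shift up by 21 (COCO-WholeBody left hand 91-111,
+# right hand 112-132)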
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
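+# the per-part groups above are merged again by the top-level CombinedDataset
+# in train_dataloader below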
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb704-270e_cocktail14-256x192.py b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb704-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..750ad46d3d
--- /dev/null
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb704-270e_cocktail14-256x192.py
@@ -0,0 +1,615 @@
+_base_ = ['../../../_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 704
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scale the LR based on the actual training batch size
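+# base_batch_size 5632 = 8 GPUs x 704 samples per GPU (cf. the 8xb704 file name)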
+auto_scale_lr = dict(base_batch_size=5632)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.33,
+ widen_factor=1.25,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'wholebody_2d_keypoint/rtmpose/ubody/rtmpose-x_simcc-ucoco_pt-aic-coco_270e-256x192-05f5bcb7_20230822.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[320, 640, 1280],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1280,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
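+            # down-weight the 68 face keypoints (COCO-WholeBody indices 23-90)
+            # with a loss weight of 0.5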
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
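+# Each (source_index, target_index) pair maps a keypoint index in the source
+# dataset to its index in the 133-keypoint COCO-WholeBody skeleton; the lists
+# are consumed by the `KeypointConverter` transforms in the dataset pipelines
+# defined below.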
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
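+# Source indices 21-41 are the left-hand joints (mapped to COCO-WholeBody
+# 91-111); the right-hand mapping (source 0-20 -> 112-132) is derived by
+# shifting both indices of each left-hand pair.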
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw_cocktail13.md b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.md
similarity index 55%
rename from configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw_cocktail13.md
rename to configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.md
index 54e75383ba..b53522226a 100644
--- a/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw_cocktail13.md
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.md
@@ -53,7 +53,7 @@
-- `Cocktail13` denotes model trained on 13 public datasets:
+- `Cocktail14` denotes a model trained on 14 public datasets:
- [AI Challenger](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#aic)
- [CrowdPose](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#crowdpose)
- [MPII](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#mpii)
@@ -67,10 +67,14 @@
- [300W](https://ibug.doc.ic.ac.uk/resources/300-W/)
- [COFW](http://www.vision.caltech.edu/xpburgos/ICCV13/)
- [LaPa](https://github.com/JDAI-CV/lapa-dataset)
+ - [InterHand](https://mks0601.github.io/InterHand2.6M/)
Results on COCO-WholeBody v1.0 val with detector having human AP of 56.4 on COCO val2017 dataset
-| Arch | Input Size | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole AP | Whole AR | ckpt | log |
-| :-------------------------------------- | :--------: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :------: | :------: | :--------------------------------------: | :-------------------------------------: |
-| [rtmw-x](/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail13-256x192.py) | 256x192 | 0.753 | 0.815 | 0.773 | 0.869 | 0.843 | 0.894 | 0.602 | 0.703 | 0.672 | 0.754 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-256x192-fbef0d61_20230925.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-256x192-fbef0d61_20230925.json) |
-| [rtmw-x](/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail13-384x288.py) | 384x288 | 0.764 | 0.825 | 0.791 | 0.883 | 0.882 | 0.922 | 0.654 | 0.744 | 0.702 | 0.779 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-384x288-0949e3a9_20230925.pth) | [log](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-384x288-0949e3a9_20230925.json) |
+| Arch | Input Size | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole AP | Whole AR | ckpt | log |
+| :-------------------------------------------------------- | :--------: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :------: | :------: | :--------------------------------------------------------: | :-: |
+| [rtmw-m](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-m_8xb1024-270e_cocktail14-256x192.py) | 256x192 | 0.676 | 0.747 | 0.671 | 0.794 | 0.783 | 0.854 | 0.491 | 0.604 | 0.582 | 0.673 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-l-m_simcc-cocktail14_270e-256x192-20231122.pth) | - |
+| [rtmw-l](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb1024-270e_cocktail14-256x192.py) | 256x192 | 0.743 | 0.807 | 0.763 | 0.868 | 0.834 | 0.889 | 0.598 | 0.701 | 0.660 | 0.746 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-256x192-20231122.pth) | - |
+| [rtmw-x](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb704-270e_cocktail14-256x192.py) | 256x192 | 0.746 | 0.808 | 0.770 | 0.869 | 0.844 | 0.896 | 0.610 | 0.710 | 0.672 | 0.752 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-256x192-13a2546d_20231208.pth) | - |
+| [rtmw-l](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb320-270e_cocktail14-384x288.py) | 384x288 | 0.761 | 0.824 | 0.793 | 0.885 | 0.884 | 0.921 | 0.663 | 0.752 | 0.701 | 0.780 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-384x288-20231122.pth) | - |
+| [rtmw-x](/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py) | 384x288 | 0.763 | 0.826 | 0.796 | 0.888 | 0.884 | 0.923 | 0.664 | 0.755 | 0.702 | 0.781 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth) | - |
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.yml b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.yml
new file mode 100644
index 0000000000..799a966dc8
--- /dev/null
+++ b/configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.yml
@@ -0,0 +1,108 @@
+Models:
+- Config: configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-m_8xb1024-270e_cocktail14-256x192.py
+ In Collection: RTMPose
+ Alias: wholebody
+ Metadata:
+ Architecture: &id001
+ - RTMPose
+ Training Data: COCO-WholeBody
+ Name: rtmw-m_8xb1024-270e_cocktail14-256x192
+ Results:
+ - Dataset: COCO-WholeBody
+ Metrics:
+ Body AP: 0.676
+ Body AR: 0.747
+ Face AP: 0.783
+ Face AR: 0.854
+ Foot AP: 0.671
+ Foot AR: 0.794
+ Hand AP: 0.491
+ Hand AR: 0.604
+ Whole AP: 0.582
+ Whole AR: 0.673
+ Task: Wholebody 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-l-m_simcc-cocktail14_270e-256x192-20231122.pth
+- Config: configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb1024-270e_cocktail14-256x192.py
+ In Collection: RTMPose
+ Metadata:
+ Architecture: *id001
+ Training Data: COCO-WholeBody
+ Name: rtmw-l_8xb1024-270e_cocktail14-256x192
+ Results:
+ - Dataset: COCO-WholeBody
+ Metrics:
+ Body AP: 0.743
+ Body AR: 0.807
+ Face AP: 0.834
+ Face AR: 0.889
+ Foot AP: 0.763
+ Foot AR: 0.868
+ Hand AP: 0.598
+ Hand AR: 0.701
+ Whole AP: 0.660
+ Whole AR: 0.746
+ Task: Wholebody 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-256x192-20231122.pth
+- Config: configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb704-270e_cocktail14-256x192.py
+ In Collection: RTMPose
+ Metadata:
+ Architecture: *id001
+ Training Data: COCO-WholeBody
+ Name: rtmw-x_8xb704-270e_cocktail14-256x192
+ Results:
+ - Dataset: COCO-WholeBody
+ Metrics:
+ Body AP: 0.746
+ Body AR: 0.808
+ Face AP: 0.844
+ Face AR: 0.896
+ Foot AP: 0.770
+ Foot AR: 0.869
+ Hand AP: 0.610
+ Hand AR: 0.710
+ Whole AP: 0.672
+ Whole AR: 0.752
+ Task: Wholebody 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-256x192-13a2546d_20231208.pth
+- Config: configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-l_8xb320-270e_cocktail14-384x288.py
+ In Collection: RTMPose
+ Metadata:
+ Architecture: *id001
+ Training Data: COCO-WholeBody
+ Name: rtmw-l_8xb320-270e_cocktail14-384x288
+ Results:
+ - Dataset: COCO-WholeBody
+ Metrics:
+ Body AP: 0.761
+ Body AR: 0.824
+ Face AP: 0.884
+ Face AR: 0.921
+ Foot AP: 0.793
+ Foot AR: 0.885
+ Hand AP: 0.663
+ Hand AR: 0.752
+ Whole AP: 0.701
+ Whole AR: 0.780
+ Task: Wholebody 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-384x288-20231122.pth
+- Config: configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py
+ In Collection: RTMPose
+ Metadata:
+ Architecture: *id001
+ Training Data: COCO-WholeBody
+ Name: rtmw-x_8xb320-270e_cocktail14-384x288
+ Results:
+ - Dataset: COCO-WholeBody
+ Metrics:
+ Body AP: 0.763
+ Body AR: 0.826
+ Face AP: 0.884
+ Face AR: 0.923
+ Foot AP: 0.796
+ Foot AR: 0.888
+ Hand AP: 0.664
+ Hand AR: 0.755
+ Whole AP: 0.702
+ Whole AR: 0.781
+ Task: Wholebody 2D Keypoint
+ Weights: https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth
diff --git a/configs/wholebody_2d_keypoint/rtmpose/coco-wholebody/rtmpose_coco-wholebody.yml b/configs/wholebody_2d_keypoint/rtmpose/coco-wholebody/rtmpose_coco-wholebody.yml
index 049f348899..0c1fa437f5 100644
--- a/configs/wholebody_2d_keypoint/rtmpose/coco-wholebody/rtmpose_coco-wholebody.yml
+++ b/configs/wholebody_2d_keypoint/rtmpose/coco-wholebody/rtmpose_coco-wholebody.yml
@@ -4,7 +4,7 @@ Models:
Alias: wholebody
Metadata:
Architecture: &id001
- - HRNet
+ - RTMPose
Training Data: COCO-WholeBody
Name: rtmpose-m_8xb64-270e_coco-wholebody-256x192
Results:
@@ -48,7 +48,7 @@ Models:
Metadata:
Architecture: *id001
Training Data: COCO-WholeBody
- Name: rtmpose-l_8xb32-270e_coco-wholebody-384x288.py
+ Name: rtmpose-l_8xb32-270e_coco-wholebody-384x288
Results:
- Dataset: COCO-WholeBody
Metrics:
diff --git a/demo/body3d_pose_lifter_demo.py b/demo/body3d_pose_lifter_demo.py
index a6c1d394e9..dbb51a4b9d 100644
--- a/demo/body3d_pose_lifter_demo.py
+++ b/demo/body3d_pose_lifter_demo.py
@@ -101,7 +101,7 @@ def parse_args():
parser.add_argument(
'--bbox-thr',
type=float,
- default=0.9,
+ default=0.3,
help='Bounding box score threshold')
parser.add_argument('--kpt-thr', type=float, default=0.3)
parser.add_argument(
diff --git a/demo/inferencer_demo.py b/demo/inferencer_demo.py
index 0ab816e9fb..d20c433f4e 100644
--- a/demo/inferencer_demo.py
+++ b/demo/inferencer_demo.py
@@ -4,6 +4,12 @@
from mmpose.apis.inferencers import MMPoseInferencer, get_model_aliases
+filter_args = dict(bbox_thr=0.3, nms_thr=0.3, pose_based_nms=False)
+POSE2D_SPECIFIC_ARGS = dict(
+ yoloxpose=dict(bbox_thr=0.01, nms_thr=0.65, pose_based_nms=True),
+ rtmo=dict(bbox_thr=0.1, nms_thr=0.65, pose_based_nms=True),
+)
+
def parse_args():
parser = ArgumentParser()
@@ -12,6 +18,8 @@ def parse_args():
type=str,
nargs='?',
help='Input image/video path or folder path.')
+
+ # init args
parser.add_argument(
'--pose2d',
type=str,
@@ -65,6 +73,21 @@ def parse_args():
default=None,
help='Device used for inference. '
'If not specified, the available device will be automatically used.')
+ parser.add_argument(
+ '--show-progress',
+ action='store_true',
+ help='Display the progress bar during inference.')
+
+ # The default arguments for prediction filtering differ for top-down
+ # and bottom-up models. We assign the default arguments according to the
+ # selected pose2d model
+ args, _ = parser.parse_known_args()
+ for model in POSE2D_SPECIFIC_ARGS:
+ if model in args.pose2d:
+ filter_args.update(POSE2D_SPECIFIC_ARGS[model])
+ break
+
+ # call args
parser.add_argument(
'--show',
action='store_true',
@@ -81,13 +104,18 @@ def parse_args():
parser.add_argument(
'--bbox-thr',
type=float,
- default=0.3,
+ default=filter_args['bbox_thr'],
help='Bounding box score threshold')
parser.add_argument(
'--nms-thr',
type=float,
- default=0.3,
+ default=filter_args['nms_thr'],
help='IoU threshold for bounding box NMS')
+ parser.add_argument(
+ '--pose-based-nms',
+ type=lambda arg: arg.lower() in ('true', 'yes', 't', 'y', '1'),
+ default=filter_args['pose_based_nms'],
+ help='Whether to use pose-based NMS')
parser.add_argument(
'--kpt-thr', type=float, default=0.3, help='Keypoint score threshold')
parser.add_argument(
@@ -157,15 +185,16 @@ def parse_args():
init_kws = [
'pose2d', 'pose2d_weights', 'scope', 'device', 'det_model',
- 'det_weights', 'det_cat_ids', 'pose3d', 'pose3d_weights'
+ 'det_weights', 'det_cat_ids', 'pose3d', 'pose3d_weights',
+ 'show_progress'
]
init_args = {}
for init_kw in init_kws:
init_args[init_kw] = call_args.pop(init_kw)
- diaplay_alias = call_args.pop('show_alias')
+ display_alias = call_args.pop('show_alias')
- return init_args, call_args, diaplay_alias
+ return init_args, call_args, display_alias
def display_model_aliases(model_aliases: Dict[str, str]) -> None:
@@ -179,8 +208,8 @@ def display_model_aliases(model_aliases: Dict[str, str]) -> None:
def main():
- init_args, call_args, diaplay_alias = parse_args()
- if diaplay_alias:
+ init_args, call_args, display_alias = parse_args()
+ if display_alias:
model_alises = get_model_aliases(init_args['scope'])
display_model_aliases(model_alises)
else:
diff --git a/docs/en/dataset_zoo/2d_body_keypoint.md b/docs/en/dataset_zoo/2d_body_keypoint.md
index 4448ebe8f4..c3104644f7 100644
--- a/docs/en/dataset_zoo/2d_body_keypoint.md
+++ b/docs/en/dataset_zoo/2d_body_keypoint.md
@@ -14,6 +14,7 @@ MMPose supported datasets:
- [OCHuman](#ochuman) \[ [Homepage](https://github.com/liruilong940607/OCHumanApi) \]
- [MHP](#mhp) \[ [Homepage](https://lv-mhp.github.io/dataset) \]
- [Human-Art](#humanart) \[ [Homepage](https://idea-research.github.io/HumanArt/) \]
+ - [ExLPose](#exlpose-dataset) \[ [Homepage](http://cg.postech.ac.kr/research/ExLPose/) \]
- Videos
- [PoseTrack18](#posetrack18) \[ [Homepage](https://posetrack.net/users/download.php) \]
- [sub-JHMDB](#sub-jhmdb-dataset) \[ [Homepage](http://jhmdb.is.tue.mpg.de/dataset) \]
@@ -438,6 +439,54 @@ mmpose
You can choose whether to download other annotation files in Human-Art. If you want to use additional annotation files (e.g. validation set of cartoon), you need to edit the corresponding code in config file.
+## ExLPose dataset
+
+
+
+
+ExLPose (2023)
+
+```bibtex
+@inproceedings{ExLPose_2023_CVPR,
+ title={Human Pose Estimation in Extremely Low-Light Conditions},
+  author={Sohyun Lee and Jaesung Rim and Boseung Jeong and Geonu Kim and ByungJu Woo and Haechan Lee and Sunghyun Cho and Suha Kwak},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2023}
+}
+```
+
+
+
+
+
+
+
+For [ExLPose](http://cg.postech.ac.kr/research/ExLPose/) data, please download it from [ExLPose](https://drive.google.com/drive/folders/1E0Is4_cShxvsbJlep_aNEYLJpmHzq9FL).
+Move them under $MMPOSE/data, and make them look like this:
+
+```text
+mmpose
+├── mmpose
+├── docs
+├── tests
+├── tools
+├── configs
+`── data
+ │── ExLPose
+ │-- annotations
+ | |-- ExLPose
+ │ |-- ExLPose_train_LL.json
+ │ |-- ExLPose_test_LL-A.json
+ │ |-- ExLPose_test_LL-E.json
+ │ |-- ExLPose_test_LL-H.json
+ │ |-- ExLPose_test_LL-N.json
+ |-- dark
+ |--00001.png
+ |--00002.png
+ |--...
+
+```
+
## PoseTrack18
diff --git a/docs/en/dataset_zoo/3d_wholebody_keypoint.md b/docs/en/dataset_zoo/3d_wholebody_keypoint.md
new file mode 100644
index 0000000000..7edd5f4173
--- /dev/null
+++ b/docs/en/dataset_zoo/3d_wholebody_keypoint.md
@@ -0,0 +1,77 @@
+# 3D Wholebody Keypoint Datasets
+
+It is recommended to symlink the dataset root to `$MMPOSE/data`.
+If your folder structure is different, you may need to change the corresponding paths in config files.
+
+MMPose supported datasets:
+
+- [H3WB](#h3wb) \[ [Homepage](https://github.com/wholebody3d/wholebody3d) \]
+
+## H3WB
+
+
+
+
+H3WB (ICCV'2023)
+
+```bibtex
+@InProceedings{Zhu_2023_ICCV,
+ author = {Zhu, Yue and Samet, Nermin and Picard, David},
+ title = {H3WB: Human3.6M 3D WholeBody Dataset and Benchmark},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ month = {October},
+ year = {2023},
+ pages = {20166-20177}
+}
+```
+
+
+
+
+
+
+
+For [H3WB](https://github.com/wholebody3d/wholebody3d), please follow the [document](3d_body_keypoint.md#human36m) to download the [Human3.6M](http://vision.imar.ro/human3.6m/description.php) dataset, then download the H3WB annotations from the official [webpage](https://github.com/wholebody3d/wholebody3d). Note: please follow their [updates](https://github.com/wholebody3d/wholebody3d?tab=readme-ov-file#updates) to download the annotations.
+
+The data should have the following structure:
+
+```text
+mmpose
+├── mmpose
+├── docs
+├── tests
+├── tools
+├── configs
+`── data
+ ├── h36m
+ ├── annotation_body3d
+ | ├── cameras.pkl
+ | ├── h3wb_train.npz
+ | ├── fps50
+ | | ├── h36m_test.npz
+ | | ├── h36m_train.npz
+ | | ├── joint2d_rel_stats.pkl
+ | | ├── joint2d_stats.pkl
+ | | ├── joint3d_rel_stats.pkl
+ | | `── joint3d_stats.pkl
+ | `── fps10
+ | ├── h36m_test.npz
+ | ├── h36m_train.npz
+ | ├── joint2d_rel_stats.pkl
+ | ├── joint2d_stats.pkl
+ | ├── joint3d_rel_stats.pkl
+ | `── joint3d_stats.pkl
+ `── images
+ ├── S1
+ | ├── S1_Directions_1.54138969
+ | | ├── S1_Directions_1.54138969_00001.jpg
+ | | ├── S1_Directions_1.54138969_00002.jpg
+ | | ├── ...
+ | ├── ...
+ ├── S5
+ ├── S6
+ ├── S7
+ ├── S8
+ ├── S9
+ `── S11
+```
diff --git a/docs/en/faq.md b/docs/en/faq.md
index 3e81a312ca..7fe218bf6e 100644
--- a/docs/en/faq.md
+++ b/docs/en/faq.md
@@ -19,6 +19,8 @@ Detailed compatible MMPose and MMCV versions are shown as below. Please choose t
| MMPose version | MMCV/MMEngine version |
| :------------: | :-----------------------------: |
+| 1.3.0 | mmcv>=2.0.1, mmengine>=0.9.0 |
+| 1.2.0 | mmcv>=2.0.1, mmengine>=0.8.0 |
| 1.1.0 | mmcv>=2.0.1, mmengine>=0.8.0 |
| 1.0.0 | mmcv>=2.0.0, mmengine>=0.7.0 |
| 1.0.0rc1 | mmcv>=2.0.0rc4, mmengine>=0.6.0 |
@@ -157,3 +159,7 @@ Detailed compatible MMPose and MMCV versions are shown as below. Please choose t
1. Set `flip_test=False` in `init_cfg` in the config file.
2. For top-down models, use faster human bounding box detector, see [MMDetection](https://mmdetection.readthedocs.io/en/3.x/model_zoo.html).
+
+- **What is the definition of each keypoint index?**
+
+  Check the [meta information file](https://github.com/open-mmlab/mmpose/tree/main/configs/_base_/datasets) for the dataset used to train the model you are using. The key `keypoint_info` includes the definition of each keypoint.
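+
+  As a quick sketch, you can also inspect it programmatically (the COCO meta file below is an example; use the meta file of the dataset your model was trained on):
+
+  ```python
+  from mmengine.config import Config
+
+  # load a dataset meta information file and list its keypoint definitions
+  meta = Config.fromfile('configs/_base_/datasets/coco.py')
+  for index, info in meta.dataset_info['keypoint_info'].items():
+      print(index, info['name'])
+  ```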
diff --git a/docs/en/index.rst b/docs/en/index.rst
index cc3782925e..2c96751d7a 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -81,6 +81,7 @@ You can change the documentation language at the lower-left corner of the page.
dataset_zoo/2d_animal_keypoint.md
dataset_zoo/3d_body_keypoint.md
dataset_zoo/3d_hand_keypoint.md
+ dataset_zoo/3d_wholebody_keypoint.md
.. toctree::
:maxdepth: 1
diff --git a/docs/en/notes/changelog.md b/docs/en/notes/changelog.md
index 47a73ae7c1..ea0a51c91d 100644
--- a/docs/en/notes/changelog.md
+++ b/docs/en/notes/changelog.md
@@ -1,5 +1,9 @@
# Changelog
+## **v1.3.0 (04/01/2024)**
+
+Release note: https://github.com/open-mmlab/mmpose/releases/tag/v1.3.0
+
## **v1.2.0 (12/10/2023)**
Release note: https://github.com/open-mmlab/mmpose/releases/tag/v1.2.0
diff --git a/docs/en/user_guides/inference.md b/docs/en/user_guides/inference.md
index 3263b392e2..2081f39c75 100644
--- a/docs/en/user_guides/inference.md
+++ b/docs/en/user_guides/inference.md
@@ -128,6 +128,26 @@ inferencer = MMPoseInferencer(
 The complete list of model aliases can be found in the [Model Alias](#model-alias) section.
+**Custom Inferencer for 3D Pose Estimation Models**
+
+The code shown above provides examples for creating 2D pose estimator inferencers. You can similarly construct a 3D model inferencer by using the `pose3d` argument:
+
+```python
+# build the inferencer with 3d model alias
+inferencer = MMPoseInferencer(pose3d="human3d")
+
+# build the inferencer with 3d model config name
+inferencer = MMPoseInferencer(pose3d="motionbert_dstformer-ft-243frm_8xb32-120e_h36m")
+
+# build the inferencer with 3d model config path and checkpoint path/URL
+inferencer = MMPoseInferencer(
+ pose3d='configs/body_3d_keypoint/motionbert/h36m/' \
+ 'motionbert_dstformer-ft-243frm_8xb32-120e_h36m.py',
+ pose3d_weights='https://download.openmmlab.com/mmpose/v1/body_3d_keypoint/' \
+ 'pose_lift/h36m/motionbert_ft_h36m-d80af323_20230531.pth'
+)
+```
+
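+Once constructed, a 3D inferencer is called in the same way as a 2D one. A minimal usage sketch (the input path is an example image from this repository's test data; any image or video path works):
+
+```python
+# run the 3D inferencer and fetch the first result
+result_generator = inferencer('tests/data/coco/000000000785.jpg', show=False)
+result = next(result_generator)
+print(result['predictions'])  # lifted 3D keypoint predictions per instance
+```
+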
**Custom Object Detector for Top-down Pose Estimation Models**
In addition, top-down pose estimators also require an object detection model. The inferencer is capable of inferring the instance type for models trained with datasets supported in MMPose, and subsequently constructing the necessary object detection model. Alternatively, users may also manually specify the detection model using the following methods:
diff --git a/docs/en/user_guides/train_and_test.md b/docs/en/user_guides/train_and_test.md
index 8b40ff9f57..c11a3988cb 100644
--- a/docs/en/user_guides/train_and_test.md
+++ b/docs/en/user_guides/train_and_test.md
@@ -508,3 +508,16 @@ test_evaluator = val_evaluator
```
For further clarification on converting AIC keypoints to COCO keypoints, please consult [this guide](https://mmpose.readthedocs.io/en/latest/user_guides/mixed_datasets.html#merge-aic-into-coco).
+
+### Evaluating Top-down Models with Custom Detector
+
+To evaluate top-down models, you can use either ground truth or pre-detected bounding boxes. The `bbox_file` provides these boxes, generated by a specific detector. For instance, `COCO_val2017_detections_AP_H_56_person.json` contains bounding boxes for the COCO val2017 dataset, generated using a detector with a human AP of 56.4. To create your own `bbox_file` using a custom detector supported by MMDetection, run the following command:
+
+```sh
+python tools/misc/generate_bbox_file.py \
+ ${DET_CONFIG} ${DET_WEIGHT} ${OUTPUT_FILE_NAME} \
+ [--pose-config ${POSE_CONFIG}] \
+ [--score-thr ${SCORE_THRESHOLD}] [--nms-thr ${NMS_THRESHOLD}]
+```
+
+Here, `DET_CONFIG` and `DET_WEIGHT` initialize the detector. `POSE_CONFIG` specifies the test dataset that requires bounding box detection, while the `SCORE_THRESHOLD` and `NMS_THRESHOLD` arguments control bounding box filtering.
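+
+The generated file is then referenced from the evaluation dataset in the pose config through its `bbox_file` field, in the same way as the provided COCO detection file. A minimal config sketch, assuming the output file was named `my_custom_detections.json` (a hypothetical name):
+
+```python
+# point the evaluation dataset at the custom detection results
+val_dataloader = dict(
+    dataset=dict(
+        type='CocoDataset',
+        ann_file='data/coco/annotations/person_keypoints_val2017.json',
+        data_prefix=dict(img='data/coco/val2017/'),
+        bbox_file='data/coco/person_detection_results/my_custom_detections.json',
+        test_mode=True))
+test_dataloader = val_dataloader
+```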
diff --git a/docs/src/papers/algorithms/rtmo.md b/docs/src/papers/algorithms/rtmo.md
new file mode 100644
index 0000000000..dae0e4c576
--- /dev/null
+++ b/docs/src/papers/algorithms/rtmo.md
@@ -0,0 +1,31 @@
+# RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
+
+
+
+
+RTMO
+
+```bibtex
+@misc{lu2023rtmo,
+ title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
+ author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
+ year={2023},
+ eprint={2312.07526},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+
+
+## Abstract
+
+
+
+Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy.
+
+
+
+
+
+
diff --git a/docs/src/papers/datasets/300wlp.md b/docs/src/papers/datasets/300wlp.md
new file mode 100644
index 0000000000..e96776c57c
--- /dev/null
+++ b/docs/src/papers/datasets/300wlp.md
@@ -0,0 +1,18 @@
+# Face Alignment Across Large Poses: A 3D Solution
+
+
+
+
+300WLP (IEEE'2017)
+
+```bibtex
+@article{zhu2017face,
+ title={Face alignment in full pose range: A 3d total solution},
+ author={Zhu, Xiangyu and Liu, Xiaoming and Lei, Zhen and Li, Stan Z},
+ journal={IEEE transactions on pattern analysis and machine intelligence},
+ year={2017},
+ publisher={IEEE}
+}
+```
+
+
diff --git a/docs/src/papers/datasets/exlpose.md b/docs/src/papers/datasets/exlpose.md
new file mode 100644
index 0000000000..eb6957dc9c
--- /dev/null
+++ b/docs/src/papers/datasets/exlpose.md
@@ -0,0 +1,17 @@
+# Human Pose Estimation in Extremely Low-Light Conditions
+
+
+
+
+ExLPose (CVPR'2023)
+
+```bibtex
+@inproceedings{ExLPose_2023_CVPR,
+ title={Human Pose Estimation in Extremely Low-Light Conditions},
+  author={Sohyun Lee and Jaesung Rim and Boseung Jeong and Geonu Kim and ByungJu Woo and Haechan Lee and Sunghyun Cho and Suha Kwak},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2023}
+}
+```
+
+
diff --git a/docs/zh_cn/dataset_zoo/2d_body_keypoint.md b/docs/zh_cn/dataset_zoo/2d_body_keypoint.md
index 4448ebe8f4..4b24b5a7fe 100644
--- a/docs/zh_cn/dataset_zoo/2d_body_keypoint.md
+++ b/docs/zh_cn/dataset_zoo/2d_body_keypoint.md
@@ -1,22 +1,23 @@
-# 2D Body Keypoint Datasets
+# 2D 人体关键点数据集
-It is recommended to symlink the dataset root to `$MMPOSE/data`.
-If your folder structure is different, you may need to change the corresponding paths in config files.
+建议将数据集的根目录链接到 `$MMPOSE/data`。
+如果你的文件夹结构不同,你可能需要在配置文件中更改相应的路径。
-MMPose supported datasets:
+MMPose 支持的数据集:
- Images
- - [COCO](#coco) \[ [Homepage](http://cocodataset.org/) \]
- - [MPII](#mpii) \[ [Homepage](http://human-pose.mpi-inf.mpg.de/) \]
- - [MPII-TRB](#mpii-trb) \[ [Homepage](https://github.com/kennymckormick/Triplet-Representation-of-human-Body) \]
- - [AI Challenger](#aic) \[ [Homepage](https://github.com/AIChallenger/AI_Challenger_2017) \]
- - [CrowdPose](#crowdpose) \[ [Homepage](https://github.com/Jeff-sjtu/CrowdPose) \]
- - [OCHuman](#ochuman) \[ [Homepage](https://github.com/liruilong940607/OCHumanApi) \]
- - [MHP](#mhp) \[ [Homepage](https://lv-mhp.github.io/dataset) \]
- - [Human-Art](#humanart) \[ [Homepage](https://idea-research.github.io/HumanArt/) \]
+ - [COCO](#coco) \[ [主页](http://cocodataset.org/) \]
+ - [MPII](#mpii) \[ [主页](http://human-pose.mpi-inf.mpg.de/) \]
+ - [MPII-TRB](#mpii-trb) \[ [主页](https://github.com/kennymckormick/Triplet-Representation-of-human-Body) \]
+ - [AI Challenger](#aic) \[ [主页](https://github.com/AIChallenger/AI_Challenger_2017) \]
+ - [CrowdPose](#crowdpose) \[ [主页](https://github.com/Jeff-sjtu/CrowdPose) \]
+ - [OCHuman](#ochuman) \[ [主页](https://github.com/liruilong940607/OCHumanApi) \]
+ - [MHP](#mhp) \[ [主页](https://lv-mhp.github.io/dataset) \]
+ - [Human-Art](#humanart) \[ [主页](https://idea-research.github.io/HumanArt/) \]
+  - [ExLPose](#exlpose-dataset) \[ [主页](http://cg.postech.ac.kr/research/ExLPose/) \]
- Videos
- - [PoseTrack18](#posetrack18) \[ [Homepage](https://posetrack.net/users/download.php) \]
- - [sub-JHMDB](#sub-jhmdb-dataset) \[ [Homepage](http://jhmdb.is.tue.mpg.de/dataset) \]
+ - [PoseTrack18](#posetrack18) \[ [主页](https://posetrack.net/users/download.php) \]
+ - [sub-JHMDB](#sub-jhmdb-dataset) \[ [主页](http://jhmdb.is.tue.mpg.de/dataset) \]
## COCO
@@ -42,11 +43,10 @@ MMPose supported datasets:
-For [COCO](http://cocodataset.org/) data, please download from [COCO download](http://cocodataset.org/#download), 2017 Train/Val is needed for COCO keypoints training and validation.
-[HRNet-Human-Pose-Estimation](https://github.com/HRNet/HRNet-Human-Pose-Estimation) provides person detection result of COCO val2017 to reproduce our multi-person pose estimation results.
-Please download from [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blWzzDXoz5BeFl8sWM-) or [GoogleDrive](https://drive.google.com/drive/folders/1fRUDNUDxe9fjqcRZ2bnF_TKMlO0nB_dk?usp=sharing).
-Optionally, to evaluate on COCO'2017 test-dev, please download the [image-info](https://download.openmmlab.com/mmpose/datasets/person_keypoints_test-dev-2017.json).
-Download and extract them under $MMPOSE/data, and make them look like this:
+对于 [COCO](http://cocodataset.org/) 数据,请从 [COCO 下载](http://cocodataset.org/#download) 中下载,需要 2017 训练/验证集进行 COCO 关键点的训练和验证。
+[HRNet-Human-Pose-Estimation](https://github.com/HRNet/HRNet-Human-Pose-Estimation) 提供了 COCO val2017 的人体检测结果,以便重现我们的多人姿态估计结果。
+请从 [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blWzzDXoz5BeFl8sWM-) 或 [GoogleDrive](https://drive.google.com/drive/folders/1fRUDNUDxe9fjqcRZ2bnF_TKMlO0nB_dk?usp=sharing) 下载。
+如果要在 COCO'2017 test-dev 上进行评估,请下载 [image-info](https://download.openmmlab.com/mmpose/datasets/person_keypoints_test-dev-2017.json)。
```text
mmpose
@@ -100,9 +100,9 @@ mmpose
-For [MPII](http://human-pose.mpi-inf.mpg.de/) data, please download from [MPII Human Pose Dataset](http://human-pose.mpi-inf.mpg.de/).
-We have converted the original annotation files into json format, please download them from [mpii_annotations](https://download.openmmlab.com/mmpose/datasets/mpii_annotations.tar).
-Extract them under {MMPose}/data, and make them look like this:
+对于 [MPII](http://human-pose.mpi-inf.mpg.de/) 数据,请从 [MPII Human Pose Dataset](http://human-pose.mpi-inf.mpg.de/) 下载。
+我们已将原始的注释文件转换为 json 格式,请从 [mpii_annotations](https://download.openmmlab.com/mmpose/datasets/mpii_annotations.tar) 下载。
+请将它们解压到 {MMPose}/data 下,并确保目录结构如下:
```text
mmpose
@@ -125,13 +125,13 @@ mmpose
```
-During training and inference, the prediction result will be saved as '.mat' format by default. We also provide a tool to convert this '.mat' to more readable '.json' format.
+在训练和推理期间,默认情况下,预测结果将以 '.mat' 格式保存。我们还提供了一个工具,以将这个 '.mat' 转换为更易读的 '.json' 格式。
```shell
python tools/dataset/mat2json ${PRED_MAT_FILE} ${GT_JSON_FILE} ${OUTPUT_PRED_JSON_FILE}
```
-For example,
+例如,
```shell
python tools/dataset/mat2json work_dirs/res50_mpii_256x256/pred.mat data/mpii/annotations/mpii_val.json pred.json
@@ -160,9 +160,9 @@ python tools/dataset/mat2json work_dirs/res50_mpii_256x256/pred.mat data/mpii/an
-For [MPII-TRB](https://github.com/kennymckormick/Triplet-Representation-of-human-Body) data, please download from [MPII Human Pose Dataset](http://human-pose.mpi-inf.mpg.de/).
-Please download the annotation files from [mpii_trb_annotations](https://download.openmmlab.com/mmpose/datasets/mpii_trb_annotations.tar).
-Extract them under {MMPose}/data, and make them look like this:
+对于 [MPII-TRB](https://github.com/kennymckormick/Triplet-Representation-of-human-Body) 数据,请从 [MPII Human Pose Dataset](http://human-pose.mpi-inf.mpg.de/) 下载。
+请从 [mpii_trb_annotations](https://download.openmmlab.com/mmpose/datasets/mpii_trb_annotations.tar) 下载注释文件。
+将它们解压到 {MMPose}/data 下,并确保它们的结构如下:
```text
mmpose
@@ -204,9 +204,9 @@ mmpose
-For [AIC](https://github.com/AIChallenger/AI_Challenger_2017) data, please download from [AI Challenger 2017](https://github.com/AIChallenger/AI_Challenger_2017), 2017 Train/Val is needed for keypoints training and validation.
-Please download the annotation files from [aic_annotations](https://download.openmmlab.com/mmpose/datasets/aic_annotations.tar).
-Download and extract them under $MMPOSE/data, and make them look like this:
+对于 [AIC](https://github.com/AIChallenger/AI_Challenger_2017) 数据,请从 [AI Challenger 2017](https://github.com/AIChallenger/AI_Challenger_2017) 下载。其中 2017 Train/Val 数据适用于关键点的训练和验证。
+请从 [aic_annotations](https://download.openmmlab.com/mmpose/datasets/aic_annotations.tar) 下载注释文件。
+下载并解压到 $MMPOSE/data 下,并确保它们的结构如下:
```text
mmpose
@@ -254,11 +254,11 @@ mmpose
-For [CrowdPose](https://github.com/Jeff-sjtu/CrowdPose) data, please download from [CrowdPose](https://github.com/Jeff-sjtu/CrowdPose).
-Please download the annotation files and human detection results from [crowdpose_annotations](https://download.openmmlab.com/mmpose/datasets/crowdpose_annotations.tar).
-For top-down approaches, we follow [CrowdPose](https://arxiv.org/abs/1812.00324) to use the [pre-trained weights](https://pjreddie.com/media/files/yolov3.weights) of [YOLOv3](https://github.com/eriklindernoren/PyTorch-YOLOv3) to generate the detected human bounding boxes.
-For model training, we follow [HigherHRNet](https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation) to train models on CrowdPose train/val dataset, and evaluate models on CrowdPose test dataset.
-Download and extract them under $MMPOSE/data, and make them look like this:
+对于 [CrowdPose](https://github.com/Jeff-sjtu/CrowdPose) 数据,请从 [CrowdPose](https://github.com/Jeff-sjtu/CrowdPose) 下载。
+请从 [crowdpose_annotations](https://download.openmmlab.com/mmpose/datasets/crowdpose_annotations.tar) 下载标注文件和人体检测结果。
+对于自上而下的方法,我们按照 [CrowdPose](https://arxiv.org/abs/1812.00324) 使用 [YOLOv3](https://github.com/eriklindernoren/PyTorch-YOLOv3) 的 [预训练权重](https://pjreddie.com/media/files/yolov3.weights) 来生成检测到的人体边界框。
+对于模型训练,我们按照 [HigherHRNet](https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation) 在 CrowdPose 的训练/验证数据集上进行训练,并在 CrowdPose 测试数据集上评估模型。
+下载并解压缩它们到 $MMPOSE/data 目录下,结构应如下:
```text
mmpose
@@ -305,8 +305,8 @@ mmpose
-For [OCHuman](https://github.com/liruilong940607/OCHumanApi) data, please download the images and annotations from [OCHuman](https://github.com/liruilong940607/OCHumanApi),
-Move them under $MMPOSE/data, and make them look like this:
+对于 [OCHuman](https://github.com/liruilong940607/OCHumanApi) 数据,请从 [OCHuman](https://github.com/liruilong940607/OCHumanApi) 下载图像和标注。
+将它们移动到 $MMPOSE/data 目录下,结构应如下:
```text
mmpose
@@ -351,9 +351,9 @@ mmpose
-For [MHP](https://lv-mhp.github.io/dataset) data, please download from [MHP](https://lv-mhp.github.io/dataset).
-Please download the annotation files from [mhp_annotations](https://download.openmmlab.com/mmpose/datasets/mhp_annotations.tar.gz).
-Please download and extract them under $MMPOSE/data, and make them look like this:
+对于 [MHP](https://lv-mhp.github.io/dataset) 数据,请从 [MHP](https://lv-mhp.github.io/dataset) 下载。
+请从 [mhp_annotations](https://download.openmmlab.com/mmpose/datasets/mhp_annotations.tar.gz) 下载标注文件。
+请下载并解压到 $MMPOSE/data 目录下,并确保目录结构如下:
```text
mmpose
@@ -408,8 +408,9 @@ mmpose
-For [Human-Art](https://idea-research.github.io/HumanArt/) data, please download the images and annotation files from [its website](https://idea-research.github.io/HumanArt/). You need to fill in the [data form](https://docs.google.com/forms/d/e/1FAIpQLScroT_jvw6B9U2Qca1_cl5Kmmu1ceKtlh6DJNmWLte8xNEhEw/viewform) to get access to the data.
-Move them under $MMPOSE/data, and make them look like this:
+对于 [Human-Art](https://idea-research.github.io/HumanArt/) 数据,请从 [其网站](https://idea-research.github.io/HumanArt/) 下载图像和标注文件。
+您需要填写 [申请表](https://docs.google.com/forms/d/e/1FAIpQLScroT_jvw6B9U2Qca1_cl5Kmmu1ceKtlh6DJNmWLte8xNEhEw/viewform) 以获取数据访问权限。
+请将它们移动到 $MMPOSE/data 目录下,目录结构应如下:
```text
mmpose
@@ -436,7 +437,54 @@ mmpose
│ │-- HumanArt_validation_detections_AP_H_56_person.json
```
-You can choose whether to download other annotation files in Human-Art. If you want to use additional annotation files (e.g. validation set of cartoon), you need to edit the corresponding code in config file.
+您可以选择是否下载 Human-Art 的其他标注文件。如果你想使用其他标注文件(例如,卡通的验证集),你需要在配置文件中编辑相应的代码。
+
+## ExLPose dataset
+
+
+
+
+ExLPose (2023)
+
+```bibtex
+@inproceedings{ExLPose_2023_CVPR,
+ title={Human Pose Estimation in Extremely Low-Light Conditions},
+  author={Sohyun Lee and Jaesung Rim and Boseung Jeong and Geonu Kim and ByungJu Woo and Haechan Lee and Sunghyun Cho and Suha Kwak},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2023}
+}
+```
+
+
+
+
+
+
+
+请从 [ExLPose](https://drive.google.com/drive/folders/1E0Is4_cShxvsbJlep_aNEYLJpmHzq9FL) 下载数据,将其移动到 $MMPOSE/data 目录下,并使其结构如下:
+
+```text
+mmpose
+├── mmpose
+├── docs
+├── tests
+├── tools
+├── configs
+`── data
+ │── ExLPose
+ │-- annotations
+ | |-- ExLPose
+ │ |-- ExLPose_train_LL.json
+ │ |-- ExLPose_test_LL-A.json
+ │ |-- ExLPose_test_LL-E.json
+ │ |-- ExLPose_test_LL-H.json
+ │ |-- ExLPose_test_LL-N.json
+ |-- dark
+ |--00001.png
+ |--00002.png
+ |--...
+
+```
## PoseTrack18
@@ -461,11 +509,11 @@ You can choose whether to download other annotation files in Human-Art. If you w
-For [PoseTrack18](https://posetrack.net/users/download.php) data, please download from [PoseTrack18](https://posetrack.net/users/download.php).
-Please download the annotation files from [posetrack18_annotations](https://download.openmmlab.com/mmpose/datasets/posetrack18_annotations.tar).
-We have merged the video-wise separated official annotation files into two json files (posetrack18_train & posetrack18_val.json). We also generate the [mask files](https://download.openmmlab.com/mmpose/datasets/posetrack18_mask.tar) to speed up training.
-For top-down approaches, we use [MMDetection](https://github.com/open-mmlab/mmdetection) pre-trained [Cascade R-CNN](https://download.openmmlab.com/mmdetection/v2.0/cascade_rcnn/cascade_rcnn_x101_64x4d_fpn_20e_coco/cascade_rcnn_x101_64x4d_fpn_20e_coco_20200509_224357-051557b1.pth) (X-101-64x4d-FPN) to generate the detected human bounding boxes.
-Please download and extract them under $MMPOSE/data, and make them look like this:
+关于 [PoseTrack18](https://posetrack.net/users/download.php) 的数据,请从 [PoseTrack18](https://posetrack.net/users/download.php) 下载。
+请从 [posetrack18_annotations](https://download.openmmlab.com/mmpose/datasets/posetrack18_annotations.tar) 下载标注文件。
+我们已经将分散在各个视频中的官方标注文件合并为两个 json 文件(posetrack18_train & posetrack18_val.json)。我们还生成了 [mask 文件](https://download.openmmlab.com/mmpose/datasets/posetrack18_mask.tar) 以加速训练。
+对于自上而下的方法,我们使用 [MMDetection](https://github.com/open-mmlab/mmdetection) 预训练的 [Cascade R-CNN](https://download.openmmlab.com/mmdetection/v2.0/cascade_rcnn/cascade_rcnn_x101_64x4d_fpn_20e_coco/cascade_rcnn_x101_64x4d_fpn_20e_coco_20200509_224357-051557b1.pth)(X-101-64x4d-FPN)来生成检测到的人体边界框。
+请下载并将它们解压到 $MMPOSE/data 下,目录结构应如下所示:
```text
mmpose
@@ -527,7 +575,7 @@ mmpose
│-- ...
```
-The official evaluation tool for PoseTrack should be installed from GitHub.
+官方的 PoseTrack 评估工具可以使用以下命令安装。
```shell
pip install git+https://github.com/svenkreiss/poseval.git
@@ -557,9 +605,9 @@ pip install git+https://github.com/svenkreiss/poseval.git
-For [sub-JHMDB](http://jhmdb.is.tue.mpg.de/dataset) data, please download the [images](<(http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz)>) from [JHMDB](http://jhmdb.is.tue.mpg.de/dataset),
-Please download the annotation files from [jhmdb_annotations](https://download.openmmlab.com/mmpose/datasets/jhmdb_annotations.tar).
-Move them under $MMPOSE/data, and make them look like this:
+对于 [sub-JHMDB](http://jhmdb.is.tue.mpg.de/dataset) 的数据,请从 [JHMDB](http://jhmdb.is.tue.mpg.de/dataset) 下载 [图像](<(http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz)>),
+请从 [jhmdb_annotations](https://download.openmmlab.com/mmpose/datasets/jhmdb_annotations.tar) 下载标注文件。
+将它们移至 $MMPOSE/data 下,并使目录结构如下所示:
```text
mmpose
diff --git a/docs/zh_cn/faq.md b/docs/zh_cn/faq.md
index b1e6998396..1b198b0b61 100644
--- a/docs/zh_cn/faq.md
+++ b/docs/zh_cn/faq.md
@@ -19,6 +19,8 @@ Detailed compatible MMPose and MMCV versions are shown as below. Please choose t
| MMPose version | MMCV/MMEngine version |
| :------------: | :-----------------------------: |
+| 1.3.0 | mmcv>=2.0.1, mmengine>=0.9.0 |
+| 1.2.0 | mmcv>=2.0.1, mmengine>=0.8.0 |
| 1.1.0 | mmcv>=2.0.1, mmengine>=0.8.0 |
| 1.0.0 | mmcv>=2.0.0, mmengine>=0.7.0 |
| 1.0.0rc1 | mmcv>=2.0.0rc4, mmengine>=0.6.0 |
@@ -146,3 +148,7 @@ Detailed compatible MMPose and MMCV versions are shown as below. Please choose t
1. set `flip_test=False` in `init_cfg` in the config file.
2. use faster human bounding box detector, see [MMDetection](https://mmdetection.readthedocs.io/zh_CN/3.x/model_zoo.html).
+
+- **What is the definition of each keypoint index?**
+
+  Check the [meta information file](https://github.com/open-mmlab/mmpose/tree/main/configs/_base_/datasets) for the dataset used to train the model you are using. The key `keypoint_info` includes the definition of each keypoint.
diff --git a/docs/zh_cn/notes/changelog.md b/docs/zh_cn/notes/changelog.md
index bd89688bc0..846286d022 100644
--- a/docs/zh_cn/notes/changelog.md
+++ b/docs/zh_cn/notes/changelog.md
@@ -1,5 +1,9 @@
# Changelog
+## **v1.3.0 (04/01/2024)**
+
+Release note: https://github.com/open-mmlab/mmpose/releases/tag/v1.3.0
+
## **v1.2.0 (12/10/2023)**
Release note: https://github.com/open-mmlab/mmpose/releases/tag/v1.2.0
diff --git a/docs/zh_cn/user_guides/inference.md b/docs/zh_cn/user_guides/inference.md
index 24ba42974b..9fc1612e11 100644
--- a/docs/zh_cn/user_guides/inference.md
+++ b/docs/zh_cn/user_guides/inference.md
@@ -22,7 +22,7 @@ from mmpose.apis import MMPoseInferencer
img_path = 'tests/data/coco/000000000785.jpg' # 将img_path替换给你自己的路径
-# 使用模型别名创建推断器
+# 使用模型别名创建推理器
inferencer = MMPoseInferencer('human')
# MMPoseInferencer采用了惰性推断方法,在给定输入时创建一个预测生成器
@@ -108,13 +108,15 @@ results = [result for result in result_generator]
[MMPoseInferencer](https://github.com/open-mmlab/mmpose/blob/dev-1.x/mmpose/apis/inferencers/mmpose_inferencer.py#L24) 提供了几种可用于自定义所使用的模型的方法:
```python
-# 使用模型别名构建推断器
+# 使用模型别名构建推理器
inferencer = MMPoseInferencer('human')
-# 使用模型配置名构建推断器
+# 使用模型配置名构建推理器
inferencer = MMPoseInferencer('td-hm_hrnet-w32_8xb64-210e_coco-256x192')
+# 使用 3D 模型配置名构建推理器
+inferencer = MMPoseInferencer(pose3d="motionbert_dstformer-ft-243frm_8xb32-120e_h36m")
-# 使用模型配置文件和权重文件的路径或 URL 构建推断器
+# 使用模型配置文件和权重文件的路径或 URL 构建推理器
inferencer = MMPoseInferencer(
pose2d='configs/body_2d_keypoint/topdown_heatmap/coco/' \
'td-hm_hrnet-w32_8xb64-210e_coco-256x192.py',
@@ -125,6 +127,24 @@ inferencer = MMPoseInferencer(
模型别名的完整列表可以在模型别名部分中找到。
+上述代码为 2D 模型推理器的构建例子。3D 模型的推理器可以用类似的方式通过 `pose3d` 参数构建:
+
+```python
+# 使用 3D 模型别名构建推理器
+inferencer = MMPoseInferencer(pose3d="human3d")
+
+# 使用 3D 模型配置名构建推理器
+inferencer = MMPoseInferencer(pose3d="motionbert_dstformer-ft-243frm_8xb32-120e_h36m")
+
+# 使用 3D 模型配置文件和权重文件的路径或 URL 构建推理器
+inferencer = MMPoseInferencer(
+ pose3d='configs/body_3d_keypoint/motionbert/h36m/' \
+ 'motionbert_dstformer-ft-243frm_8xb32-120e_h36m.py',
+ pose3d_weights='https://download.openmmlab.com/mmpose/v1/body_3d_keypoint/' \
+ 'pose_lift/h36m/motionbert_ft_h36m-d80af323_20230531.pth'
+)
+```
+
此外,自顶向下的姿态估计器还需要一个对象检测模型。[MMPoseInferencer](https://github.com/open-mmlab/mmpose/blob/dev-1.x/mmpose/apis/inferencers/mmpose_inferencer.py#L24) 能够推断用 MMPose 支持的数据集训练的模型的实例类型,然后构建必要的对象检测模型。用户也可以通过以下方式手动指定检测模型:
```python
@@ -145,7 +165,7 @@ inferencer = MMPoseInferencer(
det_cat_ids=[0], # 指定'human'类的类别id
)
-# 使用模型配置文件和权重文件的路径或URL构建推断器
+# 使用模型配置文件和权重文件的路径或URL构建推理器
inferencer = MMPoseInferencer(
pose2d='human',
det_model=f'{PATH_TO_MMDET}/configs/yolox/yolox_l_8x8_300e_coco.py',
@@ -210,7 +230,7 @@ result = next(result_generator)
### 推理器参数
-[MMPoseInferencer](https://github.com/open-mmlab/mmpose/blob/dev-1.x/mmpose/apis/inferencers/mmpose_inferencer.py#L24) 提供了各种自定义姿态估计、可视化和保存预测结果的参数。下面是初始化推断器时可用的参数列表及对这些参数的描述:
+[MMPoseInferencer](https://github.com/open-mmlab/mmpose/blob/dev-1.x/mmpose/apis/inferencers/mmpose_inferencer.py#L24) 提供了各种自定义姿态估计、可视化和保存预测结果的参数。下面是初始化推理器时可用的参数列表及对这些参数的描述:
| Argument | Description |
| ---------------- | ------------------------------------------------------------ |
diff --git a/docs/zh_cn/user_guides/train_and_test.md b/docs/zh_cn/user_guides/train_and_test.md
index 6cadeab0a3..b95ae965f1 100644
--- a/docs/zh_cn/user_guides/train_and_test.md
+++ b/docs/zh_cn/user_guides/train_and_test.md
@@ -496,3 +496,16 @@ test_evaluator = val_evaluator
```
如需进一步了解如何将 AIC 关键点转换为 COCO 关键点,请查阅 [该指南](https://mmpose.readthedocs.io/zh_CN/dev-1.x/user_guides/mixed_datasets.html#aic-coco)。
+
+### 使用自定义检测器评估 Top-down 模型
+
+要评估 Top-down 模型,您可以使用人工标注的或预先检测到的边界框。 `bbox_file` 提供了由特定检测器生成的这些框。例如,`COCO_val2017_detections_AP_H_56_person.json` 包含了使用具有 56.4 人类 AP 的检测器捕获的 COCO val2017 数据集的边界框。要使用 MMDetection 支持的自定义检测器创建您自己的 `bbox_file`,请运行以下命令:
+
+```sh
+python tools/misc/generate_bbox_file.py \
+ ${DET_CONFIG} ${DET_WEIGHT} ${OUTPUT_FILE_NAME} \
+ [--pose-config ${POSE_CONFIG}] \
+ [--score-thr ${SCORE_THRESHOLD}] [--nms-thr ${NMS_THRESHOLD}]
+```
+
+其中,`DET_CONFIG` 和 `DET_WEIGHT` 用于创建目标检测器。 `POSE_CONFIG` 指定需要边界框检测的测试数据集。`SCORE_THRESHOLD` 和 `NMS_THRESHOLD` 用于边界框过滤。
diff --git a/mmpose/__init__.py b/mmpose/__init__.py
index 583ede0a4d..dda49513fa 100644
--- a/mmpose/__init__.py
+++ b/mmpose/__init__.py
@@ -6,7 +6,7 @@
from .version import __version__, short_version
mmcv_minimum_version = '2.0.0rc4'
-mmcv_maximum_version = '2.2.0'
+mmcv_maximum_version = '2.3.0'
mmcv_version = digit_version(mmcv.__version__)
mmengine_minimum_version = '0.6.0'
diff --git a/mmpose/apis/inference_3d.py b/mmpose/apis/inference_3d.py
index ae6428f187..b4151e804a 100644
--- a/mmpose/apis/inference_3d.py
+++ b/mmpose/apis/inference_3d.py
@@ -23,13 +23,13 @@ def convert_keypoint_definition(keypoints, pose_det_dataset,
ndarray[K, 2 or 3]: the transformed 2D keypoints.
"""
assert pose_lift_dataset in [
-        'h36m'], '`pose_lift_dataset` should be ' \
-            f'`h36m`, but got {pose_lift_dataset}.'
+        'h36m', 'h3wb'], '`pose_lift_dataset` should be ' \
+            f'`h36m` or `h3wb`, but got {pose_lift_dataset}.'
keypoints_new = np.zeros((keypoints.shape[0], 17, keypoints.shape[2]),
dtype=keypoints.dtype)
- if pose_lift_dataset == 'h36m':
- if pose_det_dataset in ['h36m']:
+ if pose_lift_dataset in ['h36m', 'h3wb']:
+ if pose_det_dataset in ['h36m', 'coco_wholebody']:
keypoints_new = keypoints
elif pose_det_dataset in ['coco', 'posetrack18']:
# pelvis (root) is in the middle of l_hip and r_hip
@@ -265,8 +265,26 @@ def inference_pose_lifter_model(model,
bbox_center = dataset_info['stats_info']['bbox_center']
bbox_scale = dataset_info['stats_info']['bbox_scale']
else:
- bbox_center = None
- bbox_scale = None
+ if norm_pose_2d:
+ # compute the average bbox center and scale from the
+ # datasamples in pose_results_2d
+ bbox_center = np.zeros((1, 2), dtype=np.float32)
+ bbox_scale = 0
+ num_bbox = 0
+ for pose_res in pose_results_2d:
+ for data_sample in pose_res:
+ for bbox in data_sample.pred_instances.bboxes:
+ bbox_center += np.array([[(bbox[0] + bbox[2]) / 2,
+ (bbox[1] + bbox[3]) / 2]
+ ])
+ bbox_scale += max(bbox[2] - bbox[0],
+ bbox[3] - bbox[1])
+ num_bbox += 1
+ bbox_center /= num_bbox
+ bbox_scale /= num_bbox
+ else:
+ bbox_center = None
+ bbox_scale = None
pose_results_2d_copy = []
for i, pose_res in enumerate(pose_results_2d):
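> Editor's note: when the dataset meta lacks `stats_info` but `norm_pose_2d` is requested, the new fallback above averages the box centers and the larger box side over all detected instances. A self-contained numpy sketch of that computation on toy `(x1, y1, x2, y2)` boxes:

```python
import numpy as np

# Two toy boxes in xyxy format.
bboxes = np.array([[10., 20., 110., 220.],
                   [30., 40., 90., 160.]], dtype=np.float32)

# Per-box center and the larger of width/height.
centers = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2,
                    (bboxes[:, 1] + bboxes[:, 3]) / 2], axis=1)
scales = np.maximum(bboxes[:, 2] - bboxes[:, 0],
                    bboxes[:, 3] - bboxes[:, 1])

bbox_center = centers.mean(axis=0, keepdims=True)  # shape (1, 2)
bbox_scale = scales.mean()                         # scalar
print(bbox_center, bbox_scale)  # [[ 60. 110.]] 160.0
```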
diff --git a/mmpose/apis/inferencers/base_mmpose_inferencer.py b/mmpose/apis/inferencers/base_mmpose_inferencer.py
index d7d5eb8c19..574063e824 100644
--- a/mmpose/apis/inferencers/base_mmpose_inferencer.py
+++ b/mmpose/apis/inferencers/base_mmpose_inferencer.py
@@ -1,4 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.
+import inspect
import logging
import mimetypes
import os
@@ -21,6 +22,7 @@
from mmengine.runner.checkpoint import _load_checkpoint_to_model
from mmengine.structures import InstanceData
from mmengine.utils import mkdir_or_exist
+from rich.progress import track
from mmpose.apis.inference import dataset_meta_from_config
from mmpose.registry import DATASETS
@@ -53,6 +55,15 @@ class BaseMMPoseInferencer(BaseInferencer):
}
postprocess_kwargs: set = {'pred_out_dir', 'return_datasample'}
+ def __init__(self,
+ model: Union[ModelType, str, None] = None,
+ weights: Optional[str] = None,
+ device: Optional[str] = None,
+ scope: Optional[str] = None,
+ show_progress: bool = False) -> None:
+ super().__init__(
+ model, weights, device, scope, show_progress=show_progress)
+
def _init_detector(
self,
det_model: Optional[Union[ModelType, str]] = None,
@@ -79,8 +90,18 @@ def _init_detector(
det_scope = det_cfg.default_scope
if has_mmdet:
- self.detector = DetInferencer(
- det_model, det_weights, device=device, scope=det_scope)
+ det_kwargs = dict(
+ model=det_model,
+ weights=det_weights,
+ device=device,
+ scope=det_scope,
+ )
+ # for compatibility with low version of mmdet
+ if 'show_progress' in inspect.signature(
+ DetInferencer).parameters:
+ det_kwargs['show_progress'] = False
+
+ self.detector = DetInferencer(**det_kwargs)
else:
raise RuntimeError(
'MMDetection (v3.0.0 or above) is required to build '
@@ -293,22 +314,45 @@ def preprocess(self,
inputs: InputsType,
batch_size: int = 1,
bboxes: Optional[List] = None,
+ bbox_thr: float = 0.3,
+ nms_thr: float = 0.3,
**kwargs):
"""Process the inputs into a model-feedable format.
Args:
inputs (InputsType): Inputs given by user.
batch_size (int): batch size. Defaults to 1.
+ bbox_thr (float): threshold for bounding box detection.
+ Defaults to 0.3.
+ nms_thr (float): IoU threshold for bounding box NMS.
+ Defaults to 0.3.
Yields:
Any: Data processed by the ``pipeline`` and ``collate_fn``.
List[str or np.ndarray]: List of original inputs in the batch
"""
+ # One-stage pose estimators perform prediction filtering within the
+ # head's `predict` method. Here, we set the arguments for filtering
+ if self.cfg.model.type == 'BottomupPoseEstimator':
+ # 1. init with default arguments
+ test_cfg = self.model.head.test_cfg.copy()
+ # 2. update the score_thr and nms_thr in the test_cfg of the head
+ if 'score_thr' in test_cfg:
+ test_cfg['score_thr'] = bbox_thr
+ if 'nms_thr' in test_cfg:
+ test_cfg['nms_thr'] = nms_thr
+ self.model.test_cfg = test_cfg
+
for i, input in enumerate(inputs):
bbox = bboxes[i] if bboxes else []
data_infos = self.preprocess_single(
- input, index=i, bboxes=bbox, **kwargs)
+ input,
+ index=i,
+ bboxes=bbox,
+ bbox_thr=bbox_thr,
+ nms_thr=nms_thr,
+ **kwargs)
# only supports inference with batch size 1
yield self.collate_fn(data_infos), [input]
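> Editor's note: the new `bbox_thr`/`nms_thr` arguments are copied from `preprocess()` into the head's `test_cfg` for bottom-up / one-stage estimators, so they can be tuned per call rather than baked into the config. A hedged usage sketch; the `'rtmo'` alias and the image path are assumptions.

```python
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer('rtmo')  # one-stage model, alias assumed
result_generator = inferencer(
    'path/to/image.jpg', bbox_thr=0.5, nms_thr=0.4)
results = [result for result in result_generator]
```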
@@ -384,7 +428,8 @@ def __call__(
preds = []
- for proc_inputs, ori_inputs in inputs:
+ for proc_inputs, ori_inputs in (track(inputs, description='Inference')
+ if self.show_progress else inputs):
preds = self.forward(proc_inputs, **forward_kwargs)
visualization = self.visualize(ori_inputs, preds,
diff --git a/mmpose/apis/inferencers/hand3d_inferencer.py b/mmpose/apis/inferencers/hand3d_inferencer.py
index 57f1eb06eb..a7db53cb84 100644
--- a/mmpose/apis/inferencers/hand3d_inferencer.py
+++ b/mmpose/apis/inferencers/hand3d_inferencer.py
@@ -78,11 +78,16 @@ def __init__(self,
scope: Optional[str] = 'mmpose',
det_model: Optional[Union[ModelType, str]] = None,
det_weights: Optional[str] = None,
- det_cat_ids: Optional[Union[int, Tuple]] = None) -> None:
+ det_cat_ids: Optional[Union[int, Tuple]] = None,
+ show_progress: bool = False) -> None:
init_default_scope(scope)
super().__init__(
- model=model, weights=weights, device=device, scope=scope)
+ model=model,
+ weights=weights,
+ device=device,
+ scope=scope,
+ show_progress=show_progress)
self.model = revert_sync_batchnorm(self.model)
# assign dataset metainfo to self.visualizer
diff --git a/mmpose/apis/inferencers/mmpose_inferencer.py b/mmpose/apis/inferencers/mmpose_inferencer.py
index cd08d8f6cb..4ade56cb04 100644
--- a/mmpose/apis/inferencers/mmpose_inferencer.py
+++ b/mmpose/apis/inferencers/mmpose_inferencer.py
@@ -7,6 +7,7 @@
from mmengine.config import Config, ConfigDict
from mmengine.infer.infer import ModelType
from mmengine.structures import InstanceData
+from rich.progress import track
from .base_mmpose_inferencer import BaseMMPoseInferencer
from .hand3d_inferencer import Hand3DInferencer
@@ -59,7 +60,9 @@ class MMPoseInferencer(BaseMMPoseInferencer):
'bbox_thr', 'nms_thr', 'bboxes', 'use_oks_tracking', 'tracking_thr',
'disable_norm_pose_2d'
}
- forward_kwargs: set = {'disable_rebase_keypoint'}
+ forward_kwargs: set = {
+ 'merge_results', 'disable_rebase_keypoint', 'pose_based_nms'
+ }
visualize_kwargs: set = {
'return_vis', 'show', 'wait_time', 'draw_bbox', 'radius', 'thickness',
'kpt_thr', 'vis_out_dir', 'skeleton_style', 'draw_heatmap',
@@ -76,23 +79,27 @@ def __init__(self,
scope: str = 'mmpose',
det_model: Optional[Union[ModelType, str]] = None,
det_weights: Optional[str] = None,
- det_cat_ids: Optional[Union[int, List]] = None) -> None:
+ det_cat_ids: Optional[Union[int, List]] = None,
+ show_progress: bool = False) -> None:
self.visualizer = None
+ self.show_progress = show_progress
if pose3d is not None:
if 'hand3d' in pose3d:
self.inferencer = Hand3DInferencer(pose3d, pose3d_weights,
device, scope, det_model,
- det_weights, det_cat_ids)
+ det_weights, det_cat_ids,
+ show_progress)
else:
self.inferencer = Pose3DInferencer(pose3d, pose3d_weights,
pose2d, pose2d_weights,
device, scope, det_model,
- det_weights, det_cat_ids)
+ det_weights, det_cat_ids,
+ show_progress)
elif pose2d is not None:
self.inferencer = Pose2DInferencer(pose2d, pose2d_weights, device,
scope, det_model, det_weights,
- det_cat_ids)
+ det_cat_ids, show_progress)
else:
raise ValueError('Either 2d or 3d pose estimation algorithm '
'should be provided.')
@@ -108,14 +115,8 @@ def preprocess(self, inputs: InputsType, batch_size: int = 1, **kwargs):
Any: Data processed by the ``pipeline`` and ``collate_fn``.
List[str or np.ndarray]: List of original inputs in the batch
"""
-
- for i, input in enumerate(inputs):
- data_batch = {}
- data_infos = self.inferencer.preprocess_single(
- input, index=i, **kwargs)
- data_batch = self.inferencer.collate_fn(data_infos)
- # only supports inference with batch size 1
- yield data_batch, [input]
+ for data in self.inferencer.preprocess(inputs, batch_size, **kwargs):
+ yield data
@torch.no_grad()
def forward(self, inputs: InputType, **forward_kwargs) -> PredType:
@@ -202,7 +203,8 @@ def __call__(
preds = []
- for proc_inputs, ori_inputs in inputs:
+ for proc_inputs, ori_inputs in (track(inputs, description='Inference')
+ if self.show_progress else inputs):
preds = self.forward(proc_inputs, **forward_kwargs)
visualization = self.visualize(ori_inputs, preds,
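> Editor's note: a small, hedged example of the new `show_progress` switch. It is stored on the wrapper and passed down to the underlying 2D/3D inferencer, and when enabled the per-input loop is wrapped in `rich.progress.track`. The input path is a placeholder.

```python
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer('human', show_progress=True)
# A progress bar labelled "Inference" is shown while iterating the inputs.
results = [
    r for r in inferencer('path/to/images_or_video', vis_out_dir='vis/')
]
```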
diff --git a/mmpose/apis/inferencers/pose2d_inferencer.py b/mmpose/apis/inferencers/pose2d_inferencer.py
index 5a0bbad004..8b6a2c3e96 100644
--- a/mmpose/apis/inferencers/pose2d_inferencer.py
+++ b/mmpose/apis/inferencers/pose2d_inferencer.py
@@ -12,7 +12,7 @@
from mmengine.registry import init_default_scope
from mmengine.structures import InstanceData
-from mmpose.evaluation.functional import nms
+from mmpose.evaluation.functional import nearby_joints_nms, nms
from mmpose.registry import INFERENCERS
from mmpose.structures import merge_data_samples
from .base_mmpose_inferencer import BaseMMPoseInferencer
@@ -56,7 +56,7 @@ class Pose2DInferencer(BaseMMPoseInferencer):
"""
preprocess_kwargs: set = {'bbox_thr', 'nms_thr', 'bboxes'}
- forward_kwargs: set = {'merge_results'}
+ forward_kwargs: set = {'merge_results', 'pose_based_nms'}
visualize_kwargs: set = {
'return_vis',
'show',
@@ -79,11 +79,16 @@ def __init__(self,
scope: Optional[str] = 'mmpose',
det_model: Optional[Union[ModelType, str]] = None,
det_weights: Optional[str] = None,
- det_cat_ids: Optional[Union[int, Tuple]] = None) -> None:
+ det_cat_ids: Optional[Union[int, Tuple]] = None,
+ show_progress: bool = False) -> None:
init_default_scope(scope)
super().__init__(
- model=model, weights=weights, device=device, scope=scope)
+ model=model,
+ weights=weights,
+ device=device,
+ scope=scope,
+ show_progress=show_progress)
self.model = revert_sync_batchnorm(self.model)
# assign dataset metainfo to self.visualizer
@@ -208,7 +213,8 @@ def preprocess_single(self,
def forward(self,
inputs: Union[dict, tuple],
merge_results: bool = True,
- bbox_thr: float = -1):
+ bbox_thr: float = -1,
+ pose_based_nms: bool = False):
"""Performs a forward pass through the model.
Args:
@@ -228,9 +234,29 @@ def forward(self,
data_samples = self.model.test_step(inputs)
if self.cfg.data_mode == 'topdown' and merge_results:
data_samples = [merge_data_samples(data_samples)]
+
if bbox_thr > 0:
for ds in data_samples:
if 'bbox_scores' in ds.pred_instances:
ds.pred_instances = ds.pred_instances[
ds.pred_instances.bbox_scores > bbox_thr]
+
+ if pose_based_nms:
+ for ds in data_samples:
+ if len(ds.pred_instances) == 0:
+ continue
+
+ kpts = ds.pred_instances.keypoints
+ scores = ds.pred_instances.bbox_scores
+ num_keypoints = kpts.shape[-2]
+
+ kept_indices = nearby_joints_nms(
+ [
+ dict(keypoints=kpts[i], score=scores[i])
+ for i in range(len(kpts))
+ ],
+ num_nearby_joints_thr=num_keypoints // 3,
+ )
+ ds.pred_instances = ds.pred_instances[kept_indices]
+
return data_samples
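> Editor's note: the `pose_based_nms` branch above feeds predictions into `nearby_joints_nms` with `num_nearby_joints_thr = num_keypoints // 3`, so instances whose joints largely coincide with a higher-scoring instance are dropped. A self-contained sketch of that call on toy data; the exact kept indices depend on the function's default distance threshold, so they are not asserted here.

```python
import numpy as np
from mmpose.evaluation.functional import nearby_joints_nms

# Two nearly identical 17-keypoint predictions plus one distinct person;
# the lower-scoring near-duplicate would typically be suppressed.
person_a = np.random.rand(17, 2) * 100
preds = [
    dict(keypoints=person_a, score=0.9),
    dict(keypoints=person_a + 1.0, score=0.6),              # near-duplicate
    dict(keypoints=np.random.rand(17, 2) * 100 + 300, score=0.8),
]

kept = nearby_joints_nms(preds, num_nearby_joints_thr=17 // 3)
print(kept)  # indices of the predictions to keep
```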
diff --git a/mmpose/apis/inferencers/pose3d_inferencer.py b/mmpose/apis/inferencers/pose3d_inferencer.py
index b0c88c4e7d..f372438298 100644
--- a/mmpose/apis/inferencers/pose3d_inferencer.py
+++ b/mmpose/apis/inferencers/pose3d_inferencer.py
@@ -88,11 +88,16 @@ def __init__(self,
scope: Optional[str] = 'mmpose',
det_model: Optional[Union[ModelType, str]] = None,
det_weights: Optional[str] = None,
- det_cat_ids: Optional[Union[int, Tuple]] = None) -> None:
+ det_cat_ids: Optional[Union[int, Tuple]] = None,
+ show_progress: bool = False) -> None:
init_default_scope(scope)
super().__init__(
- model=model, weights=weights, device=device, scope=scope)
+ model=model,
+ weights=weights,
+ device=device,
+ scope=scope,
+ show_progress=show_progress)
self.model = revert_sync_batchnorm(self.model)
# assign dataset metainfo to self.visualizer
diff --git a/mmpose/apis/inferencers/utils/default_det_models.py b/mmpose/apis/inferencers/utils/default_det_models.py
index ea02097be0..a2deca961b 100644
--- a/mmpose/apis/inferencers/utils/default_det_models.py
+++ b/mmpose/apis/inferencers/utils/default_det_models.py
@@ -7,7 +7,13 @@
mmpose_path = get_installed_path(MODULE2PACKAGE['mmpose'])
default_det_models = dict(
- human=dict(model='rtmdet-m', weights=None, cat_ids=(0, )),
+ human=dict(
+ model=osp.join(
+ mmpose_path, '.mim', 'demo/mmdetection_cfg/'
+ 'rtmdet_m_640-8xb32_coco-person.py'),
+ weights='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth',
+ cat_ids=(0, )),
face=dict(
model=osp.join(mmpose_path, '.mim',
'demo/mmdetection_cfg/yolox-s_8xb8-300e_coco-face.py'),
diff --git a/mmpose/codecs/annotation_processors.py b/mmpose/codecs/annotation_processors.py
index e857cdc0e4..b16a4e67ca 100644
--- a/mmpose/codecs/annotation_processors.py
+++ b/mmpose/codecs/annotation_processors.py
@@ -39,6 +39,14 @@ class YOLOXPoseAnnotationProcessor(BaseAnnotationProcessor):
keypoints_visible='keypoints_visible',
area='areas',
)
+ instance_mapping_table = dict(
+ bbox='bboxes',
+ bbox_score='bbox_scores',
+ keypoints='keypoints',
+ keypoints_visible='keypoints_visible',
+ # remove 'bbox_scales' in default instance_mapping_table to avoid
+ # length mismatch during training with multiple datasets
+ )
def __init__(self,
expand_bbox: bool = False,
diff --git a/mmpose/codecs/image_pose_lifting.py b/mmpose/codecs/image_pose_lifting.py
index 81bd192eb3..1665d88e1d 100644
--- a/mmpose/codecs/image_pose_lifting.py
+++ b/mmpose/codecs/image_pose_lifting.py
@@ -1,5 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.
-from typing import Optional, Tuple
+from typing import List, Optional, Tuple, Union
import numpy as np
@@ -20,7 +20,7 @@ class ImagePoseLifting(BaseKeypointCodec):
Args:
num_keypoints (int): The number of keypoints in the dataset.
- root_index (int): Root keypoint index in the pose.
+ root_index (Union[int, List]): Root keypoint index in the pose.
remove_root (bool): If true, remove the root keypoint from the pose.
Default: ``False``.
save_index (bool): If true, store the root position separated from the
@@ -52,7 +52,7 @@ class ImagePoseLifting(BaseKeypointCodec):
def __init__(self,
num_keypoints: int,
- root_index: int,
+ root_index: Union[int, List] = 0,
remove_root: bool = False,
save_index: bool = False,
reshape_keypoints: bool = True,
@@ -60,10 +60,13 @@ def __init__(self,
keypoints_mean: Optional[np.ndarray] = None,
keypoints_std: Optional[np.ndarray] = None,
target_mean: Optional[np.ndarray] = None,
- target_std: Optional[np.ndarray] = None):
+ target_std: Optional[np.ndarray] = None,
+ additional_encode_keys: Optional[List[str]] = None):
super().__init__()
self.num_keypoints = num_keypoints
+ if isinstance(root_index, int):
+ root_index = [root_index]
self.root_index = root_index
self.remove_root = remove_root
self.save_index = save_index
@@ -96,6 +99,9 @@ def __init__(self,
self.target_mean = target_mean
self.target_std = target_std
+ if additional_encode_keys is not None:
+ self.auxiliary_encode_keys.update(additional_encode_keys)
+
def encode(self,
keypoints: np.ndarray,
keypoints_visible: Optional[np.ndarray] = None,
@@ -161,18 +167,19 @@ def encode(self,
# Zero-center the target pose around a given root keypoint
assert (lifting_target.ndim >= 2 and
- lifting_target.shape[-2] > self.root_index), \
+ lifting_target.shape[-2] > max(self.root_index)), \
f'Got invalid joint shape {lifting_target.shape}'
- root = lifting_target[..., self.root_index, :]
- lifting_target_label = lifting_target - lifting_target[
- ..., self.root_index:self.root_index + 1, :]
+ root = np.mean(
+ lifting_target[..., self.root_index, :], axis=-2, dtype=np.float32)
+ lifting_target_label = lifting_target - root[np.newaxis, ...]
- if self.remove_root:
+ if self.remove_root and len(self.root_index) == 1:
+ root_index = self.root_index[0]
lifting_target_label = np.delete(
- lifting_target_label, self.root_index, axis=-2)
+ lifting_target_label, root_index, axis=-2)
lifting_target_visible = np.delete(
- lifting_target_visible, self.root_index, axis=-2)
+ lifting_target_visible, root_index, axis=-2)
assert lifting_target_weight.ndim in {
2, 3
}, (f'lifting_target_weight.ndim {lifting_target_weight.ndim} '
@@ -180,17 +187,18 @@ def encode(self,
axis_to_remove = -2 if lifting_target_weight.ndim == 3 else -1
lifting_target_weight = np.delete(
- lifting_target_weight, self.root_index, axis=axis_to_remove)
+ lifting_target_weight, root_index, axis=axis_to_remove)
# Add a flag to avoid latter transforms that rely on the root
# joint or the original joint index
encoded['target_root_removed'] = True
# Save the root index which is necessary to restore the global pose
if self.save_index:
- encoded['target_root_index'] = self.root_index
+ encoded['target_root_index'] = root_index
# Normalize the 2D keypoint coordinate with mean and std
keypoint_labels = keypoints.copy()
+
if self.keypoints_mean is not None:
assert self.keypoints_mean.shape[1:] == keypoints.shape[1:], (
f'self.keypoints_mean.shape[1:] {self.keypoints_mean.shape[1:]} ' # noqa
@@ -203,7 +211,8 @@ def encode(self,
if self.target_mean is not None:
assert self.target_mean.shape == lifting_target_label.shape, (
f'self.target_mean.shape {self.target_mean.shape} '
- f'!= lifting_target_label.shape {lifting_target_label.shape}')
+ f'!= lifting_target_label.shape {lifting_target_label.shape}' # noqa
+ )
encoded['target_mean'] = self.target_mean.copy()
encoded['target_std'] = self.target_std.copy()
@@ -263,7 +272,7 @@ def decode(self,
if target_root is not None and target_root.size > 0:
keypoints = keypoints + target_root
- if self.remove_root:
+ if self.remove_root and len(self.root_index) == 1:
keypoints = np.insert(
keypoints, self.root_index, target_root, axis=1)
scores = np.ones(keypoints.shape[:-1], dtype=np.float32)
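> Editor's note: the changes above let `root_index` be a list, in which case the root becomes the mean of the selected joints. A standalone numpy illustration of that centering; the shapes and the choice of indices 11/12 (e.g. left/right hip, giving a mid-hip root) are illustrative only.

```python
import numpy as np

root_index = [11, 12]
lifting_target = np.random.rand(17, 3).astype(np.float32)  # (K, C)

root = np.mean(lifting_target[root_index, :], axis=0)      # (C,)
lifting_target_label = lifting_target - root[np.newaxis, :]

# The mean of the selected root joints is now at the origin.
assert np.allclose(
    lifting_target_label[root_index].mean(axis=0), 0, atol=1e-6)
```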
diff --git a/mmpose/codecs/simcc_label.py b/mmpose/codecs/simcc_label.py
index 6183e2be73..e83960faaf 100644
--- a/mmpose/codecs/simcc_label.py
+++ b/mmpose/codecs/simcc_label.py
@@ -47,6 +47,12 @@ class SimCCLabel(BaseKeypointCodec):
will be :math:`w*simcc_split_ratio`. Defaults to 2.0
label_smooth_weight (float): Label Smoothing weight. Defaults to 0.0
normalize (bool): Whether to normalize the heatmaps. Defaults to True.
+ use_dark (bool): Whether to use the DARK post processing. Defaults to
+ False.
+ decode_visibility (bool): Whether to decode the visibility. Defaults
+ to False.
+ decode_beta (float): The beta value for decoding visibility. Defaults
+ to 150.0.
.. _`SimCC: a Simple Coordinate Classification Perspective for Human Pose
Estimation`: https://arxiv.org/abs/2107.03332
@@ -58,14 +64,18 @@ class SimCCLabel(BaseKeypointCodec):
keypoint_weights='keypoint_weights',
)
- def __init__(self,
- input_size: Tuple[int, int],
- smoothing_type: str = 'gaussian',
- sigma: Union[float, int, Tuple[float]] = 6.0,
- simcc_split_ratio: float = 2.0,
- label_smooth_weight: float = 0.0,
- normalize: bool = True,
- use_dark: bool = False) -> None:
+ def __init__(
+ self,
+ input_size: Tuple[int, int],
+ smoothing_type: str = 'gaussian',
+ sigma: Union[float, int, Tuple[float]] = 6.0,
+ simcc_split_ratio: float = 2.0,
+ label_smooth_weight: float = 0.0,
+ normalize: bool = True,
+ use_dark: bool = False,
+ decode_visibility: bool = False,
+ decode_beta: float = 150.0,
+ ) -> None:
super().__init__()
self.input_size = input_size
@@ -74,6 +84,8 @@ def __init__(self,
self.label_smooth_weight = label_smooth_weight
self.normalize = normalize
self.use_dark = use_dark
+ self.decode_visibility = decode_visibility
+ self.decode_beta = decode_beta
if isinstance(sigma, (float, int)):
self.sigma = np.array([sigma, sigma])
@@ -178,7 +190,14 @@ def decode(self, simcc_x: np.ndarray,
keypoints /= self.simcc_split_ratio
- return keypoints, scores
+ if self.decode_visibility:
+ _, visibility = get_simcc_maximum(
+ simcc_x * self.decode_beta * self.sigma[0],
+ simcc_y * self.decode_beta * self.sigma[1],
+ apply_softmax=True)
+ return keypoints, (scores, visibility)
+ else:
+ return keypoints, scores
def _map_coordinates(
self,
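> Editor's note: a hedged sketch of the new `decode_visibility` option. When it is enabled, `decode()` additionally returns a per-keypoint visibility estimate computed from softmax-normalized SimCC vectors, so the second return value becomes a `(scores, visibility)` tuple. The random SimCC vectors below are placeholders for real head outputs.

```python
import numpy as np
from mmpose.codecs import SimCCLabel

codec = SimCCLabel(
    input_size=(192, 256),
    sigma=(4.9, 5.66),
    simcc_split_ratio=2.0,
    normalize=False,
    decode_visibility=True)

K = 17
simcc_x = np.random.rand(1, K, int(192 * 2)).astype(np.float32)
simcc_y = np.random.rand(1, K, int(256 * 2)).astype(np.float32)

keypoints, (scores, visibility) = codec.decode(simcc_x, simcc_y)
print(keypoints.shape, scores.shape, visibility.shape)
```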
diff --git a/mmpose/codecs/spr.py b/mmpose/codecs/spr.py
index 8e09b185c7..fba17f1598 100644
--- a/mmpose/codecs/spr.py
+++ b/mmpose/codecs/spr.py
@@ -138,7 +138,7 @@ def _get_heatmap_weights(self,
Returns:
np.ndarray: Heatmap weight array in the same shape with heatmaps
"""
- heatmap_weights = np.ones(heatmaps.shape) * bg_weight
+ heatmap_weights = np.ones(heatmaps.shape, dtype=np.float32) * bg_weight
heatmap_weights[heatmaps > 0] = fg_weight
return heatmap_weights
diff --git a/mmpose/codecs/utils/post_processing.py b/mmpose/codecs/utils/post_processing.py
index 7bb447e199..d8637e2bc2 100644
--- a/mmpose/codecs/utils/post_processing.py
+++ b/mmpose/codecs/utils/post_processing.py
@@ -39,7 +39,9 @@ def get_simcc_normalized(batch_pred_simcc, sigma=None):
def get_simcc_maximum(simcc_x: np.ndarray,
- simcc_y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
+ simcc_y: np.ndarray,
+ apply_softmax: bool = False
+ ) -> Tuple[np.ndarray, np.ndarray]:
"""Get maximum response location and value from simcc representations.
Note:
@@ -51,6 +53,8 @@ def get_simcc_maximum(simcc_x: np.ndarray,
Args:
simcc_x (np.ndarray): x-axis SimCC in shape (K, Wx) or (N, K, Wx)
simcc_y (np.ndarray): y-axis SimCC in shape (K, Wy) or (N, K, Wy)
+ apply_softmax (bool): whether to apply softmax on the heatmap.
+ Defaults to False.
Returns:
tuple:
@@ -76,6 +80,13 @@ def get_simcc_maximum(simcc_x: np.ndarray,
else:
N = None
+ if apply_softmax:
+ simcc_x = simcc_x - np.max(simcc_x, axis=1, keepdims=True)
+ simcc_y = simcc_y - np.max(simcc_y, axis=1, keepdims=True)
+ ex, ey = np.exp(simcc_x), np.exp(simcc_y)
+ simcc_x = ex / np.sum(ex, axis=1, keepdims=True)
+ simcc_y = ey / np.sum(ey, axis=1, keepdims=True)
+
x_locs = np.argmax(simcc_x, axis=1)
y_locs = np.argmax(simcc_y, axis=1)
locs = np.stack((x_locs, y_locs), axis=-1).astype(np.float32)
diff --git a/mmpose/codecs/video_pose_lifting.py b/mmpose/codecs/video_pose_lifting.py
index 2b08b4da85..5a5a7b1983 100644
--- a/mmpose/codecs/video_pose_lifting.py
+++ b/mmpose/codecs/video_pose_lifting.py
@@ -1,7 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
from copy import deepcopy
-from typing import Optional, Tuple
+from typing import List, Optional, Tuple, Union
import numpy as np
@@ -24,7 +24,8 @@ class VideoPoseLifting(BaseKeypointCodec):
num_keypoints (int): The number of keypoints in the dataset.
zero_center: Whether to zero-center the target around root. Default:
``True``.
- root_index (int): Root keypoint index in the pose. Default: 0.
+ root_index (Union[int, List]): Root keypoint index in the pose.
+ Default: 0.
remove_root (bool): If true, remove the root keypoint from the pose.
Default: ``False``.
save_index (bool): If true, store the root position separated from the
@@ -54,7 +55,7 @@ class VideoPoseLifting(BaseKeypointCodec):
def __init__(self,
num_keypoints: int,
zero_center: bool = True,
- root_index: int = 0,
+ root_index: Union[int, List] = 0,
remove_root: bool = False,
save_index: bool = False,
reshape_keypoints: bool = True,
@@ -64,6 +65,8 @@ def __init__(self,
self.num_keypoints = num_keypoints
self.zero_center = zero_center
+ if isinstance(root_index, int):
+ root_index = [root_index]
self.root_index = root_index
self.remove_root = remove_root
self.save_index = save_index
@@ -143,19 +146,19 @@ def encode(self,
# Zero-center the target pose around a given root keypoint
if self.zero_center:
assert (lifting_target.ndim >= 2 and
- lifting_target.shape[-2] > self.root_index), \
+ lifting_target.shape[-2] > max(self.root_index)), \
f'Got invalid joint shape {lifting_target.shape}'
- root = lifting_target[..., self.root_index, :]
- lifting_target_label -= lifting_target_label[
- ..., self.root_index:self.root_index + 1, :]
+ root = np.mean(lifting_target[..., self.root_index, :], axis=-2)
+ lifting_target_label -= root[..., np.newaxis, :]
encoded['target_root'] = root
- if self.remove_root:
+ if self.remove_root and len(self.root_index) == 1:
+ root_index = self.root_index[0]
lifting_target_label = np.delete(
- lifting_target_label, self.root_index, axis=-2)
+ lifting_target_label, root_index, axis=-2)
lifting_target_visible = np.delete(
- lifting_target_visible, self.root_index, axis=-2)
+ lifting_target_visible, root_index, axis=-2)
assert lifting_target_weight.ndim in {
2, 3
}, (f'Got invalid lifting target weights shape '
@@ -163,16 +166,14 @@ def encode(self,
axis_to_remove = -2 if lifting_target_weight.ndim == 3 else -1
lifting_target_weight = np.delete(
- lifting_target_weight,
- self.root_index,
- axis=axis_to_remove)
+ lifting_target_weight, root_index, axis=axis_to_remove)
# Add a flag to avoid latter transforms that rely on the root
# joint or the original joint index
encoded['target_root_removed'] = True
# Save the root index for restoring the global pose
if self.save_index:
- encoded['target_root_index'] = self.root_index
+ encoded['target_root_index'] = root_index
# Normalize the 2D keypoint coordinate with image width and height
_camera_param = deepcopy(camera_param)
@@ -237,7 +238,7 @@ def decode(self,
if target_root is not None and target_root.size > 0:
keypoints = keypoints + target_root
- if self.remove_root:
+ if self.remove_root and len(self.root_index) == 1:
keypoints = np.insert(
keypoints, self.root_index, target_root, axis=1)
scores = np.ones(keypoints.shape[:-1], dtype=np.float32)
diff --git a/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-l_8xb1024-270e_cocktail14-256x192.py b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-l_8xb1024-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..a87290ecb6
--- /dev/null
+++ b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-l_8xb1024-270e_cocktail14-256x192.py
@@ -0,0 +1,637 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from mmengine.config import read_base
+
+with read_base():
+ from mmpose.configs._base_.default_runtime import * # noqa
+
+from albumentations.augmentations import Blur, CoarseDropout, MedianBlur
+from mmdet.engine.hooks import PipelineSwitchHook
+from mmengine.dataset import DefaultSampler
+from mmengine.hooks import EMAHook
+from mmengine.model import PretrainedInit
+from mmengine.optim import CosineAnnealingLR, LinearLR, OptimWrapper
+from torch.nn import SiLU, SyncBatchNorm
+from torch.optim import AdamW
+
+from mmpose.codecs import SimCCLabel
+from mmpose.datasets import (AicDataset, CocoWholeBodyDataset, COFWDataset,
+ CombinedDataset, CrowdPoseDataset,
+ Face300WDataset, GenerateTarget,
+ GetBBoxCenterScale, HalpeDataset,
+ HumanArt21Dataset, InterHand2DDoubleDataset,
+ JhmdbDataset, KeypointConverter, LapaDataset,
+ LoadImage, MpiiDataset, PackPoseInputs,
+ PoseTrack18Dataset, RandomFlip, RandomHalfBody,
+ TopdownAffine, UBody2dDataset, WFLWDataset)
+from mmpose.datasets.transforms.common_transforms import (
+ Albumentation, PhotometricDistortion, RandomBBoxTransform)
+from mmpose.engine.hooks import ExpMomentumEMA
+from mmpose.evaluation import CocoWholeBodyMetric
+from mmpose.models import (CSPNeXt, CSPNeXtPAFPN, KLDiscretLoss,
+ PoseDataPreprocessor, RTMWHead,
+ TopdownPoseEstimator)
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 1024
+val_batch_size = 32
+
+train_cfg.update(max_epochs=max_epochs, val_interval=10) # noqa
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type=OptimWrapper,
+ optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type=LinearLR, start_factor=1.0e-5, by_epoch=False, begin=0, end=1000),
+ dict(
+ type=CosineAnnealingLR,
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=8192)
+
+# codec settings
+codec = dict(
+ type=SimCCLabel,
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type=TopdownPoseEstimator,
+ data_preprocessor=dict(
+ type=PoseDataPreprocessor,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type=CSPNeXt,
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.,
+ widen_factor=1.,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type=SiLU),
+ init_cfg=dict(
+ type=PretrainedInit,
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-256x192-4d6dfc62_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type=CSPNeXtPAFPN,
+ in_channels=[256, 512, 1024],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type=SyncBatchNorm),
+ act_cfg=dict(type=SiLU, inplace=True)),
+ head=dict(
+ type=RTMWHead,
+ in_channels=1024,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type=KLDiscretLoss,
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = CocoWholeBodyDataset
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type=LoadImage, backend_args=backend_args),
+ dict(type=GetBBoxCenterScale),
+ dict(type=RandomFlip, direction='horizontal'),
+ dict(type=RandomHalfBody),
+ dict(type=RandomBBoxTransform, scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type=TopdownAffine, input_size=codec['input_size']),
+ dict(type=PhotometricDistortion),
+ dict(
+ type=Albumentation,
+ transforms=[
+ dict(type=Blur, p=0.1),
+ dict(type=MedianBlur, p=0.1),
+ dict(
+ type=CoarseDropout,
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type=GenerateTarget, encoder=codec, use_dataset_keypoint_weights=True),
+ dict(type=PackPoseInputs)
+]
+val_pipeline = [
+ dict(type=LoadImage, backend_args=backend_args),
+ dict(type=GetBBoxCenterScale),
+ dict(type=TopdownAffine, input_size=codec['input_size']),
+ dict(type=PackPoseInputs)
+]
+train_pipeline_stage2 = [
+ dict(type=LoadImage, backend_args=backend_args),
+ dict(type=GetBBoxCenterScale),
+ dict(type=RandomFlip, direction='horizontal'),
+ dict(type=RandomHalfBody),
+ dict(
+ type=RandomBBoxTransform,
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type=TopdownAffine, input_size=codec['input_size']),
+ dict(
+ type=Albumentation,
+ transforms=[
+ dict(type=Blur, p=0.1),
+ dict(type=MedianBlur, p=0.1),
+ ]),
+ dict(
+ type=GenerateTarget, encoder=codec, use_dataset_keypoint_weights=True),
+ dict(type=PackPoseInputs)
+]
+
+# mapping
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type=AicDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type=CrowdPoseDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type=MpiiDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type=JhmdbDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type=HalpeDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type=PoseTrack18Dataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type=HumanArt21Dataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type=UBody2dDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type=CombinedDataset,
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type=LoadImage, backend_args=backend_args),
+ dict(type=GetBBoxCenterScale, padding=1.25),
+ dict(
+ type=RandomBBoxTransform,
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type=WFLWDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type=Face300WDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type=COFWDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type=LapaDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type=CombinedDataset,
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type=CombinedDataset,
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type=CombinedDataset,
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type=LoadImage, backend_args=backend_args),
+ dict(type=GetBBoxCenterScale),
+ dict(
+ type=RandomBBoxTransform,
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type=InterHand2DDoubleDataset,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type=KeypointConverter,
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type=CombinedDataset,
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type=DefaultSampler, shuffle=True),
+ dataset=dict(
+ type=CombinedDataset,
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type=DefaultSampler, shuffle=False, round_up=False),
+ dataset=dict(
+ type=CocoWholeBodyDataset,
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks.update( # noqa
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type=EMAHook,
+ ema_type=ExpMomentumEMA,
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type=PipelineSwitchHook,
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type=CocoWholeBodyMetric,
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
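> Editor's note: a hedged sketch of launching training with the new-style (pure Python) config added above, using the MMEngine Python API. It assumes the cocktail datasets are prepared under `data/` as specified in the config; `work_dir` is an arbitrary output directory, and the equivalent command line would be `python tools/train.py <config>`.

```python
from mmengine.config import Config
from mmengine.runner import Runner

# Load the config introduced in this change and start training.
cfg = Config.fromfile(
    'mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/'
    'rtmw-l_8xb1024-270e_cocktail14-256x192.py')
cfg.work_dir = 'work_dirs/rtmw-l_cocktail14-256x192'

runner = Runner.from_cfg(cfg)
runner.train()
```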
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail13-384x288.py b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-l_8xb320-270e_cocktail14-384x288.py
similarity index 96%
rename from projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail13-384x288.py
rename to mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-l_8xb320-270e_cocktail14-384x288.py
index 55d07b61a8..58172f7cc5 100644
--- a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail13-384x288.py
+++ b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-l_8xb320-270e_cocktail14-384x288.py
@@ -48,7 +48,7 @@
# optimizer
optim_wrapper = dict(
type=OptimWrapper,
- optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.05),
+ optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.1),
clip_grad=dict(max_norm=35, norm_type=2),
paramwise_cfg=dict(
norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
@@ -68,7 +68,7 @@
]
# automatically scaling LR based on the actual training batch size
-auto_scale_lr = dict(base_batch_size=5632)
+auto_scale_lr = dict(base_batch_size=2560)
# codec settings
codec = dict(
@@ -91,20 +91,20 @@
type=CSPNeXt,
arch='P5',
expand_ratio=0.5,
- deepen_factor=1.33,
- widen_factor=1.25,
+ deepen_factor=1.,
+ widen_factor=1.,
channel_attention=True,
norm_cfg=dict(type='BN'),
act_cfg=dict(type=SiLU),
init_cfg=dict(
- type=PretrainedInit,
+ type='Pretrained',
prefix='backbone.',
- checkpoint='https://download.openmmlab.com/mmpose/v1/'
- 'wholebody_2d_keypoint/rtmpose/ubody/rtmpose-x_simcc-ucoco_pt-aic-coco_270e-384x288-f5b50679_20230822.pth' # noqa
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-256x192-4d6dfc62_20230728.pth' # noqa
)),
neck=dict(
type=CSPNeXtPAFPN,
- in_channels=[320, 640, 1280],
+ in_channels=[256, 512, 1024],
out_channels=None,
out_indices=(
1,
@@ -116,7 +116,7 @@
act_cfg=dict(type=SiLU, inplace=True)),
head=dict(
type=RTMWHead,
- in_channels=1280,
+ in_channels=1024,
out_channels=num_keypoints,
input_size=input_size,
in_featuremap_size=tuple([s // 32 for s in input_size]),
@@ -128,14 +128,18 @@
expansion_factor=2,
dropout_rate=0.,
drop_path=0.,
- act_fn=SiLU,
+ act_fn='SiLU',
use_rel_bias=False,
pos_enc=False),
loss=dict(
type=KLDiscretLoss,
use_target_weight=True,
- beta=10.,
- label_softmax=True),
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
decoder=codec),
test_cfg=dict(flip_test=True))
@@ -168,7 +172,7 @@
min_holes=1,
min_height=0.2,
min_width=0.2,
- p=1.0),
+ p=0.5),
]),
dict(
type=GenerateTarget, encoder=codec, use_dataset_keypoint_weights=True),
@@ -217,8 +221,6 @@
(3, 11),
(4, 13),
(5, 15),
- (8, 18),
- (9, 17),
(10, 10),
(11, 8),
(12, 6),
@@ -228,8 +230,6 @@
]
jhmdb_coco133 = [
- (0, 18),
- (2, 17),
(3, 6),
(4, 5),
(5, 12),
@@ -252,7 +252,6 @@
posetrack_coco133 = [
(0, 0),
- (2, 17),
(3, 3),
(4, 4),
(5, 5),
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail13-256x192.py b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-m_8xb1024-270e_cocktail14-256x192.py
similarity index 97%
rename from projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail13-256x192.py
rename to mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-m_8xb1024-270e_cocktail14-256x192.py
index 48275c3c11..dd46cce5ff 100644
--- a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail13-256x192.py
+++ b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-m_8xb1024-270e_cocktail14-256x192.py
@@ -39,7 +39,7 @@
max_epochs = 270
stage2_num_epochs = 10
base_lr = 5e-4
-train_batch_size = 704
+train_batch_size = 1024
val_batch_size = 32
train_cfg.update(max_epochs=max_epochs, val_interval=10) # noqa
@@ -68,7 +68,7 @@
]
# automatically scaling LR based on the actual training batch size
-auto_scale_lr = dict(base_batch_size=5632)
+auto_scale_lr = dict(base_batch_size=8192)
# codec settings
codec = dict(
@@ -91,20 +91,20 @@
type=CSPNeXt,
arch='P5',
expand_ratio=0.5,
- deepen_factor=1.33,
- widen_factor=1.25,
+ deepen_factor=0.67,
+ widen_factor=0.75,
channel_attention=True,
norm_cfg=dict(type='BN'),
act_cfg=dict(type=SiLU),
init_cfg=dict(
type=PretrainedInit,
prefix='backbone.',
- checkpoint='https://download.openmmlab.com/mmpose/v1/'
- 'wholebody_2d_keypoint/rtmpose/ubody/rtmpose-x_simcc-ucoco_pt-aic-coco_270e-256x192-05f5bcb7_20230822.pth' # noqa
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-m_simcc-ucoco_dw-ucoco_270e-256x192-c8b76419_20230728.pth' # noqa
)),
neck=dict(
type=CSPNeXtPAFPN,
- in_channels=[320, 640, 1280],
+ in_channels=[192, 384, 768],
out_channels=None,
out_indices=(
1,
@@ -116,7 +116,7 @@
act_cfg=dict(type=SiLU, inplace=True)),
head=dict(
type=RTMWHead,
- in_channels=1280,
+ in_channels=768,
out_channels=num_keypoints,
input_size=input_size,
in_featuremap_size=tuple([s // 32 for s in input_size]),
@@ -128,14 +128,18 @@
expansion_factor=2,
dropout_rate=0.,
drop_path=0.,
- act_fn=SiLU,
+ act_fn='SiLU',
use_rel_bias=False,
pos_enc=False),
loss=dict(
type=KLDiscretLoss,
use_target_weight=True,
- beta=10.,
- label_softmax=True),
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
decoder=codec),
test_cfg=dict(flip_test=True))
@@ -168,7 +172,7 @@
min_holes=1,
min_height=0.2,
min_width=0.2,
- p=1.0),
+ p=0.5),
]),
dict(
type=GenerateTarget, encoder=codec, use_dataset_keypoint_weights=True),
@@ -180,7 +184,6 @@
dict(type=TopdownAffine, input_size=codec['input_size']),
dict(type=PackPoseInputs)
]
-
train_pipeline_stage2 = [
dict(type=LoadImage, backend_args=backend_args),
dict(type=GetBBoxCenterScale),
@@ -218,8 +221,6 @@
(3, 11),
(4, 13),
(5, 15),
- (8, 18),
- (9, 17),
(10, 10),
(11, 8),
(12, 6),
@@ -229,8 +230,6 @@
]
jhmdb_coco133 = [
- (0, 18),
- (2, 17),
(3, 6),
(4, 5),
(5, 12),
@@ -253,7 +252,6 @@
posetrack_coco133 = [
(0, 0),
- (2, 17),
(3, 3),
(4, 4),
(5, 5),
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail13-384x288.py b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail14-384x288.py
similarity index 98%
rename from configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail13-384x288.py
rename to mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail14-384x288.py
index 55d07b61a8..73fe642366 100644
--- a/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail13-384x288.py
+++ b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb320-270e_cocktail14-384x288.py
@@ -48,7 +48,7 @@
# optimizer
optim_wrapper = dict(
type=OptimWrapper,
- optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.05),
+ optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.1),
clip_grad=dict(max_norm=35, norm_type=2),
paramwise_cfg=dict(
norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
@@ -68,7 +68,7 @@
]
# automatically scaling LR based on the actual training batch size
-auto_scale_lr = dict(base_batch_size=5632)
+auto_scale_lr = dict(base_batch_size=2560)
# codec settings
codec = dict(
@@ -128,14 +128,18 @@
expansion_factor=2,
dropout_rate=0.,
drop_path=0.,
- act_fn=SiLU,
+ act_fn='SiLU',
use_rel_bias=False,
pos_enc=False),
loss=dict(
type=KLDiscretLoss,
use_target_weight=True,
- beta=10.,
- label_softmax=True),
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
decoder=codec),
test_cfg=dict(flip_test=True))
@@ -168,7 +172,7 @@
min_holes=1,
min_height=0.2,
min_width=0.2,
- p=1.0),
+ p=0.5),
]),
dict(
type=GenerateTarget, encoder=codec, use_dataset_keypoint_weights=True),
@@ -217,8 +221,6 @@
(3, 11),
(4, 13),
(5, 15),
- (8, 18),
- (9, 17),
(10, 10),
(11, 8),
(12, 6),
@@ -228,8 +230,6 @@
]
jhmdb_coco133 = [
- (0, 18),
- (2, 17),
(3, 6),
(4, 5),
(5, 12),
@@ -252,7 +252,6 @@
posetrack_coco133 = [
(0, 0),
- (2, 17),
(3, 3),
(4, 4),
(5, 5),
diff --git a/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb704-270e_cocktail13-256x192.py b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb704-270e_cocktail14-256x192.py
similarity index 98%
rename from configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb704-270e_cocktail13-256x192.py
rename to mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb704-270e_cocktail14-256x192.py
index 48275c3c11..b4447fa1d7 100644
--- a/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb704-270e_cocktail13-256x192.py
+++ b/mmpose/configs/wholebody_2d_keypoint/rtmpose/cocktail13/rtmw-x_8xb704-270e_cocktail14-256x192.py
@@ -48,7 +48,7 @@
# optimizer
optim_wrapper = dict(
type=OptimWrapper,
- optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.05),
+ optimizer=dict(type=AdamW, lr=base_lr, weight_decay=0.1),
clip_grad=dict(max_norm=35, norm_type=2),
paramwise_cfg=dict(
norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
@@ -128,14 +128,18 @@
expansion_factor=2,
dropout_rate=0.,
drop_path=0.,
- act_fn=SiLU,
+ act_fn='SiLU',
use_rel_bias=False,
pos_enc=False),
loss=dict(
type=KLDiscretLoss,
use_target_weight=True,
- beta=10.,
- label_softmax=True),
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
decoder=codec),
test_cfg=dict(flip_test=True))
@@ -168,7 +172,7 @@
min_holes=1,
min_height=0.2,
min_width=0.2,
- p=1.0),
+ p=0.5),
]),
dict(
type=GenerateTarget, encoder=codec, use_dataset_keypoint_weights=True),
@@ -180,7 +184,6 @@
dict(type=TopdownAffine, input_size=codec['input_size']),
dict(type=PackPoseInputs)
]
-
train_pipeline_stage2 = [
dict(type=LoadImage, backend_args=backend_args),
dict(type=GetBBoxCenterScale),
@@ -218,8 +221,6 @@
(3, 11),
(4, 13),
(5, 15),
- (8, 18),
- (9, 17),
(10, 10),
(11, 8),
(12, 6),
@@ -229,8 +230,6 @@
]
jhmdb_coco133 = [
- (0, 18),
- (2, 17),
(3, 6),
(4, 5),
(5, 12),
@@ -253,7 +252,6 @@
posetrack_coco133 = [
(0, 0),
- (2, 17),
(3, 3),
(4, 4),
(5, 5),
diff --git a/mmpose/datasets/datasets/body/__init__.py b/mmpose/datasets/datasets/body/__init__.py
index 3ae05a3856..c4e3cca37a 100644
--- a/mmpose/datasets/datasets/body/__init__.py
+++ b/mmpose/datasets/datasets/body/__init__.py
@@ -2,6 +2,7 @@
from .aic_dataset import AicDataset
from .coco_dataset import CocoDataset
from .crowdpose_dataset import CrowdPoseDataset
+from .exlpose_dataset import ExlposeDataset
from .humanart21_dataset import HumanArt21Dataset
from .humanart_dataset import HumanArtDataset
from .jhmdb_dataset import JhmdbDataset
@@ -16,5 +17,5 @@
'CocoDataset', 'MpiiDataset', 'MpiiTrbDataset', 'AicDataset',
'CrowdPoseDataset', 'OCHumanDataset', 'MhpDataset', 'PoseTrack18Dataset',
'JhmdbDataset', 'PoseTrack18VideoDataset', 'HumanArtDataset',
- 'HumanArt21Dataset'
+ 'HumanArt21Dataset', 'ExlposeDataset'
]
diff --git a/mmpose/datasets/datasets/body/exlpose_dataset.py b/mmpose/datasets/datasets/body/exlpose_dataset.py
new file mode 100644
index 0000000000..ad29f5d751
--- /dev/null
+++ b/mmpose/datasets/datasets/body/exlpose_dataset.py
@@ -0,0 +1,69 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from mmpose.registry import DATASETS
+from ..base import BaseCocoStyleDataset
+
+
+@DATASETS.register_module()
+class ExlposeDataset(BaseCocoStyleDataset):
+ """Exlpose dataset for pose estimation.
+
+ "Human Pose Estimation in Extremely Low-Light Conditions",
+ CVPR'2023.
+ More details can be found in the `paper
+ `__.
+
+ ExLPose keypoints:
+ 0: "left_shoulder",
+ 1: "right_shoulder",
+ 2: "left_elbow",
+ 3: "right_elbow",
+ 4: "left_wrist",
+ 5: "right_wrist",
+ 6: "left_hip",
+ 7: "right_hip",
+ 8: "left_knee",
+ 9: "right_knee",
+ 10: "left_ankle",
+ 11: "right_ankle",
+ 12: "head",
+ 13: "neck"
+
+ Args:
+ ann_file (str): Annotation file path. Default: ''.
+ bbox_file (str, optional): Detection result file path. If
+ ``bbox_file`` is set, detected bboxes loaded from this file will
+ be used instead of ground-truth bboxes. This setting is only for
+ evaluation, i.e., ignored when ``test_mode`` is ``False``.
+ Default: ``None``.
+ data_mode (str): Specifies the mode of data samples: ``'topdown'`` or
+ ``'bottomup'``. In ``'topdown'`` mode, each data sample contains
+ one instance; while in ``'bottomup'`` mode, each data sample
+        contains all instances in an image. Default: ``'topdown'``
+ metainfo (dict, optional): Meta information for dataset, such as class
+ information. Default: ``None``.
+ data_root (str, optional): The root directory for ``data_prefix`` and
+ ``ann_file``. Default: ``None``.
+ data_prefix (dict, optional): Prefix for training data. Default:
+ ``dict(img=None, ann=None)``.
+ filter_cfg (dict, optional): Config for filter data. Default: `None`.
+ indices (int or Sequence[int], optional): Support using first few
+ data in annotation file to facilitate training/testing on a smaller
+ dataset. Default: ``None`` which means using all ``data_infos``.
+ serialize_data (bool, optional): Whether to hold memory using
+ serialized objects, when enabled, data loader workers can use
+ shared RAM from master process instead of making a copy.
+ Default: ``True``.
+ pipeline (list, optional): Processing pipeline. Default: [].
+ test_mode (bool, optional): ``test_mode=True`` means in test phase.
+ Default: ``False``.
+ lazy_init (bool, optional): Whether to load annotation during
+ instantiation. In some cases, such as visualization, only the meta
+ information of the dataset is needed, which is not necessary to
+ load annotation file. ``Basedataset`` can skip load annotations to
+ save time by set ``lazy_init=False``. Default: ``False``.
+ max_refetch (int, optional): If ``Basedataset.prepare_data`` get a
+ None img. The maximum extra number of cycles to get a valid
+ image. Default: 1000.
+ """
+
+ METAINFO: dict = dict(from_file='configs/_base_/datasets/exlpose.py')
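> Editor's note: a hypothetical top-down dataset entry using the new `ExlposeDataset`; the `data_root`, annotation file name, and image prefix below are placeholders and should be replaced with the paths from the dataset preparation docs.

```python
# Minimal (hypothetical) config dict for the new dataset.
dataset_exlpose = dict(
    type='ExlposeDataset',
    data_root='data/ExLPose/',            # placeholder
    data_mode='topdown',
    ann_file='annotations/exlpose_train.json',  # placeholder file name
    data_prefix=dict(img=''),
    pipeline=[],
)
```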
diff --git a/mmpose/datasets/datasets/body/mpii_dataset.py b/mmpose/datasets/datasets/body/mpii_dataset.py
index 5490f6f0dd..28d53bd8b8 100644
--- a/mmpose/datasets/datasets/body/mpii_dataset.py
+++ b/mmpose/datasets/datasets/body/mpii_dataset.py
@@ -181,7 +181,8 @@ def _load_annotations(self) -> Tuple[List[dict], List[dict]]:
bbox = bbox_cs2xyxy(center, scale)
# load keypoints in shape [1, K, 2] and keypoints_visible in [1, K]
- keypoints = np.array(ann['joints']).reshape(1, -1, 2)
+ keypoints = np.array(
+ ann['joints'], dtype=np.float32).reshape(1, -1, 2)
keypoints_visible = np.array(ann['joints_vis']).reshape(1, -1)
x1, y1, x2, y2 = np.split(bbox, axis=1, indices_or_sections=4)
diff --git a/mmpose/datasets/datasets/wholebody/coco_wholebody_dataset.py b/mmpose/datasets/datasets/wholebody/coco_wholebody_dataset.py
index 00a2ea418f..9c8b88c20f 100644
--- a/mmpose/datasets/datasets/wholebody/coco_wholebody_dataset.py
+++ b/mmpose/datasets/datasets/wholebody/coco_wholebody_dataset.py
@@ -105,6 +105,12 @@ def parse_data_info(self, raw_data_info: dict) -> Optional[dict]:
keypoints = _keypoints[..., :2]
keypoints_visible = np.minimum(1, _keypoints[..., 2] > 0)
+ if 'area' in ann:
+ area = np.array(ann['area'], dtype=np.float32)
+ else:
+ area = np.clip((x2 - x1) * (y2 - y1) * 0.53, a_min=1.0, a_max=None)
+ area = np.array(area, dtype=np.float32)
+
num_keypoints = ann['num_keypoints']
data_info = {
@@ -117,6 +123,7 @@ def parse_data_info(self, raw_data_info: dict) -> Optional[dict]:
'keypoints_visible': keypoints_visible,
'iscrowd': ann['iscrowd'],
'segmentation': ann['segmentation'],
+ 'area': area,
'id': ann['id'],
'category_id': ann['category_id'],
# store the raw annotation of the instance
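The fallback above approximates instance area as 0.53 of the bounding-box area, a common approximation for the person-area-to-bbox-area ratio in COCO-style datasets. A standalone sketch of that computation with illustrative numbers:

```python
import numpy as np

# When 'area' is missing from the annotation, approximate it from the bbox,
# scale by 0.53, and clip to at least 1 so downstream normalization never
# divides by zero.
x1, y1, x2, y2 = 30.0, 40.0, 130.0, 240.0            # illustrative bbox corners
area = np.clip((x2 - x1) * (y2 - y1) * 0.53, a_min=1.0, a_max=None)
area = np.array(area, dtype=np.float32)
print(area)  # 10600.0
```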
diff --git a/mmpose/datasets/datasets/wholebody3d/__init__.py b/mmpose/datasets/datasets/wholebody3d/__init__.py
index db0e25b155..19e1fe2f6c 100644
--- a/mmpose/datasets/datasets/wholebody3d/__init__.py
+++ b/mmpose/datasets/datasets/wholebody3d/__init__.py
@@ -1,4 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.
+from .h3wb_dataset import H36MWholeBodyDataset
from .ubody3d_dataset import UBody3dDataset
-__all__ = ['UBody3dDataset']
+__all__ = ['UBody3dDataset', 'H36MWholeBodyDataset']
diff --git a/mmpose/datasets/datasets/wholebody3d/h3wb_dataset.py b/mmpose/datasets/datasets/wholebody3d/h3wb_dataset.py
new file mode 100644
index 0000000000..95e40db4b4
--- /dev/null
+++ b/mmpose/datasets/datasets/wholebody3d/h3wb_dataset.py
@@ -0,0 +1,213 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from typing import List, Tuple
+
+import numpy as np
+from mmengine.fileio import get_local_path
+
+from mmpose.registry import DATASETS
+from ..body3d import Human36mDataset
+
+
+@DATASETS.register_module()
+class H36MWholeBodyDataset(Human36mDataset):
+    """Human3.6M 3D WholeBody Dataset.
+
+ "H3WB: Human3.6M 3D WholeBody Dataset and Benchmark", ICCV'2023.
+ More details can be found in the `paper
+ `__.
+
+ H36M-WholeBody keypoints::
+
+ 0-16: 17 body keypoints,
+ 17-22: 6 foot keypoints,
+ 23-90: 68 face keypoints,
+ 91-132: 42 hand keypoints
+
+ In total, we have 133 keypoints for wholebody pose estimation.
+
+ Args:
+ ann_file (str): Annotation file path. Default: ''.
+ seq_len (int): Number of frames in a sequence. Default: 1.
+ seq_step (int): The interval for extracting frames from the video.
+ Default: 1.
+ multiple_target (int): If larger than 0, merge every
+ ``multiple_target`` sequence together. Default: 0.
+ multiple_target_step (int): The interval for merging sequence. Only
+ valid when ``multiple_target`` is larger than 0. Default: 0.
+ pad_video_seq (bool): Whether to pad the video so that poses will be
+ predicted for every frame in the video. Default: ``False``.
+ causal (bool): If set to ``True``, the rightmost input frame will be
+ the target frame. Otherwise, the middle input frame will be the
+ target frame. Default: ``True``.
+ subset_frac (float): The fraction to reduce dataset size. If set to 1,
+ the dataset size is not reduced. Default: 1.
+        keypoint_2d_src (str): Specifies the source of 2D keypoint
+            information, which should be one of the following options:
+
+            - ``'gt'``: load from the annotation file
+            - ``'detection'``: load from a 2D keypoint detection result file
+            - ``'pipeline'``: the information will be generated by the pipeline
+
+            Default: ``'gt'``.
+        keypoint_2d_det_file (str, optional): The 2D keypoint detection file.
+            If set, 2D keypoints loaded from this file will be used instead of
+            ground-truth keypoints. This setting is only used when
+            ``keypoint_2d_src`` is ``'detection'``. Default: ``None``.
+ factor_file (str, optional): The projection factors' file. If set,
+ factor loaded from this file will be used instead of calculated
+ factors. Default: ``None``.
+ camera_param_file (str): Cameras' parameters file. Default: ``None``.
+ data_mode (str): Specifies the mode of data samples: ``'topdown'`` or
+ ``'bottomup'``. In ``'topdown'`` mode, each data sample contains
+ one instance; while in ``'bottomup'`` mode, each data sample
+            contains all instances in an image. Default: ``'topdown'``.
+ metainfo (dict, optional): Meta information for dataset, such as class
+ information. Default: ``None``.
+ data_root (str, optional): The root directory for ``data_prefix`` and
+ ``ann_file``. Default: ``None``.
+ data_prefix (dict, optional): Prefix for training data.
+ Default: ``dict(img='')``.
+        filter_cfg (dict, optional): Config for filtering data. Default:
+            ``None``.
+        indices (int or Sequence[int], optional): If set, only the first few
+            samples in the annotation file are used, which facilitates
+            training/testing on a smaller dataset. Default: ``None``, which
+            means using all ``data_infos``.
+        serialize_data (bool, optional): Whether to hold memory as serialized
+            objects. When enabled, data loader workers can use shared RAM from
+            the master process instead of making a copy. Default: ``True``.
+ pipeline (list, optional): Processing pipeline. Default: [].
+ test_mode (bool, optional): ``test_mode=True`` means in test phase.
+ Default: ``False``.
+        lazy_init (bool, optional): Whether to defer loading annotations at
+            instantiation. In some cases, such as visualization, only the meta
+            information of the dataset is needed, so the annotation file does
+            not have to be loaded. ``BaseDataset`` can skip loading annotations
+            to save time by setting ``lazy_init=True``. Default: ``False``.
+        max_refetch (int, optional): The maximum number of extra cycles used
+            to fetch a valid image when ``BaseDataset.prepare_data`` returns
+            ``None``. Default: 1000.
+ """
+
+ def __init__(self, test_mode: bool = False, **kwargs):
+
+ self.camera_order_id = ['54138969', '55011271', '58860488', '60457274']
+ if not test_mode:
+ self.subjects = ['S1', 'S5', 'S6']
+ else:
+ self.subjects = ['S7']
+
+ super().__init__(test_mode=test_mode, **kwargs)
+
+ def _load_ann_file(self, ann_file: str) -> dict:
+ with get_local_path(ann_file) as local_path:
+ data = np.load(local_path, allow_pickle=True)
+
+ self.ann_data = data['train_data'].item()
+ self.camera_data = data['metadata'].item()
+
+ def get_sequence_indices(self) -> List[List[int]]:
+ return []
+
+ def _load_annotations(self) -> Tuple[List[dict], List[dict]]:
+
+ instance_list = []
+ image_list = []
+
+ instance_id = 0
+ for subject in self.subjects:
+ actions = self.ann_data[subject].keys()
+ for act in actions:
+ for cam in self.camera_order_id:
+ if cam not in self.ann_data[subject][act]:
+ continue
+ keypoints_2d = self.ann_data[subject][act][cam]['pose_2d']
+ keypoints_3d = self.ann_data[subject][act][cam][
+ 'camera_3d']
+ num_keypoints = keypoints_2d.shape[1]
+
+ camera_param = self.camera_data[subject][cam]
+ camera_param = {
+ 'K': camera_param['K'][0, :2, ...],
+ 'R': camera_param['R'][0],
+ 'T': camera_param['T'].reshape(3, 1),
+ 'Distortion': camera_param['Distortion'][0]
+ }
+
+ seq_step = 1
+ _len = (self.seq_len - 1) * seq_step + 1
+ _indices = list(
+ range(len(self.ann_data[subject][act]['frame_id'])))
+ seq_indices = [
+ _indices[i:(i + _len):seq_step]
+ for i in list(range(0,
+ len(_indices) - _len + 1))
+ ]
+
+ for idx, frame_ids in enumerate(seq_indices):
+ expected_num_frames = self.seq_len
+ if self.multiple_target:
+ expected_num_frames = self.multiple_target
+
+ assert len(frame_ids) == (expected_num_frames), (
+ f'Expected `frame_ids` == {expected_num_frames}, but ' # noqa
+ f'got {len(frame_ids)} ')
+
+ _kpts_2d = keypoints_2d[frame_ids]
+ _kpts_3d = keypoints_3d[frame_ids]
+
+ target_idx = [-1] if self.causal else [
+ int(self.seq_len) // 2
+ ]
+ if self.multiple_target > 0:
+ target_idx = list(range(self.multiple_target))
+
+ instance_info = {
+ 'num_keypoints':
+ num_keypoints,
+ 'keypoints':
+ _kpts_2d,
+ 'keypoints_3d':
+ _kpts_3d / 1000,
+ 'keypoints_visible':
+ np.ones_like(_kpts_2d[..., 0], dtype=np.float32),
+ 'keypoints_3d_visible':
+ np.ones_like(_kpts_2d[..., 0], dtype=np.float32),
+ 'scale':
+ np.zeros((1, 1), dtype=np.float32),
+ 'center':
+ np.zeros((1, 2), dtype=np.float32),
+ 'factor':
+ np.zeros((1, 1), dtype=np.float32),
+ 'id':
+ instance_id,
+ 'category_id':
+ 1,
+ 'iscrowd':
+ 0,
+ 'camera_param':
+ camera_param,
+ 'img_paths': [
+ f'{subject}/{act}/{cam}/{i:06d}.jpg'
+ for i in frame_ids
+ ],
+ 'img_ids':
+ frame_ids,
+ 'lifting_target':
+ _kpts_3d[target_idx] / 1000,
+ 'lifting_target_visible':
+ np.ones_like(_kpts_2d[..., 0],
+ dtype=np.float32)[target_idx],
+ }
+ instance_list.append(instance_info)
+
+ if self.data_mode == 'bottomup':
+ for idx, img_name in enumerate(
+ instance_info['img_paths']):
+ img_info = self.get_img_info(idx, img_name)
+ image_list.append(img_info)
+
+ instance_id += 1
+
+ return instance_list, image_list
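A hedged config sketch for the new dataset; the ``.npz`` file name is a placeholder, but it is expected to contain the ``train_data`` and ``metadata`` entries read in ``_load_ann_file``:

```python
# Hypothetical dataset entry for 3D wholebody lifting. File names and the
# sequence settings are assumptions; only the expected npz keys
# ('train_data', 'metadata') follow from the loader above.
train_dataset = dict(
    type='H36MWholeBodyDataset',
    data_root='data/h3wb/',                 # assumed local root
    ann_file='annotations/h3wb_train.npz',  # assumed file name
    seq_len=1,
    causal=True,
    data_mode='topdown',
    pipeline=[],
)
```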
diff --git a/mmpose/datasets/transforms/bottomup_transforms.py b/mmpose/datasets/transforms/bottomup_transforms.py
index 0175e013dc..c27afd042a 100644
--- a/mmpose/datasets/transforms/bottomup_transforms.py
+++ b/mmpose/datasets/transforms/bottomup_transforms.py
@@ -59,14 +59,15 @@ def _segs_to_mask(self, segs: list, img_shape: Tuple[int,
# mask.py>`__
rles = []
for seg in segs:
- rle = cocomask.frPyObjects(seg, img_shape[0], img_shape[1])
- if isinstance(rle, list):
- # For non-crowded objects (e.g. human with no visible
- # keypoints), the results is a list of rles
- rles.extend(rle)
- else:
- # For crowded objects, the result is a single rle
- rles.append(rle)
+ if isinstance(seg, (tuple, list)):
+ rle = cocomask.frPyObjects(seg, img_shape[0], img_shape[1])
+ if isinstance(rle, list):
+ # For non-crowded objects (e.g. human with no visible
+                    # keypoints), the result is a list of rles
+ rles.extend(rle)
+ else:
+ # For crowded objects, the result is a single rle
+ rles.append(rle)
if rles:
mask = cocomask.decode(cocomask.merge(rles))
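A self-contained sketch of the guarded conversion above: polygon segmentations (lists) are turned into RLEs and merged, while non-list entries are skipped.

```python
import numpy as np
from pycocotools import mask as cocomask

h, w = 64, 64
segs = [
    [[10, 10, 40, 10, 40, 40, 10, 40]],  # polygon -> converted to RLE
    None,                                # malformed entry -> skipped by the guard
]

rles = []
for seg in segs:
    if isinstance(seg, (tuple, list)):
        rle = cocomask.frPyObjects(seg, h, w)
        # non-crowd polygons yield a list of RLEs, crowd regions a single RLE
        rles.extend(rle if isinstance(rle, list) else [rle])

mask = cocomask.decode(cocomask.merge(rles)) if rles else np.zeros((h, w), np.uint8)
print(mask.shape, int(mask.sum()))  # mask shape and merged foreground pixel count
```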
diff --git a/mmpose/datasets/transforms/converting.py b/mmpose/datasets/transforms/converting.py
index bca000435a..b7e214733f 100644
--- a/mmpose/datasets/transforms/converting.py
+++ b/mmpose/datasets/transforms/converting.py
@@ -92,6 +92,11 @@ def __init__(self, num_keypoints: int,
def transform(self, results: dict) -> dict:
"""Transforms the keypoint results to match the target keypoints."""
num_instances = results['keypoints'].shape[0]
+
+ if 'keypoints_visible' not in results:
+ results['keypoints_visible'] = np.ones(
+ (num_instances, results['keypoints'].shape[1]))
+
if len(results['keypoints_visible'].shape) > 2:
results['keypoints_visible'] = results['keypoints_visible'][:, :,
0]
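A minimal standalone illustration of the default added above: when ``keypoints_visible`` is absent, every keypoint of every instance is assumed visible.

```python
import numpy as np

results = {'keypoints': np.zeros((2, 17, 2), dtype=np.float32)}  # 2 instances, 17 kpts
num_instances = results['keypoints'].shape[0]

# Same fallback as in the converting transform above: full visibility by default.
if 'keypoints_visible' not in results:
    results['keypoints_visible'] = np.ones(
        (num_instances, results['keypoints'].shape[1]))

print(results['keypoints_visible'].shape)  # (2, 17)
```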
diff --git a/mmpose/datasets/transforms/formatting.py b/mmpose/datasets/transforms/formatting.py
index d3f3ec04aa..fc9307b298 100644
--- a/mmpose/datasets/transforms/formatting.py
+++ b/mmpose/datasets/transforms/formatting.py
@@ -251,5 +251,6 @@ def __repr__(self) -> str:
str: Formatted string.
"""
repr_str = self.__class__.__name__
- repr_str += f'(meta_keys={self.meta_keys})'
+ repr_str += f'(meta_keys={self.meta_keys}, '
+ repr_str += f'pack_transformed={self.pack_transformed})'
return repr_str
diff --git a/mmpose/datasets/transforms/pose3d_transforms.py b/mmpose/datasets/transforms/pose3d_transforms.py
index 5831692000..9dec8db64b 100644
--- a/mmpose/datasets/transforms/pose3d_transforms.py
+++ b/mmpose/datasets/transforms/pose3d_transforms.py
@@ -44,11 +44,11 @@ class RandomFlipAroundRoot(BaseTransform):
"""
def __init__(self,
- keypoints_flip_cfg,
- target_flip_cfg,
- flip_prob=0.5,
- flip_camera=False,
- flip_label=False):
+ keypoints_flip_cfg: dict,
+ target_flip_cfg: dict,
+ flip_prob: float = 0.5,
+ flip_camera: bool = False,
+ flip_label: bool = False):
self.keypoints_flip_cfg = keypoints_flip_cfg
self.target_flip_cfg = target_flip_cfg
self.flip_prob = flip_prob
@@ -104,11 +104,20 @@ def transform(self, results: Dict) -> dict:
_camera_param = deepcopy(results['camera_param'])
keypoints, keypoints_visible = flip_keypoints_custom_center(
- keypoints, keypoints_visible, flip_indices,
- **self.keypoints_flip_cfg)
+ keypoints,
+ keypoints_visible,
+ flip_indices,
+ center_mode=self.keypoints_flip_cfg.get(
+ 'center_mode', 'static'),
+ center_x=self.keypoints_flip_cfg.get('center_x', 0.5),
+ center_index=self.keypoints_flip_cfg.get('center_index', 0))
lifting_target, lifting_target_visible = flip_keypoints_custom_center( # noqa
- lifting_target, lifting_target_visible, flip_indices,
- **self.target_flip_cfg)
+ lifting_target,
+ lifting_target_visible,
+ flip_indices,
+ center_mode=self.target_flip_cfg.get('center_mode', 'static'),
+ center_x=self.target_flip_cfg.get('center_x', 0.5),
+ center_index=self.target_flip_cfg.get('center_index', 0))
results[keypoints_key] = keypoints
results[keypoints_visible_key] = keypoints_visible
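A hedged pipeline snippet showing the transform with the now-explicit flip-config keys; any other keys in these dicts are simply ignored after this change:

```python
# Hypothetical pipeline entry. 'static'/'root' are the center modes handled by
# flip_keypoints_custom_center; the numeric values are illustrative.
flip_around_root = dict(
    type='RandomFlipAroundRoot',
    keypoints_flip_cfg=dict(center_mode='static', center_x=0.5),
    target_flip_cfg=dict(center_mode='root', center_index=0),
    flip_prob=0.5,
)
```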
diff --git a/mmpose/datasets/transforms/topdown_transforms.py b/mmpose/datasets/transforms/topdown_transforms.py
index 18e85d9664..3480c5b38c 100644
--- a/mmpose/datasets/transforms/topdown_transforms.py
+++ b/mmpose/datasets/transforms/topdown_transforms.py
@@ -118,7 +118,10 @@ def transform(self, results: Dict) -> Optional[dict]:
results['img'], warp_mat, warp_size, flags=cv2.INTER_LINEAR)
if results.get('keypoints', None) is not None:
- transformed_keypoints = results['keypoints'].copy()
+ if results.get('transformed_keypoints', None) is not None:
+ transformed_keypoints = results['transformed_keypoints'].copy()
+ else:
+ transformed_keypoints = results['keypoints'].copy()
# Only transform (x, y) coordinates
transformed_keypoints[..., :2] = cv2.transform(
results['keypoints'][..., :2], warp_mat)
diff --git a/mmpose/engine/hooks/__init__.py b/mmpose/engine/hooks/__init__.py
index 2c31ca081c..2527a258bc 100644
--- a/mmpose/engine/hooks/__init__.py
+++ b/mmpose/engine/hooks/__init__.py
@@ -1,11 +1,11 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .badcase_hook import BadCaseAnalysisHook
from .ema_hook import ExpMomentumEMA
-from .mode_switch_hooks import YOLOXPoseModeSwitchHook
+from .mode_switch_hooks import RTMOModeSwitchHook, YOLOXPoseModeSwitchHook
from .sync_norm_hook import SyncNormHook
from .visualization_hook import PoseVisualizationHook
__all__ = [
'PoseVisualizationHook', 'ExpMomentumEMA', 'BadCaseAnalysisHook',
- 'YOLOXPoseModeSwitchHook', 'SyncNormHook'
+ 'YOLOXPoseModeSwitchHook', 'SyncNormHook', 'RTMOModeSwitchHook'
]
diff --git a/mmpose/engine/hooks/mode_switch_hooks.py b/mmpose/engine/hooks/mode_switch_hooks.py
index 862e36dc0b..8990ecab67 100644
--- a/mmpose/engine/hooks/mode_switch_hooks.py
+++ b/mmpose/engine/hooks/mode_switch_hooks.py
@@ -1,12 +1,13 @@
# Copyright (c) OpenMMLab. All rights reserved.
import copy
-from typing import Sequence
+from typing import Dict, Sequence
from mmengine.hooks import Hook
from mmengine.model import is_model_wrapper
from mmengine.runner import Runner
from mmpose.registry import HOOKS
+from mmpose.utils.hooks import rgetattr, rsetattr
@HOOKS.register_module()
@@ -63,3 +64,45 @@ def before_train_epoch(self, runner: Runner):
self._modify_dataloader(runner)
runner.logger.info('Added additional reg loss now!')
model.head.use_aux_loss = True
+
+
+@HOOKS.register_module()
+class RTMOModeSwitchHook(Hook):
+ """A hook to switch the mode of RTMO during training.
+
+ This hook allows for dynamic adjustments of model attributes at specified
+ training epochs. It is designed to modify configurations such as turning
+ off specific augmentations or changing loss functions at different stages
+ of the training process.
+
+ Args:
+ epoch_attributes (Dict[str, Dict]): A dictionary where keys are epoch
+ numbers and values are attribute modification dictionaries. Each
+ dictionary specifies the attribute to modify and its new value.
+
+ Example:
+        epoch_attributes = {
+            5: {"attr1.subattr": new_value1, "attr2.subattr": new_value2},
+            10: {"attr3.subattr": new_value3}
+        }
+ """
+
+ def __init__(self, epoch_attributes: Dict[int, Dict]):
+ self.epoch_attributes = epoch_attributes
+
+ def before_train_epoch(self, runner: Runner):
+ """Method called before each training epoch.
+
+ It checks if the current epoch is in the `epoch_attributes` mapping and
+ applies the corresponding attribute changes to the model.
+ """
+ epoch = runner.epoch
+ model = runner.model
+ if is_model_wrapper(model):
+ model = model.module
+
+ if epoch in self.epoch_attributes:
+ for key, value in self.epoch_attributes[epoch].items():
+ rsetattr(model.head, key, value)
+ runner.logger.info(
+ f'Change model.head.{key} to {rgetattr(model.head, key)}')
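A hedged example of registering the new hook in a config. The epoch number and attribute values are placeholders; in practice the keys point at real sub-attributes of the RTMO head (e.g. loss weights):

```python
# Hypothetical custom hook entry; epoch and attribute values are illustrative.
custom_hooks = [
    dict(
        type='RTMOModeSwitchHook',
        epoch_attributes={
            280: {'proxy_target_cc': True, 'loss_mle.loss_weight': 1.0},
        },
    ),
]
```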
diff --git a/mmpose/engine/optim_wrappers/__init__.py b/mmpose/engine/optim_wrappers/__init__.py
index 7c0b1f533a..16174c500f 100644
--- a/mmpose/engine/optim_wrappers/__init__.py
+++ b/mmpose/engine/optim_wrappers/__init__.py
@@ -1,4 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
+from .force_default_constructor import ForceDefaultOptimWrapperConstructor
from .layer_decay_optim_wrapper import LayerDecayOptimWrapperConstructor
-__all__ = ['LayerDecayOptimWrapperConstructor']
+__all__ = [
+ 'LayerDecayOptimWrapperConstructor', 'ForceDefaultOptimWrapperConstructor'
+]
diff --git a/mmpose/engine/optim_wrappers/force_default_constructor.py b/mmpose/engine/optim_wrappers/force_default_constructor.py
new file mode 100644
index 0000000000..f45291a73b
--- /dev/null
+++ b/mmpose/engine/optim_wrappers/force_default_constructor.py
@@ -0,0 +1,255 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import logging
+from typing import List, Optional, Union
+
+import torch
+import torch.nn as nn
+from mmengine.logging import print_log
+from mmengine.optim import DefaultOptimWrapperConstructor
+from mmengine.utils.dl_utils import mmcv_full_available
+from mmengine.utils.dl_utils.parrots_wrapper import _BatchNorm, _InstanceNorm
+from torch.nn import GroupNorm, LayerNorm
+
+from mmpose.registry import OPTIM_WRAPPER_CONSTRUCTORS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class ForceDefaultOptimWrapperConstructor(DefaultOptimWrapperConstructor):
+ """Default constructor with forced optimizer settings.
+
+ This constructor extends the default constructor to add an option for
+ forcing default optimizer settings. This is useful for ensuring that
+ certain parameters or layers strictly adhere to pre-defined default
+ settings, regardless of any custom settings specified.
+
+    By default, each parameter shares the same optimizer settings, and we
+ provide an argument ``paramwise_cfg`` to specify parameter-wise settings.
+ It is a dict and may contain various fields like 'custom_keys',
+ 'bias_lr_mult', etc., as well as the additional field
+ `force_default_settings` which allows for enforcing default settings on
+ optimizer parameters.
+
+ - ``custom_keys`` (dict): Specified parameters-wise settings by keys. If
+ one of the keys in ``custom_keys`` is a substring of the name of one
+ parameter, then the setting of the parameter will be specified by
+ ``custom_keys[key]`` and other setting like ``bias_lr_mult`` etc. will
+ be ignored. It should be noted that the aforementioned ``key`` is the
+ longest key that is a substring of the name of the parameter. If there
+      are multiple matched keys with the same length, the key that comes
+      first in alphabetical order will be chosen.
+ ``custom_keys[key]`` should be a dict and may contain fields ``lr_mult``
+ and ``decay_mult``. See Example 2 below.
+ - ``bias_lr_mult`` (float): It will be multiplied to the learning
+ rate for all bias parameters (except for those in normalization
+ layers and offset layers of DCN).
+ - ``bias_decay_mult`` (float): It will be multiplied to the weight
+ decay for all bias parameters (except for those in
+ normalization layers, depthwise conv layers, offset layers of DCN).
+ - ``norm_decay_mult`` (float): It will be multiplied to the weight
+ decay for all weight and bias parameters of normalization
+ layers.
+ - ``flat_decay_mult`` (float): It will be multiplied to the weight
+      decay for all one-dimensional parameters.
+ - ``dwconv_decay_mult`` (float): It will be multiplied to the weight
+ decay for all weight and bias parameters of depthwise conv
+ layers.
+ - ``dcn_offset_lr_mult`` (float): It will be multiplied to the learning
+ rate for parameters of offset layer in the deformable convs
+ of a model.
+ - ``bypass_duplicate`` (bool): If true, the duplicate parameters
+ would not be added into optimizer. Defaults to False.
+ - ``force_default_settings`` (bool): If true, this will override any
+ custom settings defined by ``custom_keys`` and enforce the use of
+ default settings for optimizer parameters like ``bias_lr_mult``.
+ This is particularly useful when you want to ensure that certain layers
+ or parameters adhere strictly to the pre-defined default settings.
+
+ Note:
+
+ 1. If the option ``dcn_offset_lr_mult`` is used, the constructor will
+ override the effect of ``bias_lr_mult`` in the bias of offset layer.
+ So be careful when using both ``bias_lr_mult`` and
+ ``dcn_offset_lr_mult``. If you wish to apply both of them to the offset
+ layer in deformable convs, set ``dcn_offset_lr_mult`` to the original
+ ``dcn_offset_lr_mult`` * ``bias_lr_mult``.
+
+ 2. If the option ``dcn_offset_lr_mult`` is used, the constructor will
+ apply it to all the DCN layers in the model. So be careful when the
+ model contains multiple DCN layers in places other than backbone.
+
+ 3. When the option ``force_default_settings`` is true, it will override
+ any custom settings provided in ``custom_keys``. This ensures that the
+ default settings for the optimizer parameters are used.
+
+ Args:
+ optim_wrapper_cfg (dict): The config dict of the optimizer wrapper.
+
+ Required fields of ``optim_wrapper_cfg`` are
+
+ - ``type``: class name of the OptimizerWrapper
+ - ``optimizer``: The configuration of optimizer.
+
+ Optional fields of ``optim_wrapper_cfg`` are
+
+ - any arguments of the corresponding optimizer wrapper type,
+ e.g., accumulative_counts, clip_grad, etc.
+
+ Required fields of ``optimizer`` are
+
+ - `type`: class name of the optimizer.
+
+ Optional fields of ``optimizer`` are
+
+ - any arguments of the corresponding optimizer type, e.g.,
+ lr, weight_decay, momentum, etc.
+
+ paramwise_cfg (dict, optional): Parameter-wise options.
+
+ Example 1:
+ >>> model = torch.nn.modules.Conv1d(1, 1, 1)
+ >>> optim_wrapper_cfg = dict(
+ >>> dict(type='OptimWrapper', optimizer=dict(type='SGD', lr=0.01,
+ >>> momentum=0.9, weight_decay=0.0001))
+ >>> paramwise_cfg = dict(norm_decay_mult=0.)
+ >>> optim_wrapper_builder = DefaultOptimWrapperConstructor(
+ >>> optim_wrapper_cfg, paramwise_cfg)
+ >>> optim_wrapper = optim_wrapper_builder(model)
+
+ Example 2:
+ >>> # assume model have attribute model.backbone and model.cls_head
+ >>> optim_wrapper_cfg = dict(type='OptimWrapper', optimizer=dict(
+ >>> type='SGD', lr=0.01, weight_decay=0.95))
+ >>> paramwise_cfg = dict(custom_keys={
+ >>> 'backbone': dict(lr_mult=0.1, decay_mult=0.9)})
+ >>> optim_wrapper_builder = DefaultOptimWrapperConstructor(
+ >>> optim_wrapper_cfg, paramwise_cfg)
+ >>> optim_wrapper = optim_wrapper_builder(model)
+ >>> # Then the `lr` and `weight_decay` for model.backbone is
+ >>> # (0.01 * 0.1, 0.95 * 0.9). `lr` and `weight_decay` for
+ >>> # model.cls_head is (0.01, 0.95).
+ """
+
+ def add_params(self,
+ params: List[dict],
+ module: nn.Module,
+ prefix: str = '',
+ is_dcn_module: Optional[Union[int, float]] = None) -> None:
+ """Add all parameters of module to the params list.
+
+ The parameters of the given module will be added to the list of param
+ groups, with specific rules defined by paramwise_cfg.
+
+ Args:
+ params (list[dict]): A list of param groups, it will be modified
+ in place.
+ module (nn.Module): The module to be added.
+ prefix (str): The prefix of the module
+ is_dcn_module (int|float|None): If the current module is a
+ submodule of DCN, `is_dcn_module` will be passed to
+ control conv_offset layer's learning rate. Defaults to None.
+ """
+ # get param-wise options
+ custom_keys = self.paramwise_cfg.get('custom_keys', {})
+        # sort alphabetically first, then by decreasing key length
+ sorted_keys = sorted(sorted(custom_keys.keys()), key=len, reverse=True)
+
+ bias_lr_mult = self.paramwise_cfg.get('bias_lr_mult', None)
+ bias_decay_mult = self.paramwise_cfg.get('bias_decay_mult', None)
+ norm_decay_mult = self.paramwise_cfg.get('norm_decay_mult', None)
+ dwconv_decay_mult = self.paramwise_cfg.get('dwconv_decay_mult', None)
+ flat_decay_mult = self.paramwise_cfg.get('flat_decay_mult', None)
+ bypass_duplicate = self.paramwise_cfg.get('bypass_duplicate', False)
+ dcn_offset_lr_mult = self.paramwise_cfg.get('dcn_offset_lr_mult', None)
+ force_default_settings = self.paramwise_cfg.get(
+ 'force_default_settings', False)
+
+ # special rules for norm layers and depth-wise conv layers
+ is_norm = isinstance(module,
+ (_BatchNorm, _InstanceNorm, GroupNorm, LayerNorm))
+ is_dwconv = (
+ isinstance(module, torch.nn.Conv2d)
+ and module.in_channels == module.groups)
+
+ for name, param in module.named_parameters(recurse=False):
+ param_group = {'params': [param]}
+ if bypass_duplicate and self._is_in(param_group, params):
+ print_log(
+ f'{prefix} is duplicate. It is skipped since '
+ f'bypass_duplicate={bypass_duplicate}',
+ logger='current',
+ level=logging.WARNING)
+ continue
+ if not param.requires_grad:
+ params.append(param_group)
+ continue
+
+            # if the parameter matches one of the custom keys, ignore other rules
+ is_custom = False
+ for key in sorted_keys:
+ if key in f'{prefix}.{name}':
+ is_custom = True
+ lr_mult = custom_keys[key].get('lr_mult', 1.)
+ param_group['lr'] = self.base_lr * lr_mult
+ if self.base_wd is not None:
+ decay_mult = custom_keys[key].get('decay_mult', 1.)
+ param_group['weight_decay'] = self.base_wd * decay_mult
+ # add custom settings to param_group
+ for k, v in custom_keys[key].items():
+ param_group[k] = v
+ break
+
+ if not is_custom or force_default_settings:
+ # bias_lr_mult affects all bias parameters
+                # except for norm.bias and dcn.conv_offset.bias
+ if name == 'bias' and not (
+ is_norm or is_dcn_module) and bias_lr_mult is not None:
+ param_group['lr'] = self.base_lr * bias_lr_mult
+
+ if (prefix.find('conv_offset') != -1 and is_dcn_module
+ and dcn_offset_lr_mult is not None
+ and isinstance(module, torch.nn.Conv2d)):
+ # deal with both dcn_offset's bias & weight
+ param_group['lr'] = self.base_lr * dcn_offset_lr_mult
+
+ # apply weight decay policies
+ if self.base_wd is not None:
+ # norm decay
+ if is_norm and norm_decay_mult is not None:
+ param_group[
+ 'weight_decay'] = self.base_wd * norm_decay_mult
+ # bias lr and decay
+ elif (name == 'bias' and not is_dcn_module
+ and bias_decay_mult is not None):
+ param_group[
+ 'weight_decay'] = self.base_wd * bias_decay_mult
+ # depth-wise conv
+ elif is_dwconv and dwconv_decay_mult is not None:
+ param_group[
+ 'weight_decay'] = self.base_wd * dwconv_decay_mult
+ # flatten parameters except dcn offset
+ elif (param.ndim == 1 and not is_dcn_module
+ and flat_decay_mult is not None):
+ param_group[
+ 'weight_decay'] = self.base_wd * flat_decay_mult
+ params.append(param_group)
+ for key, value in param_group.items():
+ if key == 'params':
+ continue
+ full_name = f'{prefix}.{name}' if prefix else name
+ print_log(
+ f'paramwise_options -- {full_name}:{key}={value}',
+ logger='current')
+
+ if mmcv_full_available():
+ from mmcv.ops import DeformConv2d, ModulatedDeformConv2d
+ is_dcn_module = isinstance(module,
+ (DeformConv2d, ModulatedDeformConv2d))
+ else:
+ is_dcn_module = False
+ for child_name, child_mod in module.named_children():
+ child_prefix = f'{prefix}.{child_name}' if prefix else child_name
+ self.add_params(
+ params,
+ child_mod,
+ prefix=child_prefix,
+ is_dcn_module=is_dcn_module)
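A hedged optimizer-wrapper config illustrating the intended use: the backbone gets a custom ``lr_mult``, while ``force_default_settings=True`` keeps the default decay multipliers applied to it as well. Optimizer type and numbers are illustrative:

```python
# Hypothetical config; values are placeholders.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05),
    paramwise_cfg=dict(
        custom_keys={'backbone': dict(lr_mult=0.1)},
        norm_decay_mult=0.0,
        bias_decay_mult=0.0,
        force_default_settings=True,   # re-apply defaults on top of custom keys
    ),
    constructor='ForceDefaultOptimWrapperConstructor',
)
```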
diff --git a/mmpose/engine/schedulers/__init__.py b/mmpose/engine/schedulers/__init__.py
index 01261646fa..8ea59930e8 100644
--- a/mmpose/engine/schedulers/__init__.py
+++ b/mmpose/engine/schedulers/__init__.py
@@ -1,8 +1,9 @@
# Copyright (c) OpenMMLab. All rights reserved.
+from .constant_lr import ConstantLR
from .quadratic_warmup import (QuadraticWarmupLR, QuadraticWarmupMomentum,
QuadraticWarmupParamScheduler)
__all__ = [
'QuadraticWarmupParamScheduler', 'QuadraticWarmupMomentum',
- 'QuadraticWarmupLR'
+ 'QuadraticWarmupLR', 'ConstantLR'
]
diff --git a/mmpose/engine/schedulers/constant_lr.py b/mmpose/engine/schedulers/constant_lr.py
new file mode 100644
index 0000000000..3b96374542
--- /dev/null
+++ b/mmpose/engine/schedulers/constant_lr.py
@@ -0,0 +1,80 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from mmengine.optim.scheduler import \
+ ConstantParamScheduler as MMENGINE_ConstantParamScheduler
+from mmengine.optim.scheduler.lr_scheduler import LRSchedulerMixin
+
+from mmpose.registry import PARAM_SCHEDULERS
+
+INF = int(1e9)
+
+
+class ConstantParamScheduler(MMENGINE_ConstantParamScheduler):
+ """Decays the parameter value of each parameter group by a small constant
+    factor until the number of epochs reaches a pre-defined milestone: ``end``.
+ Notice that such decay can happen simultaneously with other changes to the
+ parameter value from outside this scheduler. The factor range restriction
+ is removed.
+
+ Args:
+ optimizer (Optimizer or BaseOptimWrapper): optimizer or Wrapped
+ optimizer.
+ param_name (str): Name of the parameter to be adjusted, such as
+ ``lr``, ``momentum``.
+ factor (float): The number we multiply parameter value until the
+ milestone. Defaults to 1./3.
+ begin (int): Step at which to start updating the parameters.
+ Defaults to 0.
+ end (int): Step at which to stop updating the parameters.
+ Defaults to INF.
+ last_step (int): The index of last step. Used for resume without
+ state dict. Defaults to -1.
+ by_epoch (bool): Whether the scheduled parameters are updated by
+ epochs. Defaults to True.
+ verbose (bool): Whether to print the value for each update.
+ Defaults to False.
+ """
+
+ def __init__(self,
+ optimizer,
+ param_name: str,
+ factor: float = 1.0 / 3,
+ begin: int = 0,
+ end: int = INF,
+ last_step: int = -1,
+ by_epoch: bool = True,
+ verbose: bool = False):
+
+ self.factor = factor
+ self.total_iters = end - begin - 1
+ super(MMENGINE_ConstantParamScheduler, self).__init__(
+ optimizer,
+ param_name=param_name,
+ begin=begin,
+ end=end,
+ last_step=last_step,
+ by_epoch=by_epoch,
+ verbose=verbose)
+
+
+@PARAM_SCHEDULERS.register_module()
+class ConstantLR(LRSchedulerMixin, ConstantParamScheduler):
+ """Decays the learning rate value of each parameter group by a small
+ constant factor until the number of epoch reaches a pre-defined milestone:
+ ``end``. Notice that such decay can happen simultaneously with other
+ changes to the learning rate value from outside this scheduler.
+
+ Args:
+ optimizer (Optimizer or OptimWrapper): Wrapped optimizer.
+ factor (float): The number we multiply learning rate until the
+ milestone. Defaults to 1./3.
+ begin (int): Step at which to start updating the learning rate.
+ Defaults to 0.
+ end (int): Step at which to stop updating the learning rate.
+ Defaults to INF.
+ last_step (int): The index of last step. Used for resume without state
+ dict. Defaults to -1.
+ by_epoch (bool): Whether the scheduled learning rate is updated by
+ epochs. Defaults to True.
+ verbose (bool): Whether to print the learning rate for each update.
+ Defaults to False.
+ """
diff --git a/mmpose/evaluation/functional/__init__.py b/mmpose/evaluation/functional/__init__.py
index 47255fc394..239968f03a 100644
--- a/mmpose/evaluation/functional/__init__.py
+++ b/mmpose/evaluation/functional/__init__.py
@@ -3,12 +3,13 @@
keypoint_nme, keypoint_pck_accuracy,
multilabel_classification_accuracy,
pose_pck_accuracy, simcc_pck_accuracy)
-from .nms import nms, nms_torch, oks_nms, soft_oks_nms
+from .nms import nearby_joints_nms, nms, nms_torch, oks_nms, soft_oks_nms
from .transforms import transform_ann, transform_pred, transform_sigmas
__all__ = [
'keypoint_pck_accuracy', 'keypoint_auc', 'keypoint_nme', 'keypoint_epe',
'pose_pck_accuracy', 'multilabel_classification_accuracy',
'simcc_pck_accuracy', 'nms', 'oks_nms', 'soft_oks_nms', 'keypoint_mpjpe',
- 'nms_torch', 'transform_ann', 'transform_sigmas', 'transform_pred'
+ 'nms_torch', 'transform_ann', 'transform_sigmas', 'transform_pred',
+ 'nearby_joints_nms'
]
diff --git a/mmpose/evaluation/functional/nms.py b/mmpose/evaluation/functional/nms.py
index 7f669c89cb..f7dd2279c7 100644
--- a/mmpose/evaluation/functional/nms.py
+++ b/mmpose/evaluation/functional/nms.py
@@ -258,7 +258,7 @@ def soft_oks_nms(kpts_db: List[dict],
def nearby_joints_nms(
kpts_db: List[dict],
- dist_thr: float,
+ dist_thr: float = 0.05,
num_nearby_joints_thr: Optional[int] = None,
score_per_joint: bool = False,
max_dets: int = 30,
@@ -271,9 +271,10 @@ def nearby_joints_nms(
Args:
kpts_db (list[dict]): keypoints and scores.
dist_thr (float): threshold for judging whether two joints are close.
+ Defaults to 0.05.
num_nearby_joints_thr (int): threshold for judging whether two
instances are close.
- max_dets (int): max number of detections to keep.
+ max_dets (int): max number of detections to keep. Defaults to 30.
score_per_joint (bool): the input scores (in kpts_db) are per joint
scores.
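A small usage sketch for the function with its new default threshold. The dict keys (``keypoints``, ``score``) follow the other NMS helpers in this module; the poses are synthetic:

```python
import numpy as np
from mmpose.evaluation.functional import nearby_joints_nms

pose = np.random.rand(17, 2) * 100
kpts_db = [
    dict(keypoints=pose, score=0.9),
    dict(keypoints=pose + 1e-3, score=0.8),   # near-duplicate, expected to be suppressed
    dict(keypoints=pose + 200.0, score=0.7),  # clearly different pose
]
keep = nearby_joints_nms(kpts_db, dist_thr=0.05, num_nearby_joints_thr=8)
print(keep)  # indices of the detections kept
```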
diff --git a/mmpose/models/heads/__init__.py b/mmpose/models/heads/__init__.py
index e4b499ad2b..319f0c6836 100644
--- a/mmpose/models/heads/__init__.py
+++ b/mmpose/models/heads/__init__.py
@@ -3,7 +3,7 @@
from .coord_cls_heads import RTMCCHead, RTMWHead, SimCCHead
from .heatmap_heads import (AssociativeEmbeddingHead, CIDHead, CPMHead,
HeatmapHead, InternetHead, MSPNHead, ViPNASHead)
-from .hybrid_heads import DEKRHead, VisPredictHead
+from .hybrid_heads import DEKRHead, RTMOHead, VisPredictHead
from .regression_heads import (DSNTHead, IntegralRegressionHead,
MotionRegressionHead, RegressionHead, RLEHead,
TemporalRegressionHead,
@@ -16,5 +16,5 @@
'DSNTHead', 'AssociativeEmbeddingHead', 'DEKRHead', 'VisPredictHead',
'CIDHead', 'RTMCCHead', 'TemporalRegressionHead',
'TrajectoryRegressionHead', 'MotionRegressionHead', 'EDPoseHead',
- 'InternetHead', 'RTMWHead'
+ 'InternetHead', 'RTMWHead', 'RTMOHead'
]
diff --git a/mmpose/models/heads/base_head.py b/mmpose/models/heads/base_head.py
index 14882db243..d35c27b8b2 100644
--- a/mmpose/models/heads/base_head.py
+++ b/mmpose/models/heads/base_head.py
@@ -64,20 +64,33 @@ def _pack_and_call(args, func):
if self.decoder.support_batch_decoding:
batch_keypoints, batch_scores = _pack_and_call(
batch_outputs, self.decoder.batch_decode)
+ if isinstance(batch_scores, tuple) and len(batch_scores) == 2:
+ batch_scores, batch_visibility = batch_scores
+ else:
+ batch_visibility = [None] * len(batch_keypoints)
else:
batch_output_np = to_numpy(batch_outputs, unzip=True)
batch_keypoints = []
batch_scores = []
+ batch_visibility = []
for outputs in batch_output_np:
keypoints, scores = _pack_and_call(outputs,
self.decoder.decode)
batch_keypoints.append(keypoints)
- batch_scores.append(scores)
-
- preds = [
- InstanceData(keypoints=keypoints, keypoint_scores=scores)
- for keypoints, scores in zip(batch_keypoints, batch_scores)
- ]
+ if isinstance(scores, tuple) and len(scores) == 2:
+ batch_scores.append(scores[0])
+ batch_visibility.append(scores[1])
+ else:
+ batch_scores.append(scores)
+ batch_visibility.append(None)
+
+ preds = []
+ for keypoints, scores, visibility in zip(batch_keypoints, batch_scores,
+ batch_visibility):
+ pred = InstanceData(keypoints=keypoints, keypoint_scores=scores)
+ if visibility is not None:
+ pred.keypoints_visible = visibility
+ preds.append(pred)
return preds
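A standalone sketch of the unpacking logic above: codecs may now return ``(scores, visibility)`` instead of a single score array, and the extra array ends up in ``keypoints_visible`` of the prediction:

```python
import numpy as np
from mmengine.structures import InstanceData

keypoints = np.zeros((1, 17, 2), dtype=np.float32)
scores = (np.ones((1, 17), dtype=np.float32),        # keypoint scores
          np.full((1, 17), 0.8, dtype=np.float32))   # predicted visibility

if isinstance(scores, tuple) and len(scores) == 2:
    scores, visibility = scores
else:
    visibility = None

pred = InstanceData(keypoints=keypoints, keypoint_scores=scores)
if visibility is not None:
    pred.keypoints_visible = visibility
print(pred.keypoints_visible.mean())  # 0.8
```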
diff --git a/mmpose/models/heads/hybrid_heads/__init__.py b/mmpose/models/heads/hybrid_heads/__init__.py
index ff026ce855..9d82a3cd2b 100644
--- a/mmpose/models/heads/hybrid_heads/__init__.py
+++ b/mmpose/models/heads/hybrid_heads/__init__.py
@@ -1,6 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .dekr_head import DEKRHead
+from .rtmo_head import RTMOHead
from .vis_head import VisPredictHead
from .yoloxpose_head import YOLOXPoseHead
-__all__ = ['DEKRHead', 'VisPredictHead', 'YOLOXPoseHead']
+__all__ = ['DEKRHead', 'VisPredictHead', 'YOLOXPoseHead', 'RTMOHead']
diff --git a/mmpose/models/heads/hybrid_heads/rtmo_head.py b/mmpose/models/heads/hybrid_heads/rtmo_head.py
new file mode 100644
index 0000000000..c364c20e98
--- /dev/null
+++ b/mmpose/models/heads/hybrid_heads/rtmo_head.py
@@ -0,0 +1,1040 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import copy
+import types
+from typing import Dict, List, Optional, Sequence, Tuple, Union
+
+import torch
+import torch.nn as nn
+from mmcv.cnn import ConvModule, Scale
+from mmdet.utils import ConfigType, reduce_mean
+from mmengine.model import BaseModule, bias_init_with_prob
+from mmengine.structures import InstanceData
+from torch import Tensor
+
+from mmpose.evaluation.functional import nms_torch
+from mmpose.models.utils import (GAUEncoder, SinePositionalEncoding,
+ filter_scores_and_topk)
+from mmpose.registry import MODELS
+from mmpose.structures.bbox import bbox_xyxy2cs
+from mmpose.utils.typing import Features, OptSampleList, Predictions
+from .yoloxpose_head import YOLOXPoseHead
+
+EPS = 1e-8
+
+
+class RTMOHeadModule(BaseModule):
+ """RTMO head module for one-stage human pose estimation.
+
+ This module predicts classification scores, bounding boxes, keypoint
+ offsets and visibilities from multi-level feature maps.
+
+ Args:
+ num_classes (int): Number of categories excluding the background
+ category.
+ num_keypoints (int): Number of keypoints defined for one instance.
+ in_channels (int): Number of channels in the input feature maps.
+        cls_feat_channels (int): Number of channels in the classification score
+            and objectness prediction branch. Defaults to 256.
+        widen_factor (float): Width multiplier, multiply number of
+            channels in each layer by this amount. Defaults to 1.0.
+        stacked_convs (int): Number of stacked conv layers in the
+            classification branch (the pose branch uses twice as many).
+            Defaults to 2.
+        num_groups (int): Group number of group convolution layers in keypoint
+            regression branch. Defaults to 8.
+        channels_per_group (int): Number of channels for each group of group
+            convolution layers in keypoint regression branch. Defaults to 36.
+        pose_vec_channels (int): Number of output channels of the pose vector.
+            If not positive, the raw regression features are returned instead.
+            Defaults to -1.
+        featmap_strides (Sequence[int]): Downsample factor of each feature
+            map. Defaults to [8, 16, 32].
+        conv_bias (bool or str): If specified as `auto`, it will be decided
+            by the norm_cfg. Bias of conv will be set as True if `norm_cfg`
+            is None, otherwise False. Defaults to "auto".
+        conv_cfg (:obj:`ConfigDict` or dict, optional): Config dict for
+            convolution layer. Defaults to None.
+        norm_cfg (:obj:`ConfigDict` or dict): Config dict for normalization
+            layer. Defaults to dict(type='BN', momentum=0.03, eps=0.001).
+        act_cfg (:obj:`ConfigDict` or dict): Config dict for activation layer.
+            Defaults to dict(type='SiLU', inplace=True).
+ init_cfg (:obj:`ConfigDict` or list[:obj:`ConfigDict`] or dict or
+ list[dict], optional): Initialization config dict.
+ Defaults to None.
+ """
+
+ def __init__(
+ self,
+ num_keypoints: int,
+ in_channels: int,
+ num_classes: int = 1,
+ widen_factor: float = 1.0,
+ cls_feat_channels: int = 256,
+ stacked_convs: int = 2,
+ num_groups=8,
+ channels_per_group=36,
+ pose_vec_channels=-1,
+ featmap_strides: Sequence[int] = [8, 16, 32],
+ conv_bias: Union[bool, str] = 'auto',
+ conv_cfg: Optional[ConfigType] = None,
+ norm_cfg: ConfigType = dict(type='BN', momentum=0.03, eps=0.001),
+ act_cfg: ConfigType = dict(type='SiLU', inplace=True),
+ init_cfg: Optional[ConfigType] = None,
+ ):
+ super().__init__(init_cfg=init_cfg)
+ self.num_classes = num_classes
+ self.cls_feat_channels = int(cls_feat_channels * widen_factor)
+ self.stacked_convs = stacked_convs
+ assert conv_bias == 'auto' or isinstance(conv_bias, bool)
+ self.conv_bias = conv_bias
+
+ self.conv_cfg = conv_cfg
+ self.norm_cfg = norm_cfg
+ self.act_cfg = act_cfg
+ self.featmap_strides = featmap_strides
+
+ self.in_channels = int(in_channels * widen_factor)
+ self.num_keypoints = num_keypoints
+
+ self.num_groups = num_groups
+ self.channels_per_group = int(widen_factor * channels_per_group)
+ self.pose_vec_channels = pose_vec_channels
+
+ self._init_layers()
+
+ def _init_layers(self):
+ """Initialize heads for all level feature maps."""
+ self._init_cls_branch()
+ self._init_pose_branch()
+
+ def _init_cls_branch(self):
+ """Initialize classification branch for all level feature maps."""
+ self.conv_cls = nn.ModuleList()
+ for _ in self.featmap_strides:
+ stacked_convs = []
+ for i in range(self.stacked_convs):
+ chn = self.in_channels if i == 0 else self.cls_feat_channels
+ stacked_convs.append(
+ ConvModule(
+ chn,
+ self.cls_feat_channels,
+ 3,
+ stride=1,
+ padding=1,
+ conv_cfg=self.conv_cfg,
+ norm_cfg=self.norm_cfg,
+ act_cfg=self.act_cfg,
+ bias=self.conv_bias))
+ self.conv_cls.append(nn.Sequential(*stacked_convs))
+
+ # output layers
+ self.out_cls = nn.ModuleList()
+ for _ in self.featmap_strides:
+ self.out_cls.append(
+ nn.Conv2d(self.cls_feat_channels, self.num_classes, 1))
+
+ def _init_pose_branch(self):
+ """Initialize pose prediction branch for all level feature maps."""
+ self.conv_pose = nn.ModuleList()
+ out_chn = self.num_groups * self.channels_per_group
+ for _ in self.featmap_strides:
+ stacked_convs = []
+ for i in range(self.stacked_convs * 2):
+ chn = self.in_channels if i == 0 else out_chn
+ groups = 1 if i == 0 else self.num_groups
+ stacked_convs.append(
+ ConvModule(
+ chn,
+ out_chn,
+ 3,
+ stride=1,
+ padding=1,
+ groups=groups,
+ conv_cfg=self.conv_cfg,
+ norm_cfg=self.norm_cfg,
+ act_cfg=self.act_cfg,
+ bias=self.conv_bias))
+ self.conv_pose.append(nn.Sequential(*stacked_convs))
+
+ # output layers
+ self.out_bbox = nn.ModuleList()
+ self.out_kpt_reg = nn.ModuleList()
+ self.out_kpt_vis = nn.ModuleList()
+ for _ in self.featmap_strides:
+ self.out_bbox.append(nn.Conv2d(out_chn, 4, 1))
+ self.out_kpt_reg.append(
+ nn.Conv2d(out_chn, self.num_keypoints * 2, 1))
+ self.out_kpt_vis.append(nn.Conv2d(out_chn, self.num_keypoints, 1))
+
+ if self.pose_vec_channels > 0:
+ self.out_pose = nn.ModuleList()
+ for _ in self.featmap_strides:
+ self.out_pose.append(
+ nn.Conv2d(out_chn, self.pose_vec_channels, 1))
+
+ def init_weights(self):
+ """Initialize weights of the head.
+
+ Use prior in model initialization to improve stability.
+ """
+
+ super().init_weights()
+ bias_init = bias_init_with_prob(0.01)
+ for conv_cls in self.out_cls:
+ conv_cls.bias.data.fill_(bias_init)
+
+ def forward(self, x: Tuple[Tensor]) -> Tuple[List]:
+ """Forward features from the upstream network.
+
+ Args:
+ x (Tuple[Tensor]): Features from the upstream network, each is
+ a 4D-tensor.
+
+ Returns:
+ cls_scores (List[Tensor]): Classification scores for each level.
+ bbox_preds (List[Tensor]): Bounding box predictions for each level.
+ kpt_offsets (List[Tensor]): Keypoint offsets for each level.
+ kpt_vis (List[Tensor]): Keypoint visibilities for each level.
+ pose_feats (List[Tensor]): Pose features for each level.
+ """
+
+ cls_scores, bbox_preds = [], []
+ kpt_offsets, kpt_vis = [], []
+ pose_feats = []
+
+ for i in range(len(x)):
+
+ cls_feat, reg_feat = x[i].split(x[i].size(1) // 2, 1)
+
+ cls_feat = self.conv_cls[i](cls_feat)
+ reg_feat = self.conv_pose[i](reg_feat)
+
+ cls_scores.append(self.out_cls[i](cls_feat))
+ bbox_preds.append(self.out_bbox[i](reg_feat))
+ if self.training:
+ # `kpt_offsets` generates the proxy poses for positive
+ # sample selection during training
+ kpt_offsets.append(self.out_kpt_reg[i](reg_feat))
+ kpt_vis.append(self.out_kpt_vis[i](reg_feat))
+
+ if self.pose_vec_channels > 0:
+ pose_feats.append(self.out_pose[i](reg_feat))
+ else:
+ pose_feats.append(reg_feat)
+
+ return cls_scores, bbox_preds, kpt_offsets, kpt_vis, pose_feats
+
+
+class DCC(BaseModule):
+ """Dynamic Coordinate Classifier for One-stage Pose Estimation.
+
+ Args:
+ in_channels (int): Number of input feature map channels.
+ num_keypoints (int): Number of keypoints for pose estimation.
+ feat_channels (int): Number of feature channels.
+ num_bins (Tuple[int, int]): Tuple representing the number of bins in
+ x and y directions.
+ spe_channels (int): Number of channels for Sine Positional Encoding.
+ Defaults to 128.
+ spe_temperature (float): Temperature for Sine Positional Encoding.
+ Defaults to 300.0.
+ gau_cfg (dict, optional): Configuration for Gated Attention Unit.
+ """
+
+ def __init__(
+ self,
+ in_channels: int,
+ num_keypoints: int,
+ feat_channels: int,
+ num_bins: Tuple[int, int],
+ spe_channels: int = 128,
+ spe_temperature: float = 300.0,
+ gau_cfg: Optional[dict] = dict(
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.0,
+ drop_path=0.0,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc='add'),
+ ):
+ super().__init__()
+
+ self.in_channels = in_channels
+ self.num_keypoints = num_keypoints
+
+ self.feat_channels = feat_channels
+ self.num_bins = num_bins
+ self.gau_cfg = gau_cfg
+
+ self.spe = SinePositionalEncoding(
+ out_channels=spe_channels,
+ temperature=spe_temperature,
+ )
+ self.spe_feat_channels = spe_channels
+
+ self._build_layers()
+ self._build_basic_bins()
+
+ def _build_layers(self):
+ """Builds layers for the model."""
+
+ # GAU encoder
+ if self.gau_cfg is not None:
+ gau_cfg = self.gau_cfg.copy()
+ gau_cfg['in_token_dims'] = self.feat_channels
+ gau_cfg['out_token_dims'] = self.feat_channels
+ self.gau = GAUEncoder(**gau_cfg)
+ if gau_cfg.get('pos_enc', 'none') in ('add', 'rope'):
+ self.pos_enc = nn.Parameter(
+ torch.randn(self.num_keypoints, gau_cfg['s']))
+
+ # fully-connected layers to convert pose feats to keypoint feats
+ pose_to_kpts = [
+ nn.Linear(self.in_channels,
+ self.feat_channels * self.num_keypoints),
+ nn.BatchNorm1d(self.feat_channels * self.num_keypoints)
+ ]
+ self.pose_to_kpts = nn.Sequential(*pose_to_kpts)
+
+ # adapter layers for dynamic encodings
+ self.x_fc = nn.Linear(self.spe_feat_channels, self.feat_channels)
+ self.y_fc = nn.Linear(self.spe_feat_channels, self.feat_channels)
+
+ # fully-connected layers to predict sigma
+ self.sigma_fc = nn.Sequential(
+ nn.Linear(self.in_channels, self.num_keypoints), nn.Sigmoid(),
+ Scale(0.1))
+
+ def _build_basic_bins(self):
+ """Builds basic bin coordinates for x and y."""
+ self.register_buffer('y_bins',
+ torch.linspace(-0.5, 0.5, self.num_bins[1]))
+ self.register_buffer('x_bins',
+ torch.linspace(-0.5, 0.5, self.num_bins[0]))
+
+ def _apply_softmax(self, x_hms, y_hms):
+ """Apply softmax on 1-D heatmaps.
+
+ Args:
+ x_hms (Tensor): 1-D heatmap in x direction.
+ y_hms (Tensor): 1-D heatmap in y direction.
+
+ Returns:
+ tuple: A tuple containing the normalized x and y heatmaps.
+ """
+
+ x_hms = x_hms.clamp(min=-5e4, max=5e4)
+ y_hms = y_hms.clamp(min=-5e4, max=5e4)
+ pred_x = x_hms - x_hms.max(dim=-1, keepdims=True).values.detach()
+ pred_y = y_hms - y_hms.max(dim=-1, keepdims=True).values.detach()
+
+ exp_x, exp_y = pred_x.exp(), pred_y.exp()
+ prob_x = exp_x / (exp_x.sum(dim=-1, keepdims=True) + EPS)
+ prob_y = exp_y / (exp_y.sum(dim=-1, keepdims=True) + EPS)
+
+ return prob_x, prob_y
+
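For reference, a NumPy sketch of the same numerically stable normalization used above (clamping, max subtraction, EPS-regularized denominator); shapes are illustrative:

```python
import numpy as np

EPS = 1e-8

def apply_softmax_1d(hms: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last (bin) dimension."""
    hms = np.clip(hms, -5e4, 5e4)
    hms = hms - hms.max(axis=-1, keepdims=True)
    exp = np.exp(hms)
    return exp / (exp.sum(axis=-1, keepdims=True) + EPS)

probs = apply_softmax_1d(np.random.randn(2, 17, 192))  # (batch, keypoints, bins)
assert np.allclose(probs.sum(-1), 1.0, atol=1e-4)
```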
+ def _get_bin_enc(self, bbox_cs, grids):
+ """Calculate dynamic bin encodings for expanded bounding box.
+
+ This function computes dynamic bin allocations and encodings based
+ on the expanded bounding box center-scale (bbox_cs) and grid values.
+ The process involves adjusting the bins according to the scale and
+ center of the bounding box and then applying a sinusoidal positional
+ encoding (spe) followed by a fully connected layer (fc) to obtain the
+ final x and y bin encodings.
+
+ Args:
+ bbox_cs (Tensor): A tensor representing the center and scale of
+ bounding boxes.
+ grids (Tensor): A tensor representing the grid coordinates.
+
+ Returns:
+ tuple: A tuple containing the encoded x and y bins.
+ """
+ center, scale = bbox_cs.split(2, dim=-1)
+ center = center - grids
+
+ x_bins, y_bins = self.x_bins, self.y_bins
+
+ # dynamic bin allocation
+ x_bins = x_bins.view(*((1,) * (scale.ndim-1)), -1) \
+ * scale[..., 0:1] + center[..., 0:1]
+ y_bins = y_bins.view(*((1,) * (scale.ndim-1)), -1) \
+ * scale[..., 1:2] + center[..., 1:2]
+
+ # dynamic bin encoding
+ x_bins_enc = self.x_fc(self.spe(position=x_bins))
+ y_bins_enc = self.y_fc(self.spe(position=y_bins))
+
+ return x_bins_enc, y_bins_enc
+
+ def _pose_feats_to_heatmaps(self, pose_feats, x_bins_enc, y_bins_enc):
+ """Convert pose features to heatmaps using x and y bin encodings.
+
+ This function transforms the given pose features into keypoint
+ features and then generates x and y heatmaps based on the x and y
+        bin encodings. If a Gated Attention Unit (GAU) is used, it is applied
+        to the keypoint features. The heatmaps are generated using matrix
+ multiplication of pose features and bin encodings.
+
+ Args:
+ pose_feats (Tensor): The pose features tensor.
+ x_bins_enc (Tensor): The encoded x bins tensor.
+ y_bins_enc (Tensor): The encoded y bins tensor.
+
+ Returns:
+ tuple: A tuple containing the x and y heatmaps.
+ """
+
+ kpt_feats = self.pose_to_kpts(pose_feats)
+
+ kpt_feats = kpt_feats.reshape(*kpt_feats.shape[:-1],
+ self.num_keypoints, self.feat_channels)
+
+ if hasattr(self, 'gau'):
+ kpt_feats = self.gau(
+ kpt_feats, pos_enc=getattr(self, 'pos_enc', None))
+
+ x_hms = torch.matmul(kpt_feats,
+ x_bins_enc.transpose(-1, -2).contiguous())
+ y_hms = torch.matmul(kpt_feats,
+ y_bins_enc.transpose(-1, -2).contiguous())
+
+ return x_hms, y_hms
+
+ def _decode_xy_heatmaps(self, x_hms, y_hms, bbox_cs):
+ """Decode x and y heatmaps to obtain coordinates.
+
+ This function decodes x and y heatmaps to obtain the corresponding
+ coordinates. It adjusts the x and y bins based on the bounding box
+ center and scale, and then computes the weighted sum of these bins
+ with the heatmaps to derive the x and y coordinates.
+
+ Args:
+ x_hms (Tensor): The normalized x heatmaps tensor.
+ y_hms (Tensor): The normalized y heatmaps tensor.
+ bbox_cs (Tensor): The bounding box center-scale tensor.
+
+ Returns:
+ Tensor: A tensor of decoded x and y coordinates.
+ """
+ center, scale = bbox_cs.split(2, dim=-1)
+
+ x_bins, y_bins = self.x_bins, self.y_bins
+
+ x_bins = x_bins.view(*((1,) * (scale.ndim-1)), -1) \
+ * scale[..., 0:1] + center[..., 0:1]
+ y_bins = y_bins.view(*((1,) * (scale.ndim-1)), -1) \
+ * scale[..., 1:2] + center[..., 1:2]
+
+ x = (x_hms * x_bins.unsqueeze(1)).sum(dim=-1)
+ y = (y_hms * y_bins.unsqueeze(1)).sum(dim=-1)
+
+ return torch.stack((x, y), dim=-1)
+
+ def generate_target_heatmap(self, kpt_targets, bbox_cs, sigmas, areas):
+ """Generate target heatmaps for keypoints based on bounding box.
+
+ This function calculates x and y bins adjusted by bounding box center
+ and scale. It then computes distances from keypoint targets to these
+ bins and normalizes these distances based on the areas and sigmas.
+ Finally, it uses these distances to generate heatmaps for x and y
+        coordinates under the assumption of a Laplacian error distribution.
+
+ Args:
+ kpt_targets (Tensor): Keypoint targets tensor.
+ bbox_cs (Tensor): Bounding box center-scale tensor.
+ sigmas (Tensor): Learned deviation of grids.
+ areas (Tensor): Areas of GT instance assigned to grids.
+
+ Returns:
+ tuple: A tuple containing the x and y heatmaps.
+ """
+
+ # calculate the error of each bin from the GT keypoint coordinates
+ center, scale = bbox_cs.split(2, dim=-1)
+ x_bins = self.x_bins.view(*((1,) * (scale.ndim-1)), -1) \
+ * scale[..., 0:1] + center[..., 0:1]
+ y_bins = self.y_bins.view(*((1,) * (scale.ndim-1)), -1) \
+ * scale[..., 1:2] + center[..., 1:2]
+
+ dist_x = torch.abs(kpt_targets.narrow(2, 0, 1) - x_bins.unsqueeze(1))
+ dist_y = torch.abs(kpt_targets.narrow(2, 1, 1) - y_bins.unsqueeze(1))
+
+ # normalize
+ areas = areas.pow(0.5).clip(min=1).reshape(-1, 1, 1)
+ sigmas = sigmas.clip(min=1e-3).unsqueeze(2)
+ dist_x = dist_x / areas / sigmas
+ dist_y = dist_y / areas / sigmas
+
+ hm_x = torch.exp(-dist_x / 2) / sigmas
+ hm_y = torch.exp(-dist_y / 2) / sigmas
+
+ return hm_x, hm_y
+
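A NumPy sketch of the Laplacian-style 1-D target built above, for a single keypoint along one axis; the bin layout and normalization follow the method, while the numbers are made up:

```python
import numpy as np

bins = np.linspace(100.0, 300.0, 192)   # dynamic bins spanning the (expanded) bbox
kpt_x = 187.3                           # ground-truth x coordinate
area, sigma = 180.0 ** 2, 0.05          # instance area and learned sigma

dist = np.abs(kpt_x - bins) / np.sqrt(area) / max(sigma, 1e-3)
hm_x = np.exp(-dist / 2) / sigma        # Laplacian-shaped 1-D target heatmap
print(bins[hm_x.argmax()])              # peak sits at the bin closest to kpt_x
```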
+ def forward_train(self, pose_feats, bbox_cs, grids):
+ """Forward pass for training.
+
+ This function processes pose features during training. It computes
+ sigmas using a fully connected layer, generates bin encodings,
+ creates heatmaps from pose features, applies softmax to the heatmaps,
+ and then decodes the heatmaps to get pose predictions.
+
+ Args:
+ pose_feats (Tensor): The pose features tensor.
+ bbox_cs (Tensor): The bounding box in the format of center & scale.
+ grids (Tensor): The grid coordinates.
+
+ Returns:
+ tuple: A tuple containing pose predictions, heatmaps, and sigmas.
+ """
+ sigmas = self.sigma_fc(pose_feats)
+ x_bins_enc, y_bins_enc = self._get_bin_enc(bbox_cs, grids)
+ x_hms, y_hms = self._pose_feats_to_heatmaps(pose_feats, x_bins_enc,
+ y_bins_enc)
+ x_hms, y_hms = self._apply_softmax(x_hms, y_hms)
+ pose_preds = self._decode_xy_heatmaps(x_hms, y_hms, bbox_cs)
+ return pose_preds, (x_hms, y_hms), sigmas
+
+ @torch.no_grad()
+ def forward_test(self, pose_feats, bbox_cs, grids):
+ """Forward pass for testing.
+
+ This function processes pose features during testing. It generates
+ bin encodings, creates heatmaps from pose features, and then decodes
+ the heatmaps to get pose predictions.
+
+ Args:
+ pose_feats (Tensor): The pose features tensor.
+ bbox_cs (Tensor): The bounding box in the format of center & scale.
+ grids (Tensor): The grid coordinates.
+
+ Returns:
+ Tensor: Pose predictions tensor.
+ """
+ x_bins_enc, y_bins_enc = self._get_bin_enc(bbox_cs, grids)
+ x_hms, y_hms = self._pose_feats_to_heatmaps(pose_feats, x_bins_enc,
+ y_bins_enc)
+ x_hms, y_hms = self._apply_softmax(x_hms, y_hms)
+ pose_preds = self._decode_xy_heatmaps(x_hms, y_hms, bbox_cs)
+ return pose_preds
+
+ def switch_to_deploy(self, test_cfg: Optional[Dict] = None):
+ if getattr(self, 'deploy', False):
+ return
+
+ self._convert_pose_to_kpts()
+ if hasattr(self, 'gau'):
+ self._convert_gau()
+ self._convert_forward_test()
+
+ self.deploy = True
+
+ def _convert_pose_to_kpts(self):
+ """Merge BatchNorm layer into Fully Connected layer.
+
+ This function merges a BatchNorm layer into the associated Fully
+        Connected layer to avoid dimension mismatch during ONNX export. It
+ adjusts the weights and biases of the FC layer to incorporate the BN
+ layer's parameters, and then replaces the original FC layer with the
+ updated one.
+ """
+ fc, bn = self.pose_to_kpts
+
+ # Calculate adjusted weights and biases
+ std = (bn.running_var + bn.eps).sqrt()
+ weight = fc.weight * (bn.weight / std).unsqueeze(1)
+ bias = bn.bias + (fc.bias - bn.running_mean) * bn.weight / std
+
+ # Update FC layer with adjusted parameters
+ fc.weight.data = weight.detach()
+ fc.bias.data = bias.detach()
+ self.pose_to_kpts = fc
+
+ def _convert_gau(self):
+ """Reshape and merge tensors for Gated Attention Unit (GAU).
+
+ This function pre-processes the gamma and beta tensors of the GAU and
+ handles the position encoding if available. It also redefines the GAU's
+ forward method to incorporate these pre-processed tensors, optimizing
+ the computation process.
+ """
+ # Reshape gamma and beta tensors in advance
+ gamma_q = self.gau.gamma[0].view(1, 1, 1, self.gau.gamma.size(-1))
+ gamma_k = self.gau.gamma[1].view(1, 1, 1, self.gau.gamma.size(-1))
+ beta_q = self.gau.beta[0].view(1, 1, 1, self.gau.beta.size(-1))
+ beta_k = self.gau.beta[1].view(1, 1, 1, self.gau.beta.size(-1))
+
+ # Adjust beta tensors with position encoding if available
+ if hasattr(self, 'pos_enc'):
+ pos_enc = self.pos_enc.reshape(1, 1, *self.pos_enc.shape)
+ beta_q = beta_q + pos_enc
+ beta_k = beta_k + pos_enc
+
+ gamma_q = gamma_q.detach().cpu()
+ gamma_k = gamma_k.detach().cpu()
+ beta_q = beta_q.detach().cpu()
+ beta_k = beta_k.detach().cpu()
+
+ @torch.no_grad()
+ def _forward(self, x, *args, **kwargs):
+ norm = torch.linalg.norm(x, dim=-1, keepdim=True) * self.ln.scale
+ x = x / norm.clamp(min=self.ln.eps) * self.ln.g
+
+ uv = self.uv(x)
+ uv = self.act_fn(uv)
+
+ u, v, base = torch.split(uv, [self.e, self.e, self.s], dim=-1)
+ if not torch.onnx.is_in_onnx_export():
+ q = base * gamma_q.to(base) + beta_q.to(base)
+ k = base * gamma_k.to(base) + beta_k.to(base)
+ else:
+ q = base * gamma_q + beta_q
+ k = base * gamma_k + beta_k
+ qk = torch.matmul(q, k.transpose(-1, -2))
+
+ kernel = torch.square(torch.nn.functional.relu(qk / self.sqrt_s))
+ x = u * torch.matmul(kernel, v)
+ x = self.o(x)
+ return x
+
+ self.gau._forward = types.MethodType(_forward, self.gau)
+
+ def _convert_forward_test(self):
+ """Simplify the forward test process.
+
+        This function precomputes the bin coordinates and positional-encoding
+        frequencies, then redefines the ``forward_test`` method of the module.
+        The redefined method converts pose features to keypoint features,
+        performs dynamic bin encoding, calculates 1-D heatmaps, and decodes
+        these heatmaps to produce the final pose predictions.
+ """
+ x_bins_ = self.x_bins.view(1, 1, -1).detach().cpu()
+ y_bins_ = self.y_bins.view(1, 1, -1).detach().cpu()
+ dim_t = self.spe.dim_t.view(1, 1, 1, -1).detach().cpu()
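+        # these CPU tensors are captured by the closure below; outside of
+        # ONNX export they are moved to the runtime device via `.to(scale)`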
+
+ @torch.no_grad()
+ def _forward_test(self, pose_feats, bbox_cs, grids):
+
+ # step 1: pose features -> keypoint features
+ kpt_feats = self.pose_to_kpts(pose_feats)
+ kpt_feats = kpt_feats.reshape(*kpt_feats.shape[:-1],
+ self.num_keypoints,
+ self.feat_channels)
+ kpt_feats = self.gau(kpt_feats)
+
+ # step 2: dynamic bin encoding
+ center, scale = bbox_cs.split(2, dim=-1)
+ center = center - grids
+
+ if not torch.onnx.is_in_onnx_export():
+ x_bins = x_bins_.to(scale) * scale[..., 0:1] + center[..., 0:1]
+ y_bins = y_bins_.to(scale) * scale[..., 1:2] + center[..., 1:2]
+ freq_x = x_bins.unsqueeze(-1) / dim_t.to(scale)
+ freq_y = y_bins.unsqueeze(-1) / dim_t.to(scale)
+ else:
+ x_bins = x_bins_ * scale[..., 0:1] + center[..., 0:1]
+ y_bins = y_bins_ * scale[..., 1:2] + center[..., 1:2]
+ freq_x = x_bins.unsqueeze(-1) / dim_t
+ freq_y = y_bins.unsqueeze(-1) / dim_t
+
+ spe_x = torch.cat((freq_x.cos(), freq_x.sin()), dim=-1)
+ spe_y = torch.cat((freq_y.cos(), freq_y.sin()), dim=-1)
+
+ x_bins_enc = self.x_fc(spe_x).transpose(-1, -2).contiguous()
+ y_bins_enc = self.y_fc(spe_y).transpose(-1, -2).contiguous()
+
+ # step 3: calculate 1-D heatmaps
+ x_hms = torch.matmul(kpt_feats, x_bins_enc)
+ y_hms = torch.matmul(kpt_feats, y_bins_enc)
+ x_hms, y_hms = self._apply_softmax(x_hms, y_hms)
+
+ # step 4: decode 1-D heatmaps through integral
+ x = (x_hms * x_bins.unsqueeze(-2)).sum(dim=-1) + grids[..., 0:1]
+ y = (y_hms * y_bins.unsqueeze(-2)).sum(dim=-1) + grids[..., 1:2]
+
+ keypoints = torch.stack((x, y), dim=-1)
+
+ if not torch.onnx.is_in_onnx_export():
+ keypoints = keypoints.squeeze(0)
+ return keypoints
+
+ self.forward_test = types.MethodType(_forward_test, self)
+
+
+@MODELS.register_module()
+class RTMOHead(YOLOXPoseHead):
+ """One-stage coordinate classification head introduced in RTMO (2023). This
+ head incorporates dynamic coordinate classification and YOLO structure for
+ precise keypoint localization.
+
+ Args:
+ num_keypoints (int): Number of keypoints to detect.
+ head_module_cfg (ConfigType): Configuration for the head module.
+ featmap_strides (Sequence[int]): Strides of feature maps.
+ Defaults to [16, 32].
+ num_classes (int): Number of object classes, defaults to 1.
+ use_aux_loss (bool): Indicates whether to use auxiliary loss,
+ defaults to False.
+        proxy_target_cc (bool): Indicates whether to use keypoints predicted
+            by coordinate classification as the targets of the proxy
+            regression branch. Defaults to False.
+ assigner (ConfigType): Configuration for positive sample assigning
+ module.
+ prior_generator (ConfigType): Configuration for prior generation.
+ bbox_padding (float): Padding for bounding boxes, defaults to 1.25.
+ overlaps_power (float): Power factor adopted by overlaps before they
+ are assigned as targets in classification loss. Defaults to 1.0.
+ dcc_cfg (Optional[ConfigType]): Configuration for dynamic coordinate
+ classification module.
+ loss_cls (Optional[ConfigType]): Configuration for classification loss.
+ loss_bbox (Optional[ConfigType]): Configuration for bounding box loss.
+ loss_oks (Optional[ConfigType]): Configuration for OKS loss.
+ loss_vis (Optional[ConfigType]): Configuration for visibility loss.
+ loss_mle (Optional[ConfigType]): Configuration for MLE loss.
+ loss_bbox_aux (Optional[ConfigType]): Configuration for auxiliary
+ bounding box loss.
+ """
+
+ def __init__(
+ self,
+ num_keypoints: int,
+ head_module_cfg: ConfigType,
+ featmap_strides: Sequence[int] = [16, 32],
+ num_classes: int = 1,
+ use_aux_loss: bool = False,
+ proxy_target_cc: bool = False,
+ assigner: ConfigType = None,
+ prior_generator: ConfigType = None,
+ bbox_padding: float = 1.25,
+ overlaps_power: float = 1.0,
+ dcc_cfg: Optional[ConfigType] = None,
+ loss_cls: Optional[ConfigType] = None,
+ loss_bbox: Optional[ConfigType] = None,
+ loss_oks: Optional[ConfigType] = None,
+ loss_vis: Optional[ConfigType] = None,
+ loss_mle: Optional[ConfigType] = None,
+ loss_bbox_aux: Optional[ConfigType] = None,
+ ):
+ super().__init__(
+ num_keypoints=num_keypoints,
+ head_module_cfg=None,
+ featmap_strides=featmap_strides,
+ num_classes=num_classes,
+ use_aux_loss=use_aux_loss,
+ assigner=assigner,
+ prior_generator=prior_generator,
+ loss_cls=loss_cls,
+ loss_bbox=loss_bbox,
+ loss_oks=loss_oks,
+ loss_vis=loss_vis,
+ loss_bbox_aux=loss_bbox_aux,
+ overlaps_power=overlaps_power)
+
+ self.bbox_padding = bbox_padding
+
+ # override to ensure consistency
+ head_module_cfg['featmap_strides'] = featmap_strides
+ head_module_cfg['num_keypoints'] = num_keypoints
+
+ # build modules
+ self.head_module = RTMOHeadModule(**head_module_cfg)
+
+ self.proxy_target_cc = proxy_target_cc
+ if dcc_cfg is not None:
+ dcc_cfg['num_keypoints'] = num_keypoints
+ self.dcc = DCC(**dcc_cfg)
+
+ # build losses
+ if loss_mle is not None:
+ self.loss_mle = MODELS.build(loss_mle)
+
+ def loss(self,
+ feats: Tuple[Tensor],
+ batch_data_samples: OptSampleList,
+ train_cfg: ConfigType = {}) -> dict:
+ """Calculate losses from a batch of inputs and data samples.
+
+ Args:
+ feats (Tuple[Tensor]): The multi-stage features
+ batch_data_samples (List[:obj:`PoseDataSample`]): The batch
+ data samples
+ train_cfg (dict): The runtime config for training process.
+ Defaults to {}
+
+ Returns:
+ dict: A dictionary of losses.
+ """
+
+ # 1. collect & reform predictions
+ cls_scores, bbox_preds, kpt_offsets, kpt_vis, pose_vecs = self.forward(
+ feats)
+
+ featmap_sizes = [cls_score.shape[2:] for cls_score in cls_scores]
+ mlvl_priors = self.prior_generator.grid_priors(
+ featmap_sizes,
+ dtype=cls_scores[0].dtype,
+ device=cls_scores[0].device,
+ with_stride=True)
+ flatten_priors = torch.cat(mlvl_priors)
+
+ # flatten cls_scores, bbox_preds and objectness
+ flatten_cls_scores = self._flatten_predictions(cls_scores)
+ flatten_bbox_preds = self._flatten_predictions(bbox_preds)
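+        # RTMO has no objectness branch, so a large constant is used as a
+        # placeholder objectness (sigmoid ~= 1) when generating targets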
+ flatten_objectness = torch.ones_like(
+ flatten_cls_scores).detach().narrow(-1, 0, 1) * 1e4
+ flatten_kpt_offsets = self._flatten_predictions(kpt_offsets)
+ flatten_kpt_vis = self._flatten_predictions(kpt_vis)
+ flatten_pose_vecs = self._flatten_predictions(pose_vecs)
+ flatten_bbox_decoded = self.decode_bbox(flatten_bbox_preds,
+ flatten_priors[..., :2],
+ flatten_priors[..., -1])
+ flatten_kpt_decoded = self.decode_kpt_reg(flatten_kpt_offsets,
+ flatten_priors[..., :2],
+ flatten_priors[..., -1])
+
+ # 2. generate targets
+ targets = self._get_targets(flatten_priors,
+ flatten_cls_scores.detach(),
+ flatten_objectness.detach(),
+ flatten_bbox_decoded.detach(),
+ flatten_kpt_decoded.detach(),
+ flatten_kpt_vis.detach(),
+ batch_data_samples)
+ pos_masks, cls_targets, obj_targets, obj_weights, \
+ bbox_targets, bbox_aux_targets, kpt_targets, kpt_aux_targets, \
+ vis_targets, vis_weights, pos_areas, pos_priors, group_indices, \
+ num_fg_imgs = targets
+
+ num_pos = torch.tensor(
+ sum(num_fg_imgs),
+ dtype=torch.float,
+ device=flatten_cls_scores.device)
+ num_total_samples = max(reduce_mean(num_pos), 1.0)
+
+ # 3. calculate loss
+ extra_info = dict(num_samples=num_total_samples)
+ losses = dict()
+ cls_preds_all = flatten_cls_scores.view(-1, self.num_classes)
+
+ if num_pos > 0:
+
+ # 3.1 bbox loss
+ bbox_preds = flatten_bbox_decoded.view(-1, 4)[pos_masks]
+ losses['loss_bbox'] = self.loss_bbox(
+ bbox_preds, bbox_targets) / num_total_samples
+
+ if self.use_aux_loss:
+ if hasattr(self, 'loss_bbox_aux'):
+ bbox_preds_raw = flatten_bbox_preds.view(-1, 4)[pos_masks]
+ losses['loss_bbox_aux'] = self.loss_bbox_aux(
+ bbox_preds_raw, bbox_aux_targets) / num_total_samples
+
+ # 3.2 keypoint visibility loss
+ kpt_vis_preds = flatten_kpt_vis.view(-1,
+ self.num_keypoints)[pos_masks]
+ losses['loss_vis'] = self.loss_vis(kpt_vis_preds, vis_targets,
+ vis_weights)
+
+ # 3.3 keypoint loss
+ kpt_reg_preds = flatten_kpt_decoded.view(-1, self.num_keypoints,
+ 2)[pos_masks]
+
+ if hasattr(self, 'loss_mle') and self.loss_mle.loss_weight > 0:
+ pose_vecs = flatten_pose_vecs.view(
+ -1, flatten_pose_vecs.size(-1))[pos_masks]
+ bbox_cs = torch.cat(
+ bbox_xyxy2cs(bbox_preds, self.bbox_padding), dim=1)
+                # 'cc' refers to 'coordinate classification'
+ kpt_cc_preds, pred_hms, sigmas = \
+ self.dcc.forward_train(pose_vecs,
+ bbox_cs,
+ pos_priors[..., :2])
+ target_hms = self.dcc.generate_target_heatmap(
+ kpt_targets, bbox_cs, sigmas, pos_areas)
+ losses['loss_mle'] = self.loss_mle(pred_hms, target_hms,
+ vis_targets)
+
+ if self.proxy_target_cc:
+ # form the regression target using the coordinate
+ # classification predictions
+ with torch.no_grad():
+ diff_cc = torch.norm(kpt_cc_preds - kpt_targets, dim=-1)
+ diff_reg = torch.norm(kpt_reg_preds - kpt_targets, dim=-1)
+ mask = (diff_reg > diff_cc).float()
+ kpt_weights_reg = vis_targets * mask
+ oks = self.assigner.oks_calculator(kpt_cc_preds,
+ kpt_targets,
+ vis_targets, pos_areas)
+ cls_targets = oks.unsqueeze(1)
+
+ losses['loss_oks'] = self.loss_oks(kpt_reg_preds,
+ kpt_cc_preds.detach(),
+ kpt_weights_reg, pos_areas)
+
+ else:
+ losses['loss_oks'] = self.loss_oks(kpt_reg_preds, kpt_targets,
+ vis_targets, pos_areas)
+
+        # update the targets for the classification loss: the targets of the
+        # positive grids are set to the OKS computed between the predictions
+        # and their assigned ground-truth instances
+ extra_info['overlaps'] = cls_targets
+ cls_targets = cls_targets.pow(self.overlaps_power).detach()
+ obj_targets[pos_masks] = cls_targets.to(obj_targets)
+
+ # 3.4 classification loss
+ losses['loss_cls'] = self.loss_cls(cls_preds_all, obj_targets,
+ obj_weights) / num_total_samples
+ losses.update(extra_info)
+
+ return losses
+
+ def predict(self,
+ feats: Features,
+ batch_data_samples: OptSampleList,
+ test_cfg: ConfigType = {}) -> Predictions:
+ """Predict results from features.
+
+ Args:
+ feats (Tuple[Tensor] | List[Tuple[Tensor]]): The multi-stage
+ features (or multiple multi-scale features in TTA)
+ batch_data_samples (List[:obj:`PoseDataSample`]): The batch
+ data samples
+ test_cfg (dict): The runtime config for testing process. Defaults
+ to {}
+
+ Returns:
+ Union[InstanceList | Tuple[InstanceList | PixelDataList]]: If
+ ``test_cfg['output_heatmap']==True``, return both pose and heatmap
+ prediction; otherwise only return the pose prediction.
+
+ The pose prediction is a list of ``InstanceData``, each contains
+ the following fields:
+
+ - keypoints (np.ndarray): predicted keypoint coordinates in
+ shape (num_instances, K, D) where K is the keypoint number
+ and D is the keypoint dimension
+ - keypoint_scores (np.ndarray): predicted keypoint scores in
+ shape (num_instances, K)
+
+ The heatmap prediction is a list of ``PixelData``, each contains
+ the following fields:
+
+ - heatmaps (Tensor): The predicted heatmaps in shape (1, h, w)
+ or (K+1, h, w) if keypoint heatmaps are predicted
+ - displacements (Tensor): The predicted displacement fields
+ in shape (K*2, h, w)
+ """
+
+ cls_scores, bbox_preds, _, kpt_vis, pose_vecs = self.forward(feats)
+
+ cfg = copy.deepcopy(test_cfg)
+
+ batch_img_metas = [d.metainfo for d in batch_data_samples]
+ featmap_sizes = [cls_score.shape[2:] for cls_score in cls_scores]
+
+        # only regenerate the priors when the feature map sizes change;
+        # otherwise reuse the cached mlvl_priors
+ if featmap_sizes != self.featmap_sizes:
+ self.mlvl_priors = self.prior_generator.grid_priors(
+ featmap_sizes,
+ dtype=cls_scores[0].dtype,
+ device=cls_scores[0].device)
+ self.featmap_sizes = featmap_sizes
+ flatten_priors = torch.cat(self.mlvl_priors)
+
+ mlvl_strides = [
+ flatten_priors.new_full((featmap_size.numel(), ),
+ stride) for featmap_size, stride in zip(
+ featmap_sizes, self.featmap_strides)
+ ]
+ flatten_stride = torch.cat(mlvl_strides)
+
+ # flatten predictions
+ flatten_cls_scores = self._flatten_predictions(cls_scores).sigmoid()
+ flatten_bbox_preds = self._flatten_predictions(bbox_preds)
+ flatten_kpt_vis = self._flatten_predictions(kpt_vis).sigmoid()
+ flatten_pose_vecs = self._flatten_predictions(pose_vecs)
+ if flatten_pose_vecs is None:
+ flatten_pose_vecs = [None] * len(batch_img_metas)
+ flatten_bbox_preds = self.decode_bbox(flatten_bbox_preds,
+ flatten_priors, flatten_stride)
+
+ results_list = []
+ for (bboxes, scores, kpt_vis, pose_vecs,
+ img_meta) in zip(flatten_bbox_preds, flatten_cls_scores,
+ flatten_kpt_vis, flatten_pose_vecs,
+ batch_img_metas):
+
+ score_thr = cfg.get('score_thr', 0.01)
+
+ nms_pre = cfg.get('nms_pre', 100000)
+ scores, labels = scores.max(1, keepdim=True)
+ scores, _, keep_idxs_score, results = filter_scores_and_topk(
+ scores, score_thr, nms_pre, results=dict(labels=labels[:, 0]))
+ labels = results['labels']
+
+ bboxes = bboxes[keep_idxs_score]
+ kpt_vis = kpt_vis[keep_idxs_score]
+ grids = flatten_priors[keep_idxs_score]
+ stride = flatten_stride[keep_idxs_score]
+
+ if bboxes.numel() > 0:
+ nms_thr = cfg.get('nms_thr', 1.0)
+ if nms_thr < 1.0:
+
+ keep_idxs_nms = nms_torch(bboxes, scores, nms_thr)
+ bboxes = bboxes[keep_idxs_nms]
+ stride = stride[keep_idxs_nms]
+ labels = labels[keep_idxs_nms]
+ kpt_vis = kpt_vis[keep_idxs_nms]
+ scores = scores[keep_idxs_nms]
+
+ pose_vecs = pose_vecs[keep_idxs_score][keep_idxs_nms]
+ bbox_cs = torch.cat(
+ bbox_xyxy2cs(bboxes, self.bbox_padding), dim=1)
+ grids = grids[keep_idxs_nms]
+ keypoints = self.dcc.forward_test(pose_vecs, bbox_cs, grids)
+
+ else:
+ # empty prediction
+ keypoints = bboxes.new_zeros((0, self.num_keypoints, 2))
+
+ results = InstanceData(
+ scores=scores,
+ labels=labels,
+ bboxes=bboxes,
+ bbox_scores=scores,
+ keypoints=keypoints,
+ keypoint_scores=kpt_vis,
+ keypoints_visible=kpt_vis)
+
+ input_size = img_meta['input_size']
+ results.bboxes[:, 0::2].clamp_(0, input_size[0])
+ results.bboxes[:, 1::2].clamp_(0, input_size[1])
+
+ results_list.append(results.numpy())
+
+ return results_list
+
+ def switch_to_deploy(self, test_cfg: Optional[Dict]):
+ """Precompute and save the grid coordinates and strides."""
+
+ if getattr(self, 'deploy', False):
+ return
+
+ self.deploy = True
+
+ # grid generator
+ input_size = test_cfg.get('input_size', (640, 640))
+ featmaps = []
+ for s in self.featmap_strides:
+ featmaps.append(
+ torch.rand(1, 1, input_size[0] // s, input_size[1] // s))
+ featmap_sizes = [fmap.shape[2:] for fmap in featmaps]
+
+ self.mlvl_priors = self.prior_generator.grid_priors(
+ featmap_sizes, dtype=torch.float32, device='cpu')
+ self.flatten_priors = torch.cat(self.mlvl_priors)
+
+ mlvl_strides = [
+ self.flatten_priors.new_full((featmap_size.numel(), ), stride) for
+ featmap_size, stride in zip(featmap_sizes, self.featmap_strides)
+ ]
+ self.flatten_stride = torch.cat(mlvl_strides)
diff --git a/mmpose/models/heads/hybrid_heads/vis_head.py b/mmpose/models/heads/hybrid_heads/vis_head.py
index f95634541b..6f808670ad 100644
--- a/mmpose/models/heads/hybrid_heads/vis_head.py
+++ b/mmpose/models/heads/hybrid_heads/vis_head.py
@@ -185,7 +185,7 @@ def vis_accuracy(self, vis_pred_outputs, vis_labels, vis_weights=None):
correct = (predictions == vis_labels).float()
if vis_weights is not None:
accuracy = (correct * vis_weights).sum(dim=1) / (
- vis_weights.sum(dim=1, keepdims=True) + 1e-6)
+ vis_weights.sum(dim=1) + 1e-6)
else:
accuracy = correct.mean(dim=1)
return accuracy.mean()
diff --git a/mmpose/models/heads/hybrid_heads/yoloxpose_head.py b/mmpose/models/heads/hybrid_heads/yoloxpose_head.py
index bdd25f7851..07ae63a325 100644
--- a/mmpose/models/heads/hybrid_heads/yoloxpose_head.py
+++ b/mmpose/models/heads/hybrid_heads/yoloxpose_head.py
@@ -259,7 +259,8 @@ def __init__(
# build losses
self.loss_cls = MODELS.build(loss_cls)
- self.loss_obj = MODELS.build(loss_obj)
+ if loss_obj is not None:
+ self.loss_obj = MODELS.build(loss_obj)
self.loss_bbox = MODELS.build(loss_bbox)
self.loss_oks = MODELS.build(loss_oks)
self.loss_vis = MODELS.build(loss_vis)
@@ -362,6 +363,7 @@ def loss(self,
# 3.5 classification loss
cls_preds = flatten_cls_scores.view(-1,
self.num_classes)[pos_masks]
+ losses['overlaps'] = cls_targets
cls_targets = cls_targets.pow(self.overlaps_power).detach()
losses['loss_cls'] = self.loss_cls(cls_preds,
cls_targets) / num_total_samples
@@ -417,7 +419,8 @@ def _get_targets(
for i, target in enumerate(targets):
if torch.is_tensor(target[0]):
target = tuple(filter(lambda x: x.size(0) > 0, target))
- targets[i] = torch.cat(target)
+ if len(target) > 0:
+ targets[i] = torch.cat(target)
foreground_masks, cls_targets, obj_targets, obj_weights, \
bbox_targets, kpt_targets, vis_targets, vis_weights, pos_areas, \
@@ -477,19 +480,28 @@ def _get_targets_single(
truth annotations for current image.
Returns:
- # TODO: modify the description of returned values
- tuple:
- foreground_mask (list[Tensor]): Binary mask of foreground
- targets.
- cls_target (list[Tensor]): Classification targets of an image.
- obj_target (list[Tensor]): Objectness targets of an image.
- bbox_target (list[Tensor]): BBox targets of an image.
- bbox_aux_target (int): BBox aux targets of an image.
- num_pos_per_img (int): Number of positive samples in an image.
+ tuple: A tuple containing various target tensors for training:
+ - foreground_mask (Tensor): Binary mask indicating foreground
+ priors.
+ - cls_target (Tensor): Classification targets.
+ - obj_target (Tensor): Objectness targets.
+ - obj_weight (Tensor): Weights for objectness targets.
+ - bbox_target (Tensor): BBox targets.
+ - kpt_target (Tensor): Keypoints targets.
+ - vis_target (Tensor): Visibility targets for keypoints.
+ - vis_weight (Tensor): Weights for keypoints visibility
+ targets.
+ - pos_areas (Tensor): Areas of positive samples.
+ - pos_priors (Tensor): Priors corresponding to positive
+ samples.
+ - group_index (List[Tensor]): Indices of groups for positive
+ samples.
+ - num_pos_per_img (int): Number of positive samples.
"""
# TODO: change the shape of objectness to [num_priors]
num_priors = priors.size(0)
gt_instances = data_sample.gt_instance_labels
+ gt_fields = data_sample.get('gt_fields', dict())
num_gts = len(gt_instances)
# No target
@@ -548,7 +560,22 @@ def _get_targets_single(
# obj target
obj_target = torch.zeros_like(objectness)
obj_target[pos_inds] = 1
- obj_weight = obj_target.new_ones(obj_target.shape)
+
+ invalid_mask = gt_fields.get('heatmap_mask', None)
+ if invalid_mask is not None and (invalid_mask != 0.0).any():
+                # ignore the tokens that predict the unlabeled instances
+ pred_vis = (kpt_vis.unsqueeze(-1) > 0.3).float()
+ mean_kpts = (decoded_kpts * pred_vis).sum(dim=1) / pred_vis.sum(
+ dim=1).clamp(min=1e-8)
+ mean_kpts = mean_kpts.reshape(1, -1, 1, 2)
+ wh = invalid_mask.shape[-1]
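+            # normalize the mean keypoint coordinates to [-1, 1] so that they
+            # can be used as sampling grids for F.grid_sample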
+ grids = mean_kpts / (wh - 1) * 2 - 1
+ mask = invalid_mask.unsqueeze(0).float()
+ weight = F.grid_sample(
+ mask, grids, mode='bilinear', padding_mode='zeros')
+ obj_weight = 1.0 - weight.reshape(num_priors, 1)
+ else:
+ obj_weight = obj_target.new_ones(obj_target.shape)
# misc
foreground_mask = torch.zeros_like(objectness.squeeze()).to(torch.bool)
@@ -748,5 +775,8 @@ def decode_kpt_reg(self, pred_kpt_offsets: torch.Tensor,
def _flatten_predictions(self, preds: List[Tensor]):
"""Flattens the predictions from a list of tensors to a single
tensor."""
+ if len(preds) == 0:
+ return None
+
preds = [x.permute(0, 2, 3, 1).flatten(1, 2) for x in preds]
return torch.cat(preds, dim=1)
diff --git a/mmpose/models/heads/regression_heads/temporal_regression_head.py b/mmpose/models/heads/regression_heads/temporal_regression_head.py
index 61e585103f..902be8099e 100644
--- a/mmpose/models/heads/regression_heads/temporal_regression_head.py
+++ b/mmpose/models/heads/regression_heads/temporal_regression_head.py
@@ -1,11 +1,10 @@
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Sequence, Tuple, Union
-import numpy as np
import torch
from torch import Tensor, nn
-from mmpose.evaluation.functional import keypoint_pck_accuracy
+from mmpose.evaluation.functional import keypoint_mpjpe
from mmpose.registry import KEYPOINT_CODECS, MODELS
from mmpose.utils.tensor_utils import to_numpy
from mmpose.utils.typing import (ConfigType, OptConfigType, OptSampleList,
@@ -133,14 +132,13 @@ def loss(self,
losses.update(loss_pose3d=loss)
# calculate accuracy
- _, avg_acc, _ = keypoint_pck_accuracy(
+ mpjpe_err = keypoint_mpjpe(
pred=to_numpy(pred_outputs),
gt=to_numpy(lifting_target_label),
- mask=to_numpy(lifting_target_weight) > 0,
- thr=0.05,
- norm_factor=np.ones((pred_outputs.size(0), 3), dtype=np.float32))
+ mask=to_numpy(lifting_target_weight) > 0)
- mpjpe_pose = torch.tensor(avg_acc, device=lifting_target_label.device)
+ mpjpe_pose = torch.tensor(
+ mpjpe_err, device=lifting_target_label.device)
losses.update(mpjpe=mpjpe_pose)
return losses
diff --git a/mmpose/models/heads/regression_heads/trajectory_regression_head.py b/mmpose/models/heads/regression_heads/trajectory_regression_head.py
index a1608aaae7..b4d02f2ce3 100644
--- a/mmpose/models/heads/regression_heads/trajectory_regression_head.py
+++ b/mmpose/models/heads/regression_heads/trajectory_regression_head.py
@@ -1,11 +1,10 @@
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Sequence, Tuple, Union
-import numpy as np
import torch
from torch import Tensor, nn
-from mmpose.evaluation.functional import keypoint_pck_accuracy
+from mmpose.evaluation.functional import keypoint_mpjpe
from mmpose.registry import KEYPOINT_CODECS, MODELS
from mmpose.utils.tensor_utils import to_numpy
from mmpose.utils.typing import (ConfigType, OptConfigType, OptSampleList,
@@ -132,14 +131,13 @@ def loss(self,
losses.update(loss_traj=loss)
# calculate accuracy
- _, avg_acc, _ = keypoint_pck_accuracy(
+ mpjpe_err = keypoint_mpjpe(
pred=to_numpy(pred_outputs),
gt=to_numpy(lifting_target_label),
- mask=to_numpy(trajectory_weights) > 0,
- thr=0.05,
- norm_factor=np.ones((pred_outputs.size(0), 3), dtype=np.float32))
+ mask=to_numpy(trajectory_weights) > 0)
- mpjpe_traj = torch.tensor(avg_acc, device=lifting_target_label.device)
+ mpjpe_traj = torch.tensor(
+ mpjpe_err, device=lifting_target_label.device)
losses.update(mpjpe_traj=mpjpe_traj)
return losses
diff --git a/mmpose/models/losses/__init__.py b/mmpose/models/losses/__init__.py
index 92ed569bab..bec6c06846 100644
--- a/mmpose/models/losses/__init__.py
+++ b/mmpose/models/losses/__init__.py
@@ -1,10 +1,11 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .ae_loss import AssociativeEmbeddingLoss
from .bbox_loss import IoULoss
-from .classification_loss import BCELoss, JSDiscretLoss, KLDiscretLoss
+from .classification_loss import (BCELoss, JSDiscretLoss, KLDiscretLoss,
+ VariFocalLoss)
from .fea_dis_loss import FeaLoss
from .heatmap_loss import (AdaptiveWingLoss, KeypointMSELoss,
- KeypointOHKMMSELoss)
+ KeypointOHKMMSELoss, MLECCLoss)
from .logit_dis_loss import KDLoss
from .loss_wrappers import CombinedLoss, MultipleLossWrapper
from .regression_loss import (BoneLoss, L1Loss, MPJPELoss,
@@ -18,5 +19,6 @@
'SemiSupervisionLoss', 'SoftWingLoss', 'AdaptiveWingLoss', 'RLELoss',
'KLDiscretLoss', 'MultipleLossWrapper', 'JSDiscretLoss', 'CombinedLoss',
'AssociativeEmbeddingLoss', 'SoftWeightSmoothL1Loss',
- 'MPJPEVelocityJointLoss', 'FeaLoss', 'KDLoss', 'OKSLoss', 'IoULoss'
+ 'MPJPEVelocityJointLoss', 'FeaLoss', 'KDLoss', 'OKSLoss', 'IoULoss',
+ 'VariFocalLoss', 'MLECCLoss'
]
diff --git a/mmpose/models/losses/bbox_loss.py b/mmpose/models/losses/bbox_loss.py
index b216dcdb4a..2694076b26 100644
--- a/mmpose/models/losses/bbox_loss.py
+++ b/mmpose/models/losses/bbox_loss.py
@@ -41,7 +41,7 @@ def __init__(self,
self.mode = mode
self.eps = eps
- def forward(self, output, target):
+ def forward(self, output, target, target_weight=None):
"""Forward function.
Note:
@@ -64,6 +64,11 @@ def forward(self, output, target):
else:
raise NotImplementedError
+ if target_weight is not None:
+ for i in range(loss.ndim - target_weight.ndim):
+ target_weight = target_weight.unsqueeze(-1)
+ loss = loss * target_weight
+
if self.reduction == 'sum':
loss = loss.sum()
elif self.reduction == 'mean':
diff --git a/mmpose/models/losses/classification_loss.py b/mmpose/models/losses/classification_loss.py
index 2421e74819..0b70d88cfa 100644
--- a/mmpose/models/losses/classification_loss.py
+++ b/mmpose/models/losses/classification_loss.py
@@ -145,17 +145,31 @@ class KLDiscretLoss(nn.Module):
`_.
Args:
- beta (float): Temperature factor of Softmax.
+ beta (float): Temperature factor of Softmax. Default: 1.0.
label_softmax (bool): Whether to use Softmax on labels.
+ Default: False.
+        label_beta (float): Temperature factor of Softmax on labels.
+            Default: 10.0.
use_target_weight (bool): Option to use weighted loss.
Different joint types may have different target weights.
+        mask (list[int]): Indices of masked keypoints. Default: None.
+ mask_weight (float): Weight of masked keypoints. Default: 1.0.
"""
- def __init__(self, beta=1.0, label_softmax=False, use_target_weight=True):
+ def __init__(self,
+ beta=1.0,
+ label_softmax=False,
+ label_beta=10.0,
+ use_target_weight=True,
+ mask=None,
+ mask_weight=1.0):
super(KLDiscretLoss, self).__init__()
self.beta = beta
self.label_softmax = label_softmax
+ self.label_beta = label_beta
self.use_target_weight = use_target_weight
+ self.mask = mask
+ self.mask_weight = mask_weight
self.log_softmax = nn.LogSoftmax(dim=1)
self.kl_loss = nn.KLDivLoss(reduction='none')
@@ -164,7 +178,7 @@ def criterion(self, dec_outs, labels):
"""Criterion function."""
log_pt = self.log_softmax(dec_outs * self.beta)
if self.label_softmax:
- labels = F.softmax(labels * self.beta, dim=1)
+ labels = F.softmax(labels * self.label_beta, dim=1)
loss = torch.mean(self.kl_loss(log_pt, labels), dim=1)
return loss
@@ -178,7 +192,7 @@ def forward(self, pred_simcc, gt_simcc, target_weight):
target_weight (torch.Tensor[N, K] or torch.Tensor[N]):
Weights across different labels.
"""
- num_joints = pred_simcc[0].size(1)
+ N, K, _ = pred_simcc[0].shape
loss = 0
if self.use_target_weight:
@@ -190,9 +204,15 @@ def forward(self, pred_simcc, gt_simcc, target_weight):
pred = pred.reshape(-1, pred.size(-1))
target = target.reshape(-1, target.size(-1))
- loss += self.criterion(pred, target).mul(weight).sum()
+ t_loss = self.criterion(pred, target).mul(weight)
+
+ if self.mask is not None:
+ t_loss = t_loss.reshape(N, K)
+ t_loss[:, self.mask] = t_loss[:, self.mask] * self.mask_weight
- return loss / num_joints
+ loss = loss + t_loss.sum()
+
+ return loss / K
@MODELS.register_module()
@@ -234,3 +254,80 @@ def forward(self, features: torch.Tensor) -> torch.Tensor:
targets = torch.arange(n, dtype=torch.long, device=features.device)
loss = F.cross_entropy(logits, targets, reduction='sum')
return loss * self.loss_weight
+
+
+@MODELS.register_module()
+class VariFocalLoss(nn.Module):
+ """Varifocal loss.
+
+ Args:
+ use_target_weight (bool): Option to use weighted loss.
+ Different joint types may have different target weights.
+ reduction (str): Options are "none", "mean" and "sum".
+ loss_weight (float): Weight of the loss. Default: 1.0.
+ alpha (float): A balancing factor for the negative part of
+ Varifocal Loss. Defaults to 0.75.
+ gamma (float): Gamma parameter for the modulating factor.
+ Defaults to 2.0.
+ """
+
+ def __init__(self,
+ use_target_weight=False,
+ loss_weight=1.,
+ reduction='mean',
+ alpha=0.75,
+ gamma=2.0):
+ super().__init__()
+
+ assert reduction in ('mean', 'sum', 'none'), f'the argument ' \
+ f'`reduction` should be either \'mean\', \'sum\' or \'none\', ' \
+ f'but got {reduction}'
+
+ self.reduction = reduction
+ self.use_target_weight = use_target_weight
+ self.loss_weight = loss_weight
+ self.alpha = alpha
+ self.gamma = gamma
+
+ def criterion(self, output, target):
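+        """Compute the element-wise varifocal loss.
+
+        Negative positions (target close to 0) are down-weighted by
+        ``alpha * sigmoid(output) ** gamma``, while positive positions are
+        weighted by their continuous target score.
+        """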
+ label = (target > 1e-4).to(target)
+ weight = self.alpha * output.sigmoid().pow(
+ self.gamma) * (1 - label) + target
+ output = output.clip(min=-10, max=10)
+ vfl = (
+ F.binary_cross_entropy_with_logits(
+ output, target, reduction='none') * weight)
+ return vfl
+
+ def forward(self, output, target, target_weight=None):
+ """Forward function.
+
+ Note:
+ - batch_size: N
+ - num_labels: K
+
+ Args:
+ output (torch.Tensor[N, K]): Output classification.
+ target (torch.Tensor[N, K]): Target classification.
+ target_weight (torch.Tensor[N, K] or torch.Tensor[N]):
+ Weights across different labels.
+ """
+
+ if self.use_target_weight:
+ assert target_weight is not None
+ loss = self.criterion(output, target)
+ if target_weight.dim() == 1:
+ target_weight = target_weight.unsqueeze(1)
+ loss = (loss * target_weight)
+ else:
+ loss = self.criterion(output, target)
+
+ loss[torch.isinf(loss)] = 0.0
+ loss[torch.isnan(loss)] = 0.0
+
+ if self.reduction == 'sum':
+ loss = loss.sum()
+ elif self.reduction == 'mean':
+ loss = loss.mean()
+
+ return loss * self.loss_weight
diff --git a/mmpose/models/losses/heatmap_loss.py b/mmpose/models/losses/heatmap_loss.py
index ffe5cd1e80..4618e69ed6 100644
--- a/mmpose/models/losses/heatmap_loss.py
+++ b/mmpose/models/losses/heatmap_loss.py
@@ -453,3 +453,87 @@ def forward(self,
else:
loss = -(pos_loss.sum() + neg_loss.sum()) / num_pos
return loss * self.loss_weight
+
+
+@MODELS.register_module()
+class MLECCLoss(nn.Module):
+ """Maximum Likelihood Estimation loss for Coordinate Classification.
+
+ This loss function is designed to work with coordinate classification
+ problems where the likelihood of each target coordinate is maximized.
+
+ Args:
+ reduction (str): Specifies the reduction to apply to the output:
+ 'none' | 'mean' | 'sum'. Default: 'mean'.
+ mode (str): Specifies the mode of calculating loss:
+ 'linear' | 'square' | 'log'. Default: 'log'.
+ use_target_weight (bool): If True, uses weighted loss. Different
+ joint types may have different target weights. Defaults to False.
+ loss_weight (float): Weight of the loss. Defaults to 1.0.
+
+ Raises:
+ AssertionError: If the `reduction` or `mode` arguments are not in the
+ expected choices.
+ NotImplementedError: If the selected mode is not implemented.
+ """
+
+ def __init__(self,
+ reduction: str = 'mean',
+ mode: str = 'log',
+ use_target_weight: bool = False,
+ loss_weight: float = 1.0):
+ super().__init__()
+ assert reduction in ('mean', 'sum', 'none'), \
+ f"`reduction` should be either 'mean', 'sum', or 'none', " \
+ f'but got {reduction}'
+ assert mode in ('linear', 'square', 'log'), \
+ f"`mode` should be either 'linear', 'square', or 'log', " \
+ f'but got {mode}'
+
+ self.reduction = reduction
+ self.mode = mode
+ self.use_target_weight = use_target_weight
+ self.loss_weight = loss_weight
+
+ def forward(self, outputs, targets, target_weight=None):
+ """Forward pass for the MLECCLoss.
+
+ Args:
+ outputs (torch.Tensor): The predicted outputs.
+ targets (torch.Tensor): The ground truth targets.
+ target_weight (torch.Tensor, optional): Optional tensor of weights
+ for each target.
+
+ Returns:
+ torch.Tensor: Calculated loss based on the specified mode and
+ reduction.
+ """
+
+ assert len(outputs) == len(targets), \
+ 'Outputs and targets must have the same length'
+
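+        # joint likelihood over the coordinate axes (e.g. x- and y-axis 1-D
+        # heatmaps); both predictions and targets are assumed to be
+        # normalized distributions over the bins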
+ prob = 1.0
+ for o, t in zip(outputs, targets):
+ prob *= (o * t).sum(dim=-1)
+
+ if self.mode == 'linear':
+ loss = 1.0 - prob
+ elif self.mode == 'square':
+ loss = 1.0 - prob.pow(2)
+ elif self.mode == 'log':
+ loss = -torch.log(prob + 1e-4)
+
+ loss[torch.isnan(loss)] = 0.0
+
+ if self.use_target_weight:
+ assert target_weight is not None
+ for i in range(loss.ndim - target_weight.ndim):
+ target_weight = target_weight.unsqueeze(-1)
+ loss = loss * target_weight
+
+ if self.reduction == 'sum':
+ loss = loss.flatten(1).sum(dim=1)
+ elif self.reduction == 'mean':
+ loss = loss.flatten(1).mean(dim=1)
+
+ return loss * self.loss_weight
diff --git a/mmpose/models/necks/__init__.py b/mmpose/models/necks/__init__.py
index d4b4f51308..90d68013d5 100644
--- a/mmpose/models/necks/__init__.py
+++ b/mmpose/models/necks/__init__.py
@@ -4,10 +4,11 @@
from .fmap_proc_neck import FeatureMapProcessor
from .fpn import FPN
from .gap_neck import GlobalAveragePooling
+from .hybrid_encoder import HybridEncoder
from .posewarper_neck import PoseWarperNeck
from .yolox_pafpn import YOLOXPAFPN
__all__ = [
'GlobalAveragePooling', 'PoseWarperNeck', 'FPN', 'FeatureMapProcessor',
- 'ChannelMapper', 'YOLOXPAFPN', 'CSPNeXtPAFPN'
+ 'ChannelMapper', 'YOLOXPAFPN', 'CSPNeXtPAFPN', 'HybridEncoder'
]
diff --git a/mmpose/models/necks/hybrid_encoder.py b/mmpose/models/necks/hybrid_encoder.py
new file mode 100644
index 0000000000..6d9db8d1b8
--- /dev/null
+++ b/mmpose/models/necks/hybrid_encoder.py
@@ -0,0 +1,298 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from typing import List, Optional, Tuple
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from mmcv.cnn import ConvModule
+from mmengine.model import BaseModule, ModuleList
+from torch import Tensor
+
+from mmpose.models.utils import (DetrTransformerEncoder, RepVGGBlock,
+ SinePositionalEncoding)
+from mmpose.registry import MODELS
+from mmpose.utils.typing import ConfigType, OptConfigType
+
+
+class CSPRepLayer(BaseModule):
+ """CSPRepLayer, a layer that combines Cross Stage Partial Networks with
+ RepVGG Blocks.
+
+ Args:
+ in_channels (int): Number of input channels to the layer.
+ out_channels (int): Number of output channels from the layer.
+ num_blocks (int): The number of RepVGG blocks to be used in the layer.
+ Defaults to 3.
+ widen_factor (float): Expansion factor for intermediate channels.
+ Determines the hidden channel size based on out_channels.
+ Defaults to 1.0.
+ norm_cfg (dict): Configuration for normalization layers.
+ Defaults to Batch Normalization with trainable parameters.
+ act_cfg (dict): Configuration for activation layers.
+ Defaults to SiLU (Swish) with in-place operation.
+ """
+
+ def __init__(self,
+ in_channels: int,
+ out_channels: int,
+ num_blocks: int = 3,
+ widen_factor: float = 1.0,
+ norm_cfg: OptConfigType = dict(type='BN', requires_grad=True),
+ act_cfg: OptConfigType = dict(type='SiLU', inplace=True)):
+ super(CSPRepLayer, self).__init__()
+ hidden_channels = int(out_channels * widen_factor)
+ self.conv1 = ConvModule(
+ in_channels,
+ hidden_channels,
+ kernel_size=1,
+ norm_cfg=norm_cfg,
+ act_cfg=act_cfg)
+ self.conv2 = ConvModule(
+ in_channels,
+ hidden_channels,
+ kernel_size=1,
+ norm_cfg=norm_cfg,
+ act_cfg=act_cfg)
+
+ self.bottlenecks = nn.Sequential(*[
+ RepVGGBlock(hidden_channels, hidden_channels, act_cfg=act_cfg)
+ for _ in range(num_blocks)
+ ])
+ if hidden_channels != out_channels:
+ self.conv3 = ConvModule(
+ hidden_channels,
+ out_channels,
+ kernel_size=1,
+ norm_cfg=norm_cfg,
+ act_cfg=act_cfg)
+ else:
+ self.conv3 = nn.Identity()
+
+ def forward(self, x: Tensor) -> Tensor:
+ """Forward function.
+
+ Args:
+ x (Tensor): The input tensor.
+
+ Returns:
+ Tensor: The output tensor.
+ """
+ x_1 = self.conv1(x)
+ x_1 = self.bottlenecks(x_1)
+ x_2 = self.conv2(x)
+ return self.conv3(x_1 + x_2)
+
+
+@MODELS.register_module()
+class HybridEncoder(BaseModule):
+ """Hybrid encoder neck introduced in `RT-DETR` by Lyu et al (2023),
+ combining transformer encoders with a Feature Pyramid Network (FPN) and a
+ Path Aggregation Network (PAN).
+
+ Args:
+ encoder_cfg (ConfigType): Configuration for the transformer encoder.
+ projector (OptConfigType, optional): Configuration for an optional
+ projector module. Defaults to None.
+ num_encoder_layers (int, optional): Number of encoder layers.
+ Defaults to 1.
+ in_channels (List[int], optional): Input channels of feature maps.
+ Defaults to [512, 1024, 2048].
+ feat_strides (List[int], optional): Strides of feature maps.
+ Defaults to [8, 16, 32].
+        hidden_dim (int, optional): Unified hidden dimension that the input
+            feature maps are projected to. Defaults to 256.
+ use_encoder_idx (List[int], optional): Indices of encoder layers to
+ use. Defaults to [2].
+ pe_temperature (int, optional): Positional encoding temperature.
+ Defaults to 10000.
+ widen_factor (float, optional): Expansion factor for CSPRepLayer.
+ Defaults to 1.0.
+ deepen_factor (float, optional): Depth multiplier for CSPRepLayer.
+ Defaults to 1.0.
+ spe_learnable (bool, optional): Whether positional encoding is
+ learnable. Defaults to False.
+ output_indices (Optional[List[int]], optional): Indices of output
+ layers. Defaults to None.
+ norm_cfg (OptConfigType, optional): Configuration for normalization
+ layers. Defaults to Batch Normalization.
+ act_cfg (OptConfigType, optional): Configuration for activation
+ layers. Defaults to SiLU (Swish) with in-place operation.
+
+ .. _`RT-DETR`: https://arxiv.org/abs/2304.08069
+ """
+
+ def __init__(self,
+ encoder_cfg: ConfigType = dict(),
+ projector: OptConfigType = None,
+ num_encoder_layers: int = 1,
+ in_channels: List[int] = [512, 1024, 2048],
+ feat_strides: List[int] = [8, 16, 32],
+ hidden_dim: int = 256,
+ use_encoder_idx: List[int] = [2],
+ pe_temperature: int = 10000,
+ widen_factor: float = 1.0,
+ deepen_factor: float = 1.0,
+ spe_learnable: bool = False,
+ output_indices: Optional[List[int]] = None,
+ norm_cfg: OptConfigType = dict(type='BN', requires_grad=True),
+ act_cfg: OptConfigType = dict(type='SiLU', inplace=True)):
+ super(HybridEncoder, self).__init__()
+ self.in_channels = in_channels
+ self.feat_strides = feat_strides
+ self.hidden_dim = hidden_dim
+ self.use_encoder_idx = use_encoder_idx
+ self.num_encoder_layers = num_encoder_layers
+ self.pe_temperature = pe_temperature
+ self.output_indices = output_indices
+
+ # channel projection
+ self.input_proj = ModuleList()
+ for in_channel in in_channels:
+ self.input_proj.append(
+ ConvModule(
+ in_channel,
+ hidden_dim,
+ kernel_size=1,
+ padding=0,
+ norm_cfg=norm_cfg,
+ act_cfg=None))
+
+ # encoder transformer
+ if len(use_encoder_idx) > 0:
+ pos_enc_dim = self.hidden_dim // 2
+ self.encoder = ModuleList([
+ DetrTransformerEncoder(num_encoder_layers, encoder_cfg)
+ for _ in range(len(use_encoder_idx))
+ ])
+
+ self.sincos_pos_enc = SinePositionalEncoding(
+ pos_enc_dim,
+ learnable=spe_learnable,
+ temperature=self.pe_temperature,
+ spatial_dim=2)
+
+ # top-down fpn
+ lateral_convs = list()
+ fpn_blocks = list()
+ for idx in range(len(in_channels) - 1, 0, -1):
+ lateral_convs.append(
+ ConvModule(
+ hidden_dim,
+ hidden_dim,
+ 1,
+ 1,
+ norm_cfg=norm_cfg,
+ act_cfg=act_cfg))
+ fpn_blocks.append(
+ CSPRepLayer(
+ hidden_dim * 2,
+ hidden_dim,
+ round(3 * deepen_factor),
+ act_cfg=act_cfg,
+ widen_factor=widen_factor))
+ self.lateral_convs = ModuleList(lateral_convs)
+ self.fpn_blocks = ModuleList(fpn_blocks)
+
+ # bottom-up pan
+ downsample_convs = list()
+ pan_blocks = list()
+ for idx in range(len(in_channels) - 1):
+ downsample_convs.append(
+ ConvModule(
+ hidden_dim,
+ hidden_dim,
+ 3,
+ stride=2,
+ padding=1,
+ norm_cfg=norm_cfg,
+ act_cfg=act_cfg))
+ pan_blocks.append(
+ CSPRepLayer(
+ hidden_dim * 2,
+ hidden_dim,
+ round(3 * deepen_factor),
+ act_cfg=act_cfg,
+ widen_factor=widen_factor))
+ self.downsample_convs = ModuleList(downsample_convs)
+ self.pan_blocks = ModuleList(pan_blocks)
+
+ if projector is not None:
+ self.projector = MODELS.build(projector)
+ else:
+ self.projector = None
+
+ def forward(self, inputs: Tuple[Tensor]) -> Tuple[Tensor]:
+ """Forward function."""
+ assert len(inputs) == len(self.in_channels)
+
+ proj_feats = [
+ self.input_proj[i](inputs[i]) for i in range(len(inputs))
+ ]
+ # encoder
+ if self.num_encoder_layers > 0:
+ for i, enc_ind in enumerate(self.use_encoder_idx):
+ h, w = proj_feats[enc_ind].shape[2:]
+ # flatten [B, C, H, W] to [B, HxW, C]
+ src_flatten = proj_feats[enc_ind].flatten(2).permute(
+ 0, 2, 1).contiguous()
+
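+                # during ONNX export, reuse the positional encoding cached by
+                # `switch_to_deploy`; otherwise generate it on the fly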
+ if torch.onnx.is_in_onnx_export():
+ pos_enc = getattr(self, f'pos_enc_{i}')
+ else:
+ pos_enc = self.sincos_pos_enc(size=(h, w))
+ pos_enc = pos_enc.transpose(-1, -2).reshape(1, h * w, -1)
+ memory = self.encoder[i](
+ src_flatten, query_pos=pos_enc, key_padding_mask=None)
+
+ proj_feats[enc_ind] = memory.permute(
+ 0, 2, 1).contiguous().view([-1, self.hidden_dim, h, w])
+
+ # top-down fpn
+ inner_outs = [proj_feats[-1]]
+ for idx in range(len(self.in_channels) - 1, 0, -1):
+ feat_high = inner_outs[0]
+ feat_low = proj_feats[idx - 1]
+ feat_high = self.lateral_convs[len(self.in_channels) - 1 - idx](
+ feat_high)
+ inner_outs[0] = feat_high
+
+ upsample_feat = F.interpolate(
+ feat_high, scale_factor=2., mode='nearest')
+ inner_out = self.fpn_blocks[len(self.in_channels) - 1 - idx](
+ torch.cat([upsample_feat, feat_low], axis=1))
+ inner_outs.insert(0, inner_out)
+
+ # bottom-up pan
+ outs = [inner_outs[0]]
+ for idx in range(len(self.in_channels) - 1):
+ feat_low = outs[-1]
+ feat_high = inner_outs[idx + 1]
+ downsample_feat = self.downsample_convs[idx](feat_low) # Conv
+ out = self.pan_blocks[idx]( # CSPRepLayer
+ torch.cat([downsample_feat, feat_high], axis=1))
+ outs.append(out)
+
+ if self.output_indices is not None:
+ outs = [outs[i] for i in self.output_indices]
+
+ if self.projector is not None:
+ outs = self.projector(outs)
+
+ return tuple(outs)
+
+ def switch_to_deploy(self, test_cfg):
+ """Switch to deploy mode."""
+
+ if getattr(self, 'deploy', False):
+ return
+
+ if self.num_encoder_layers > 0:
+ for i, enc_ind in enumerate(self.use_encoder_idx):
+ h, w = test_cfg['input_size']
+ h = int(h / 2**(3 + enc_ind))
+ w = int(w / 2**(3 + enc_ind))
+ pos_enc = self.sincos_pos_enc(size=(h, w))
+ pos_enc = pos_enc.transpose(-1, -2).reshape(1, h * w, -1)
+ self.register_buffer(f'pos_enc_{i}', pos_enc)
+
+ self.deploy = True
diff --git a/mmpose/models/pose_estimators/base.py b/mmpose/models/pose_estimators/base.py
index 474e0a49d6..216f592fda 100644
--- a/mmpose/models/pose_estimators/base.py
+++ b/mmpose/models/pose_estimators/base.py
@@ -73,6 +73,16 @@ def __init__(self,
torch.nn.SyncBatchNorm.convert_sync_batchnorm(self)
print_log('Using SyncBatchNorm()', 'current')
+ def switch_to_deploy(self):
+ """Switch the sub-modules to deploy mode."""
+ for name, layer in self.named_modules():
+ if layer == self:
+ continue
+ if callable(getattr(layer, 'switch_to_deploy', None)):
+ print_log(f'module {name} has been switched to deploy mode',
+ 'current')
+ layer.switch_to_deploy(self.test_cfg)
+
@property
def with_neck(self) -> bool:
"""bool: whether the pose estimator has a neck."""
diff --git a/mmpose/models/task_modules/assigners/sim_ota_assigner.py b/mmpose/models/task_modules/assigners/sim_ota_assigner.py
index 69c7ed677e..b43851cf15 100644
--- a/mmpose/models/task_modules/assigners/sim_ota_assigner.py
+++ b/mmpose/models/task_modules/assigners/sim_ota_assigner.py
@@ -1,5 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.
-from typing import Tuple
+from typing import Optional, Tuple
import torch
import torch.nn.functional as F
@@ -28,6 +28,8 @@ class SimOTAAssigner:
vis_weight (float): Weight of keypoint visibility cost. Defaults to 0.0
dynamic_k_indicator (str): Cost type for calculating dynamic-k,
either 'iou' or 'oks'. Defaults to 'iou'.
+ use_keypoints_for_center (bool): Whether to use keypoints to determine
+ if a prior is in the center of a gt. Defaults to False.
iou_calculator (dict): Config of IoU calculation method.
Defaults to dict(type='BBoxOverlaps2D').
oks_calculator (dict): Config of OKS calculation method.
@@ -42,6 +44,7 @@ def __init__(self,
oks_weight: float = 3.0,
vis_weight: float = 0.0,
dynamic_k_indicator: str = 'iou',
+ use_keypoints_for_center: bool = False,
iou_calculator: ConfigType = dict(type='BBoxOverlaps2D'),
oks_calculator: ConfigType = dict(type='PoseOKS')):
self.center_radius = center_radius
@@ -55,6 +58,7 @@ def __init__(self,
f'but got {dynamic_k_indicator}'
self.dynamic_k_indicator = dynamic_k_indicator
+ self.use_keypoints_for_center = use_keypoints_for_center
self.iou_calculator = TASK_UTILS.build(iou_calculator)
self.oks_calculator = TASK_UTILS.build(oks_calculator)
@@ -108,7 +112,7 @@ def assign(self, pred_instances: InstanceData, gt_instances: InstanceData,
labels=assigned_labels)
valid_mask, is_in_boxes_and_center = self.get_in_gt_and_in_center_info(
- priors, gt_bboxes)
+ priors, gt_bboxes, gt_keypoints, gt_keypoints_visible)
valid_decoded_bbox = decoded_bboxes[valid_mask]
valid_pred_scores = pred_scores[valid_mask]
valid_pred_kpts = keypoints[valid_mask]
@@ -203,8 +207,13 @@ def assign(self, pred_instances: InstanceData, gt_instances: InstanceData,
max_overlaps=max_overlaps,
labels=assigned_labels)
- def get_in_gt_and_in_center_info(self, priors: Tensor, gt_bboxes: Tensor
- ) -> Tuple[Tensor, Tensor]:
+ def get_in_gt_and_in_center_info(
+ self,
+ priors: Tensor,
+ gt_bboxes: Tensor,
+ gt_keypoints: Optional[Tensor] = None,
+ gt_keypoints_visible: Optional[Tensor] = None,
+ ) -> Tuple[Tensor, Tensor]:
"""Get the information of which prior is in gt bboxes and gt center
priors."""
num_gt = gt_bboxes.size(0)
@@ -227,6 +236,15 @@ def get_in_gt_and_in_center_info(self, priors: Tensor, gt_bboxes: Tensor
# is prior centers in gt centers
gt_cxs = (gt_bboxes[:, 0] + gt_bboxes[:, 2]) / 2.0
gt_cys = (gt_bboxes[:, 1] + gt_bboxes[:, 3]) / 2.0
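+        # optionally use the centroid of the visible keypoints, instead of
+        # the bbox center, as the instance center for selecting center priors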
+ if self.use_keypoints_for_center and gt_keypoints_visible is not None:
+ gt_kpts_cts = (gt_keypoints * gt_keypoints_visible.unsqueeze(-1)
+ ).sum(dim=-2) / gt_keypoints_visible.sum(
+ dim=-1, keepdims=True).clip(min=0)
+ gt_kpts_cts = gt_kpts_cts.to(gt_bboxes)
+ valid_mask = gt_keypoints_visible.sum(-1) > 0
+ gt_cxs[valid_mask] = gt_kpts_cts[valid_mask][..., 0]
+ gt_cys[valid_mask] = gt_kpts_cts[valid_mask][..., 1]
+
ct_box_l = gt_cxs - self.center_radius * repeated_stride_x
ct_box_t = gt_cys - self.center_radius * repeated_stride_y
ct_box_r = gt_cxs + self.center_radius * repeated_stride_x
diff --git a/mmpose/models/task_modules/prior_generators/mlvl_point_generator.py b/mmpose/models/task_modules/prior_generators/mlvl_point_generator.py
index 7dc6a6199b..aed01af734 100644
--- a/mmpose/models/task_modules/prior_generators/mlvl_point_generator.py
+++ b/mmpose/models/task_modules/prior_generators/mlvl_point_generator.py
@@ -21,13 +21,17 @@ class MlvlPointGenerator:
in multiple feature levels in order (w, h).
offset (float): The offset of points, the value is normalized with
corresponding stride. Defaults to 0.5.
+ centralize_points (bool): Whether to centralize the points to
+ the center of anchors. Defaults to False.
"""
def __init__(self,
strides: Union[List[int], List[Tuple[int, int]]],
- offset: float = 0.5) -> None:
+ offset: float = 0.5,
+ centralize_points: bool = False) -> None:
self.strides = [_pair(stride) for stride in strides]
- self.offset = offset
+ self.centralize_points = centralize_points
+ self.offset = offset if not centralize_points else 0.0
@property
def num_levels(self) -> int:
@@ -138,6 +142,10 @@ def single_level_grid_priors(self,
# can convert to ONNX correctly
shift_y = shift_y.to(dtype)
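+        # move the points from the top-left corner of each stride cell to its
+        # pixel center (the base offset is forced to 0 when centralize_points
+        # is enabled)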
+ if self.centralize_points:
+ shift_x = shift_x + float(stride_w - 1) / 2.0
+ shift_y = shift_y + float(stride_h - 1) / 2.0
+
shift_xx, shift_yy = self._meshgrid(shift_x, shift_y)
if not with_stride:
shifts = torch.stack([shift_xx, shift_yy], dim=-1)
diff --git a/mmpose/models/utils/__init__.py b/mmpose/models/utils/__init__.py
index 539da6ea2f..92ad02b36f 100644
--- a/mmpose/models/utils/__init__.py
+++ b/mmpose/models/utils/__init__.py
@@ -4,11 +4,14 @@
from .csp_layer import CSPLayer
from .misc import filter_scores_and_topk
from .ops import FrozenBatchNorm2d, inverse_sigmoid
+from .reparam_layers import RepVGGBlock
from .rtmcc_block import RTMCCBlock, rope
-from .transformer import PatchEmbed, nchw_to_nlc, nlc_to_nchw
+from .transformer import (DetrTransformerEncoder, GAUEncoder, PatchEmbed,
+ SinePositionalEncoding, nchw_to_nlc, nlc_to_nchw)
__all__ = [
'PatchEmbed', 'nchw_to_nlc', 'nlc_to_nchw', 'pvt_convert', 'RTMCCBlock',
'rope', 'check_and_update_config', 'filter_scores_and_topk', 'CSPLayer',
- 'FrozenBatchNorm2d', 'inverse_sigmoid'
+ 'FrozenBatchNorm2d', 'inverse_sigmoid', 'GAUEncoder',
+ 'SinePositionalEncoding', 'RepVGGBlock', 'DetrTransformerEncoder'
]
diff --git a/mmpose/models/utils/reparam_layers.py b/mmpose/models/utils/reparam_layers.py
new file mode 100644
index 0000000000..3ba196294f
--- /dev/null
+++ b/mmpose/models/utils/reparam_layers.py
@@ -0,0 +1,217 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import types
+from typing import Dict, Optional
+
+import numpy as np
+import torch
+import torch.nn as nn
+from mmcv.cnn import ConvModule, build_activation_layer, build_norm_layer
+from mmengine.model import BaseModule
+from torch import Tensor
+
+from mmpose.utils.typing import OptConfigType
+
+
+class RepVGGBlock(BaseModule):
+ """A block in RepVGG architecture, supporting optional normalization in the
+ identity branch.
+
+ This block consists of 3x3 and 1x1 convolutions, with an optional identity
+ shortcut branch that includes normalization.
+
+ Args:
+ in_channels (int): The input channels of the block.
+ out_channels (int): The output channels of the block.
+ stride (int): The stride of the block. Defaults to 1.
+ padding (int): The padding of the block. Defaults to 1.
+ dilation (int): The dilation of the block. Defaults to 1.
+ groups (int): The groups of the block. Defaults to 1.
+ padding_mode (str): The padding mode of the block. Defaults to 'zeros'.
+ norm_cfg (dict): The config dict for normalization layers.
+ Defaults to dict(type='BN').
+ act_cfg (dict): The config dict for activation layers.
+ Defaults to dict(type='ReLU').
+ without_branch_norm (bool): Whether to skip branch_norm.
+ Defaults to True.
+ init_cfg (dict): The config dict for initialization. Defaults to None.
+ """
+
+ def __init__(self,
+ in_channels: int,
+ out_channels: int,
+ stride: int = 1,
+ padding: int = 1,
+ dilation: int = 1,
+ groups: int = 1,
+ padding_mode: str = 'zeros',
+ norm_cfg: OptConfigType = dict(type='BN'),
+ act_cfg: OptConfigType = dict(type='ReLU'),
+ without_branch_norm: bool = True,
+ init_cfg: OptConfigType = None):
+ super(RepVGGBlock, self).__init__(init_cfg)
+
+ self.in_channels = in_channels
+ self.out_channels = out_channels
+ self.stride = stride
+ self.padding = padding
+ self.dilation = dilation
+ self.groups = groups
+ self.norm_cfg = norm_cfg
+ self.act_cfg = act_cfg
+
+        # check whether the input and output shapes are the same; if so (and
+        # `without_branch_norm` is False), add a normalized identity shortcut
+ self.branch_norm = None
+ if out_channels == in_channels and stride == 1 and \
+ padding == dilation and not without_branch_norm:
+ self.branch_norm = build_norm_layer(norm_cfg, in_channels)[1]
+
+ self.branch_3x3 = ConvModule(
+ self.in_channels,
+ self.out_channels,
+ 3,
+ stride=self.stride,
+ padding=self.padding,
+ groups=self.groups,
+ dilation=self.dilation,
+ norm_cfg=self.norm_cfg,
+ act_cfg=None)
+
+ self.branch_1x1 = ConvModule(
+ self.in_channels,
+ self.out_channels,
+ 1,
+ groups=self.groups,
+ norm_cfg=self.norm_cfg,
+ act_cfg=None)
+
+ self.act = build_activation_layer(act_cfg)
+
+ def forward(self, x: Tensor) -> Tensor:
+ """Forward pass through the RepVGG block.
+
+ The output is the sum of 3x3 and 1x1 convolution outputs,
+ along with the normalized identity branch output, followed by
+ activation.
+
+ Args:
+ x (Tensor): The input tensor.
+
+ Returns:
+ Tensor: The output tensor.
+ """
+
+ if self.branch_norm is None:
+ branch_norm_out = 0
+ else:
+ branch_norm_out = self.branch_norm(x)
+
+ out = self.branch_3x3(x) + self.branch_1x1(x) + branch_norm_out
+
+ out = self.act(out)
+
+ return out
+
+ def _pad_1x1_to_3x3_tensor(self, kernel1x1):
+ """Pad 1x1 tensor to 3x3.
+ Args:
+ kernel1x1 (Tensor): The input 1x1 kernel need to be padded.
+
+ Returns:
+ Tensor: 3x3 kernel after padded.
+ """
+ if kernel1x1 is None:
+ return 0
+ else:
+ return torch.nn.functional.pad(kernel1x1, [1, 1, 1, 1])
+
+ def _fuse_bn_tensor(self, branch: nn.Module) -> Tensor:
+ """Derives the equivalent kernel and bias of a specific branch layer.
+
+ Args:
+ branch (nn.Module): The layer that needs to be equivalently
+ transformed, which can be nn.Sequential or nn.Batchnorm2d
+
+ Returns:
+ tuple: Equivalent kernel and bias
+ """
+ if branch is None:
+ return 0, 0
+
+ if isinstance(branch, ConvModule):
+ kernel = branch.conv.weight
+ running_mean = branch.bn.running_mean
+ running_var = branch.bn.running_var
+ gamma = branch.bn.weight
+ beta = branch.bn.bias
+ eps = branch.bn.eps
+ else:
+ assert isinstance(branch, (nn.SyncBatchNorm, nn.BatchNorm2d))
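+            # represent the identity (norm-only) branch as a 3x3 kernel with
+            # a single 1 at the center of each channel, so that it can be
+            # fused like an ordinary convolution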
+ if not hasattr(self, 'id_tensor'):
+ input_dim = self.in_channels // self.groups
+ kernel_value = np.zeros((self.in_channels, input_dim, 3, 3),
+ dtype=np.float32)
+ for i in range(self.in_channels):
+ kernel_value[i, i % input_dim, 1, 1] = 1
+ self.id_tensor = torch.from_numpy(kernel_value).to(
+ branch.weight.device)
+ kernel = self.id_tensor
+ running_mean = branch.running_mean
+ running_var = branch.running_var
+ gamma = branch.weight
+ beta = branch.bias
+ eps = branch.eps
+
+ std = (running_var + eps).sqrt()
+ t = (gamma / std).reshape(-1, 1, 1, 1)
+ return kernel * t, beta - running_mean * gamma / std
+
+ def get_equivalent_kernel_bias(self):
+ """Derives the equivalent kernel and bias in a differentiable way.
+
+ Returns:
+ tuple: Equivalent kernel and bias
+ """
+ kernel3x3, bias3x3 = self._fuse_bn_tensor(self.branch_3x3)
+ kernel1x1, bias1x1 = self._fuse_bn_tensor(self.branch_1x1)
+ kernelid, biasid = (0, 0) if self.branch_norm is None else \
+ self._fuse_bn_tensor(self.branch_norm)
+
+ return (kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid,
+ bias3x3 + bias1x1 + biasid)
+
+ def switch_to_deploy(self, test_cfg: Optional[Dict] = None):
+ """Switches the block to deployment mode.
+
+ In deployment mode, the block uses a single convolution operation
+ derived from the equivalent kernel and bias, replacing the original
+ branches. This reduces computational complexity during inference.
+ """
+ if getattr(self, 'deploy', False):
+ return
+
+ kernel, bias = self.get_equivalent_kernel_bias()
+ self.conv_reparam = nn.Conv2d(
+ in_channels=self.branch_3x3.conv.in_channels,
+ out_channels=self.branch_3x3.conv.out_channels,
+ kernel_size=self.branch_3x3.conv.kernel_size,
+ stride=self.branch_3x3.conv.stride,
+ padding=self.branch_3x3.conv.padding,
+ dilation=self.branch_3x3.conv.dilation,
+ groups=self.branch_3x3.conv.groups,
+ bias=True)
+ self.conv_reparam.weight.data = kernel
+ self.conv_reparam.bias.data = bias
+ for para in self.parameters():
+ para.detach_()
+ self.__delattr__('branch_3x3')
+ self.__delattr__('branch_1x1')
+ if hasattr(self, 'branch_norm'):
+ self.__delattr__('branch_norm')
+
+ def _forward(self, x):
+ return self.act(self.conv_reparam(x))
+
+ self.forward = types.MethodType(_forward, self)
+
+ self.deploy = True
diff --git a/mmpose/models/utils/rtmcc_block.py b/mmpose/models/utils/rtmcc_block.py
index bd4929454c..0a16701c0f 100644
--- a/mmpose/models/utils/rtmcc_block.py
+++ b/mmpose/models/utils/rtmcc_block.py
@@ -8,6 +8,8 @@
from mmengine.utils import digit_version
from mmengine.utils.dl_utils import TORCH_VERSION
+from .transformer import ScaleNorm
+
def rope(x, dim):
"""Applies Rotary Position Embedding to input tensor.
@@ -77,38 +79,6 @@ def forward(self, x):
return x * self.scale
-class ScaleNorm(nn.Module):
- """Scale Norm.
-
- Args:
- dim (int): The dimension of the scale vector.
- eps (float, optional): The minimum value in clamp. Defaults to 1e-5.
-
- Reference:
- `Transformers without Tears: Improving the Normalization
- of Self-Attention `_
- """
-
- def __init__(self, dim, eps=1e-5):
- super().__init__()
- self.scale = dim**-0.5
- self.eps = eps
- self.g = nn.Parameter(torch.ones(1))
-
- def forward(self, x):
- """Forward function.
-
- Args:
- x (torch.Tensor): Input tensor.
-
- Returns:
- torch.Tensor: The tensor after applying scale norm.
- """
-
- norm = torch.norm(x, dim=2, keepdim=True) * self.scale
- return x / norm.clamp(min=self.eps) * self.g
-
-
class RTMCCBlock(nn.Module):
"""Gated Attention Unit (GAU) in RTMBlock.
@@ -198,13 +168,15 @@ def __init__(self,
nn.init.xavier_uniform_(self.uv.weight)
- if act_fn == 'SiLU':
+ if act_fn == 'SiLU' or act_fn == nn.SiLU:
assert digit_version(TORCH_VERSION) >= digit_version('1.7.0'), \
'SiLU activation requires PyTorch version >= 1.7'
self.act_fn = nn.SiLU(True)
- else:
+ elif act_fn == 'ReLU' or act_fn == nn.ReLU:
self.act_fn = nn.ReLU(True)
+ else:
+ raise NotImplementedError
if in_token_dims == out_token_dims:
self.shortcut = True
diff --git a/mmpose/models/utils/transformer.py b/mmpose/models/utils/transformer.py
index 103b9e9970..987b865808 100644
--- a/mmpose/models/utils/transformer.py
+++ b/mmpose/models/utils/transformer.py
@@ -1,12 +1,24 @@
# Copyright (c) OpenMMLab. All rights reserved.
import math
-from typing import Sequence
+from typing import Optional, Sequence, Union
+import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import build_conv_layer, build_norm_layer
-from mmengine.model import BaseModule
-from mmengine.utils import to_2tuple
+from mmcv.cnn.bricks import DropPath
+from mmcv.cnn.bricks.transformer import FFN, MultiheadAttention
+from mmengine.model import BaseModule, ModuleList
+from mmengine.utils import digit_version, to_2tuple
+from mmengine.utils.dl_utils import TORCH_VERSION
+from torch import Tensor
+
+from mmpose.utils.typing import ConfigType, OptConfigType
+
+try:
+ from fairscale.nn.checkpoint import checkpoint_wrapper
+except ImportError:
+ checkpoint_wrapper = None
def nlc_to_nchw(x, hw_shape):
@@ -367,3 +379,524 @@ def forward(self, x, input_size):
x = self.norm(x) if self.norm else x
x = self.reduction(x)
return x, output_size
+
+
+class ScaleNorm(nn.Module):
+ """Scale Norm.
+
+ Args:
+ dim (int): The dimension of the scale vector.
+ eps (float, optional): The minimum value in clamp. Defaults to 1e-5.
+
+ Reference:
+ `Transformers without Tears: Improving the Normalization
+            of Self-Attention <https://arxiv.org/abs/1910.05895>`_
+ """
+
+ def __init__(self, dim, eps=1e-5):
+ super().__init__()
+ self.scale = dim**-0.5
+ self.eps = eps
+ self.g = nn.Parameter(torch.ones(1))
+
+ def forward(self, x):
+ """Forward function.
+
+ Args:
+ x (torch.Tensor): Input tensor.
+
+ Returns:
+ torch.Tensor: The tensor after applying scale norm.
+ """
+
+ if torch.onnx.is_in_onnx_export() and \
+ digit_version(TORCH_VERSION) >= digit_version('1.12'):
+
+ norm = torch.linalg.norm(x, dim=-1, keepdim=True)
+
+ else:
+ norm = torch.norm(x, dim=-1, keepdim=True)
+ norm = norm * self.scale
+ return x / norm.clamp(min=self.eps) * self.g
+
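+# In effect, the ScaleNorm above computes y = g * x / max(||x||_2 / sqrt(dim),
+# eps) along the last axis. A rough equivalence check (illustrative sketch,
+# not part of this patch):
+#
+#   sn = ScaleNorm(dim=256)
+#   x = torch.randn(2, 17, 256)
+#   ref = x / (x.norm(dim=-1, keepdim=True) * 256 ** -0.5).clamp(min=1e-5)
+#   assert torch.allclose(sn(x), ref * sn.g)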
+
+class SinePositionalEncoding(nn.Module):
+ """Sine Positional Encoding Module. This module implements sine positional
+ encoding, which is commonly used in transformer-based models to add
+ positional information to the input sequences. It uses sine and cosine
+ functions to create positional embeddings for each element in the input
+ sequence.
+
+ Args:
+        out_channels (int): The number of feature channels of the positional
+            encoding. Must be an even number.
+        temperature (int): A temperature parameter used to scale
+            the positional encodings. Defaults to 1e5.
+ spatial_dim (int): The number of spatial dimension of input
+ feature. 1 represents sequence data and 2 represents grid data.
+ Defaults to 1.
+ learnable (bool): Whether to optimize the frequency base. Defaults
+ to False.
+ eval_size (int, tuple[int], optional): The fixed spatial size of
+ input features. Defaults to None.
+ """
+
+ def __init__(
+ self,
+ out_channels: int,
+ spatial_dim: int = 1,
+ temperature: int = 1e5,
+ learnable: bool = False,
+ eval_size: Optional[Union[int, Sequence[int]]] = None,
+ ) -> None:
+
+ super().__init__()
+
+ assert out_channels % 2 == 0
+ assert temperature > 0
+
+ self.spatial_dim = spatial_dim
+ self.out_channels = out_channels
+ self.temperature = temperature
+ self.eval_size = eval_size
+ self.learnable = learnable
+
+ pos_dim = out_channels // 2
+ dim_t = torch.arange(pos_dim, dtype=torch.float32) / pos_dim
+ dim_t = self.temperature**(dim_t)
+
+ if not learnable:
+ self.register_buffer('dim_t', dim_t)
+ else:
+ self.dim_t = nn.Parameter(dim_t.detach())
+
+ # set parameters
+ if eval_size:
+ if hasattr(self, f'pos_enc_{eval_size}'):
+ delattr(self, f'pos_enc_{eval_size}')
+ pos_enc = self.generate_pos_encoding(size=eval_size)
+ self.register_buffer(f'pos_enc_{eval_size}', pos_enc)
+
+ def forward(self, *args, **kwargs):
+ return self.generate_pos_encoding(*args, **kwargs)
+
+ def generate_pos_encoding(self,
+ size: Union[int, Sequence[int]] = None,
+ position: Optional[Tensor] = None):
+ """Generate positional encoding for input features.
+
+ Args:
+ size (int or tuple[int]): Size of the input features. Required
+ if position is None.
+ position (Tensor, optional): Position tensor. Required if size
+ is None.
+ """
+
+ assert (size is not None) ^ (position is not None)
+
+ if (not (self.learnable
+ and self.training)) and size is not None and hasattr(
+ self, f'pos_enc_{size}'):
+ return getattr(self, f'pos_enc_{size}')
+
+ if self.spatial_dim == 1:
+ if size is not None:
+ if isinstance(size, (tuple, list)):
+ size = size[0]
+ position = torch.arange(
+ size, dtype=torch.float32, device=self.dim_t.device)
+
+ dim_t = self.dim_t.reshape(*((1, ) * position.ndim), -1)
+ freq = position.unsqueeze(-1) / dim_t
+ pos_enc = torch.cat((freq.cos(), freq.sin()), dim=-1)
+
+ elif self.spatial_dim == 2:
+ if size is not None:
+ if isinstance(size, (tuple, list)):
+ h, w = size[:2]
+ elif isinstance(size, (int, float)):
+ h, w = int(size), int(size)
+ else:
+ raise ValueError(f'got invalid type {type(size)} for size')
+ grid_h, grid_w = torch.meshgrid(
+ torch.arange(
+ int(h), dtype=torch.float32, device=self.dim_t.device),
+ torch.arange(
+ int(w), dtype=torch.float32, device=self.dim_t.device))
+ grid_h, grid_w = grid_h.flatten(), grid_w.flatten()
+ else:
+ assert position.size(-1) == 2
+ grid_h, grid_w = torch.unbind(position, dim=-1)
+
+ dim_t = self.dim_t.reshape(*((1, ) * grid_h.ndim), -1)
+ freq_h = grid_h.unsqueeze(-1) / dim_t
+ freq_w = grid_w.unsqueeze(-1) / dim_t
+ pos_enc_h = torch.cat((freq_h.cos(), freq_h.sin()), dim=-1)
+ pos_enc_w = torch.cat((freq_w.cos(), freq_w.sin()), dim=-1)
+ pos_enc = torch.stack((pos_enc_h, pos_enc_w), dim=-1)
+
+ return pos_enc
+
+ @staticmethod
+ def apply_additional_pos_enc(feature: Tensor,
+ pos_enc: Tensor,
+ spatial_dim: int = 1):
+ """Apply additional positional encoding to input features.
+
+ Args:
+ feature (Tensor): Input feature tensor.
+ pos_enc (Tensor): Positional encoding tensor.
+ spatial_dim (int): Spatial dimension of input features.
+ """
+
+ assert spatial_dim in (1, 2), f'the argument spatial_dim must be ' \
+ f'either 1 or 2, but got {spatial_dim}'
+ if spatial_dim == 2:
+ pos_enc = pos_enc.flatten(-2)
+ for _ in range(feature.ndim - pos_enc.ndim):
+ pos_enc = pos_enc.unsqueeze(0)
+ return feature + pos_enc
+
+ @staticmethod
+ def apply_rotary_pos_enc(feature: Tensor,
+ pos_enc: Tensor,
+ spatial_dim: int = 1):
+ """Apply rotary positional encoding to input features.
+
+ Args:
+ feature (Tensor): Input feature tensor.
+ pos_enc (Tensor): Positional encoding tensor.
+ spatial_dim (int): Spatial dimension of input features.
+ """
+
+ assert spatial_dim in (1, 2), f'the argument spatial_dim must be ' \
+ f'either 1 or 2, but got {spatial_dim}'
+
+ for _ in range(feature.ndim - pos_enc.ndim + spatial_dim - 1):
+ pos_enc = pos_enc.unsqueeze(0)
+
+ x1, x2 = torch.chunk(feature, 2, dim=-1)
+ if spatial_dim == 1:
+ cos, sin = torch.chunk(pos_enc, 2, dim=-1)
+ feature = torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin),
+ dim=-1)
+ elif spatial_dim == 2:
+ pos_enc_h, pos_enc_w = torch.unbind(pos_enc, dim=-1)
+ cos_h, sin_h = torch.chunk(pos_enc_h, 2, dim=-1)
+ cos_w, sin_w = torch.chunk(pos_enc_w, 2, dim=-1)
+ feature = torch.cat(
+ (x1 * cos_h - x2 * sin_h, x1 * cos_w + x2 * sin_w), dim=-1)
+
+ return feature
+
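+# Hedged usage sketch for SinePositionalEncoding (the shapes below are
+# assumptions for illustration, not requirements stated in this patch):
+#
+#   pe = SinePositionalEncoding(out_channels=128, spatial_dim=1)
+#   pos_enc = pe.generate_pos_encoding(size=17)           # (17, 128)
+#   q = torch.randn(2, 17, 128)
+#   q = SinePositionalEncoding.apply_rotary_pos_enc(q, pos_enc, spatial_dim=1)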
+
+class ChannelWiseScale(nn.Module):
+ """Scale vector by element multiplications.
+
+ Args:
+ dim (int): The dimension of the scale vector.
+ init_value (float, optional): The initial value of the scale vector.
+ Defaults to 1.0.
+ trainable (bool, optional): Whether the scale vector is trainable.
+ Defaults to True.
+ """
+
+ def __init__(self, dim, init_value=1., trainable=True):
+ super().__init__()
+ self.scale = nn.Parameter(
+ init_value * torch.ones(dim), requires_grad=trainable)
+
+ def forward(self, x):
+ """Forward function."""
+
+ return x * self.scale
+
+
+class GAUEncoder(BaseModule):
+ """Gated Attention Unit (GAU) Encoder.
+
+ Args:
+ in_token_dims (int): The input token dimension.
+ out_token_dims (int): The output token dimension.
+ expansion_factor (int, optional): The expansion factor of the
+ intermediate token dimension. Defaults to 2.
+ s (int, optional): The self-attention feature dimension.
+ Defaults to 128.
+ eps (float, optional): The minimum value in clamp. Defaults to 1e-5.
+ dropout_rate (float, optional): The dropout rate. Defaults to 0.0.
+ drop_path (float, optional): The drop path rate. Defaults to 0.0.
+ act_fn (str, optional): The activation function which should be one
+ of the following options:
+
+ - 'ReLU': ReLU activation.
+ - 'SiLU': SiLU activation.
+
+ Defaults to 'SiLU'.
+ bias (bool, optional): Whether to use bias in linear layers.
+ Defaults to False.
+        pos_enc (str, optional): The type of positional encoding applied to
+            the attention queries and keys. Options are 'none', 'add' and
+            'rope'. Defaults to 'none'.
+        spatial_dim (int, optional): The spatial dimension of the inputs.
+            Defaults to 1.
+
+ Reference:
+ `Transformer Quality in Linear Time
+            <https://arxiv.org/abs/2202.10447>`_
+ """
+
+ def __init__(self,
+ in_token_dims,
+ out_token_dims,
+ expansion_factor=2,
+ s=128,
+ eps=1e-5,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ bias=False,
+ pos_enc: str = 'none',
+ spatial_dim: int = 1):
+
+ super(GAUEncoder, self).__init__()
+ self.s = s
+ self.bias = bias
+ self.pos_enc = pos_enc
+ self.in_token_dims = in_token_dims
+ self.spatial_dim = spatial_dim
+ self.drop_path = DropPath(drop_path) \
+ if drop_path > 0. else nn.Identity()
+
+ self.e = int(in_token_dims * expansion_factor)
+ self.o = nn.Linear(self.e, out_token_dims, bias=bias)
+
+ self._build_layers()
+
+ self.ln = ScaleNorm(in_token_dims, eps=eps)
+
+ nn.init.xavier_uniform_(self.uv.weight)
+
+ if act_fn == 'SiLU':
+ assert digit_version(TORCH_VERSION) >= digit_version('1.7.0'), \
+ 'SiLU activation requires PyTorch version >= 1.7'
+
+ self.act_fn = nn.SiLU(True)
+ else:
+ self.act_fn = nn.ReLU(True)
+
+ if in_token_dims == out_token_dims:
+ self.shortcut = True
+ self.res_scale = ChannelWiseScale(in_token_dims)
+ else:
+ self.shortcut = False
+
+ self.sqrt_s = math.sqrt(s)
+ self.dropout_rate = dropout_rate
+
+ if dropout_rate > 0.:
+ self.dropout = nn.Dropout(dropout_rate)
+
+ def _build_layers(self):
+ self.uv = nn.Linear(
+ self.in_token_dims, 2 * self.e + self.s, bias=self.bias)
+ self.gamma = nn.Parameter(torch.rand((2, self.s)))
+ self.beta = nn.Parameter(torch.rand((2, self.s)))
+
+ def _forward(self, x, mask=None, pos_enc=None):
+ """GAU Forward function."""
+
+ x = self.ln(x)
+
+ # [B, K, in_token_dims] -> [B, K, e + e + s]
+ uv = self.uv(x)
+ uv = self.act_fn(uv)
+
+ # [B, K, e + e + s] -> [B, K, e], [B, K, e], [B, K, s]
+ u, v, base = torch.split(uv, [self.e, self.e, self.s], dim=-1)
+ # [B, K, 1, s] * [1, 1, 2, s] + [2, s] -> [B, K, 2, s]
+ dim = base.ndim - self.gamma.ndim + 1
+ gamma = self.gamma.view(*((1, ) * dim), *self.gamma.size())
+ beta = self.beta.view(*((1, ) * dim), *self.beta.size())
+ base = base.unsqueeze(-2) * gamma + beta
+ # [B, K, 2, s] -> [B, K, s], [B, K, s]
+ q, k = torch.unbind(base, dim=-2)
+
+ if self.pos_enc == 'rope':
+ q = SinePositionalEncoding.apply_rotary_pos_enc(
+ q, pos_enc, self.spatial_dim)
+ k = SinePositionalEncoding.apply_rotary_pos_enc(
+ k, pos_enc, self.spatial_dim)
+ elif self.pos_enc == 'add':
+ pos_enc = pos_enc.reshape(*((1, ) * (q.ndim - 2)), q.size(-2),
+ q.size(-1))
+ q = q + pos_enc
+ k = k + pos_enc
+
+ # [B, K, s].transpose(-1, -2) -> [B, s, K]
+ # [B, K, s] x [B, s, K] -> [B, K, K]
+ qk = torch.matmul(q, k.transpose(-1, -2))
+
+ # [B, K, K]
+ kernel = torch.square(F.relu(qk / self.sqrt_s))
+
+ if mask is not None:
+ kernel = kernel * mask
+
+ if self.dropout_rate > 0.:
+ kernel = self.dropout(kernel)
+
+ # [B, K, K] x [B, K, e] -> [B, K, e]
+ x = u * torch.matmul(kernel, v)
+ # [B, K, e] -> [B, K, out_token_dims]
+ x = self.o(x)
+
+ return x
+
+ def forward(self, x, mask=None, pos_enc=None):
+ """Forward function."""
+ out = self.drop_path(self._forward(x, mask=mask, pos_enc=pos_enc))
+ if self.shortcut:
+ return self.res_scale(x) + out
+ else:
+ return out
+
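+# Hedged usage sketch for GAUEncoder (the shape values and the pairing with
+# SinePositionalEncoding are illustrative assumptions):
+#
+#   gau = GAUEncoder(in_token_dims=256, out_token_dims=256, s=128,
+#                    pos_enc='rope', spatial_dim=1)
+#   pe = SinePositionalEncoding(out_channels=128, spatial_dim=1)
+#   tokens = torch.randn(2, 17, 256)
+#   out = gau(tokens, pos_enc=pe(size=17))  # (2, 17, 256), with residual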
+
+class DetrTransformerEncoder(BaseModule):
+ """Encoder of DETR.
+
+ Args:
+ num_layers (int): Number of encoder layers.
+ layer_cfg (:obj:`ConfigDict` or dict): the config of each encoder
+ layer. All the layers will share the same config.
+ num_cp (int): Number of checkpointing blocks in encoder layer.
+            Defaults to -1.
+ init_cfg (:obj:`ConfigDict` or dict, optional): the config to control
+ the initialization. Defaults to None.
+ """
+
+ def __init__(self,
+ num_layers: int,
+ layer_cfg: ConfigType,
+ num_cp: int = -1,
+ init_cfg: OptConfigType = None) -> None:
+
+ super().__init__(init_cfg=init_cfg)
+ self.num_layers = num_layers
+ self.layer_cfg = layer_cfg
+ self.num_cp = num_cp
+ assert self.num_cp <= self.num_layers
+ self._init_layers()
+
+ def _init_layers(self) -> None:
+ """Initialize encoder layers."""
+ self.layers = ModuleList([
+ DetrTransformerEncoderLayer(**self.layer_cfg)
+ for _ in range(self.num_layers)
+ ])
+
+ if self.num_cp > 0:
+ if checkpoint_wrapper is None:
+ raise NotImplementedError(
+ 'If you want to reduce GPU memory usage, \
+ please install fairscale by executing the \
+ following command: pip install fairscale.')
+ for i in range(self.num_cp):
+ self.layers[i] = checkpoint_wrapper(self.layers[i])
+
+ self.embed_dims = self.layers[0].embed_dims
+
+ def forward(self, query: Tensor, query_pos: Tensor,
+ key_padding_mask: Tensor, **kwargs) -> Tensor:
+ """Forward function of encoder.
+
+ Args:
+ query (Tensor): Input queries of encoder, has shape
+ (bs, num_queries, dim).
+ query_pos (Tensor): The positional embeddings of the queries, has
+ shape (bs, num_queries, dim).
+ key_padding_mask (Tensor): The `key_padding_mask` of `self_attn`
+ input. ByteTensor, has shape (bs, num_queries).
+
+ Returns:
+ Tensor: Has shape (bs, num_queries, dim) if `batch_first` is
+ `True`, otherwise (num_queries, bs, dim).
+ """
+ for layer in self.layers:
+ query = layer(query, query_pos, key_padding_mask, **kwargs)
+ return query
+
+
+class DetrTransformerEncoderLayer(BaseModule):
+ """Implements encoder layer in DETR transformer.
+
+ Args:
+ self_attn_cfg (:obj:`ConfigDict` or dict, optional): Config for self
+ attention.
+ ffn_cfg (:obj:`ConfigDict` or dict, optional): Config for FFN.
+ norm_cfg (:obj:`ConfigDict` or dict, optional): Config for
+ normalization layers. All the layers will share the same
+ config. Defaults to `LN`.
+ init_cfg (:obj:`ConfigDict` or dict, optional): Config to control
+ the initialization. Defaults to None.
+ """
+
+ def __init__(self,
+ self_attn_cfg: OptConfigType = dict(
+ embed_dims=256, num_heads=8, dropout=0.0),
+ ffn_cfg: OptConfigType = dict(
+ embed_dims=256,
+ feedforward_channels=1024,
+ num_fcs=2,
+ ffn_drop=0.,
+ act_cfg=dict(type='ReLU', inplace=True)),
+ norm_cfg: OptConfigType = dict(type='LN'),
+ init_cfg: OptConfigType = None) -> None:
+
+ super().__init__(init_cfg=init_cfg)
+
+ self.self_attn_cfg = self_attn_cfg
+ if 'batch_first' not in self.self_attn_cfg:
+ self.self_attn_cfg['batch_first'] = True
+ else:
+ assert self.self_attn_cfg['batch_first'] is True, 'First \
+ dimension of all DETRs in mmdet is `batch`, \
+ please set `batch_first` flag.'
+
+ self.ffn_cfg = ffn_cfg
+ self.norm_cfg = norm_cfg
+ self._init_layers()
+
+ def _init_layers(self) -> None:
+ """Initialize self-attention, FFN, and normalization."""
+ self.self_attn = MultiheadAttention(**self.self_attn_cfg)
+ self.embed_dims = self.self_attn.embed_dims
+ self.ffn = FFN(**self.ffn_cfg)
+ norms_list = [
+ build_norm_layer(self.norm_cfg, self.embed_dims)[1]
+ for _ in range(2)
+ ]
+ self.norms = ModuleList(norms_list)
+
+ def forward(self, query: Tensor, query_pos: Tensor,
+ key_padding_mask: Tensor, **kwargs) -> Tensor:
+ """Forward function of an encoder layer.
+
+ Args:
+ query (Tensor): The input query, has shape (bs, num_queries, dim).
+ query_pos (Tensor): The positional encoding for query, with
+ the same shape as `query`.
+ key_padding_mask (Tensor): The `key_padding_mask` of `self_attn`
+                input. ByteTensor, has shape (bs, num_queries).
+ Returns:
+ Tensor: forwarded results, has shape (bs, num_queries, dim).
+ """
+ query = self.self_attn(
+ query=query,
+ key=query,
+ value=query,
+ query_pos=query_pos,
+ key_pos=query_pos,
+ key_padding_mask=key_padding_mask,
+ **kwargs)
+ query = self.norms[0](query)
+ query = self.ffn(query)
+ query = self.norms[1](query)
+
+ return query
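+
+
+# Minimal construction sketch for the DETR encoder above (the config values
+# are illustrative assumptions, not defaults mandated by this patch):
+#
+#   encoder = DetrTransformerEncoder(
+#       num_layers=6,
+#       layer_cfg=dict(
+#           self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
+#           ffn_cfg=dict(embed_dims=256, feedforward_channels=1024)))
+#   # query / query_pos: (bs, num_queries, 256); key_padding_mask: (bs, num_queries)
+#   memory = encoder(query, query_pos, key_padding_mask)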
diff --git a/mmpose/structures/bbox/transforms.py b/mmpose/structures/bbox/transforms.py
index 7ddd821ace..88db311c27 100644
--- a/mmpose/structures/bbox/transforms.py
+++ b/mmpose/structures/bbox/transforms.py
@@ -369,12 +369,15 @@ def get_udp_warp_matrix(
return warp_mat
-def get_warp_matrix(center: np.ndarray,
- scale: np.ndarray,
- rot: float,
- output_size: Tuple[int, int],
- shift: Tuple[float, float] = (0., 0.),
- inv: bool = False) -> np.ndarray:
+def get_warp_matrix(
+ center: np.ndarray,
+ scale: np.ndarray,
+ rot: float,
+ output_size: Tuple[int, int],
+ shift: Tuple[float, float] = (0., 0.),
+ inv: bool = False,
+ fix_aspect_ratio: bool = True,
+) -> np.ndarray:
"""Calculate the affine transformation matrix that can warp the bbox area
in the input image to the output size.
@@ -389,6 +392,8 @@ def get_warp_matrix(center: np.ndarray,
Default (0., 0.).
inv (bool): Option to inverse the affine transform direction.
(inv=False: src->dst or inv=True: dst->src)
+ fix_aspect_ratio (bool): Whether to fix aspect ratio during transform.
+ Defaults to True.
Returns:
np.ndarray: A 2x3 transformation matrix
@@ -399,23 +404,29 @@ def get_warp_matrix(center: np.ndarray,
assert len(shift) == 2
shift = np.array(shift)
- src_w = scale[0]
- dst_w = output_size[0]
- dst_h = output_size[1]
+ src_w, src_h = scale[:2]
+ dst_w, dst_h = output_size[:2]
rot_rad = np.deg2rad(rot)
- src_dir = _rotate_point(np.array([0., src_w * -0.5]), rot_rad)
- dst_dir = np.array([0., dst_w * -0.5])
+ src_dir = _rotate_point(np.array([src_w * -0.5, 0.]), rot_rad)
+ dst_dir = np.array([dst_w * -0.5, 0.])
src = np.zeros((3, 2), dtype=np.float32)
src[0, :] = center + scale * shift
src[1, :] = center + src_dir + scale * shift
- src[2, :] = _get_3rd_point(src[0, :], src[1, :])
dst = np.zeros((3, 2), dtype=np.float32)
dst[0, :] = [dst_w * 0.5, dst_h * 0.5]
dst[1, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir
- dst[2, :] = _get_3rd_point(dst[0, :], dst[1, :])
+
+ if fix_aspect_ratio:
+ src[2, :] = _get_3rd_point(src[0, :], src[1, :])
+ dst[2, :] = _get_3rd_point(dst[0, :], dst[1, :])
+ else:
+ src_dir_2 = _rotate_point(np.array([0., src_h * -0.5]), rot_rad)
+ dst_dir_2 = np.array([0., dst_h * -0.5])
+ src[2, :] = center + src_dir_2 + scale * shift
+ dst[2, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir_2
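+
+    # Illustrative call (the numbers are assumptions): warp a 200x100 bbox
+    # region to a 192x256 output without forcing the bbox aspect ratio:
+    #   warp_mat = get_warp_matrix(
+    #       center=np.array([320., 240.]), scale=np.array([200., 100.]),
+    #       rot=0., output_size=(192, 256), fix_aspect_ratio=False)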
if inv:
warp_mat = cv2.getAffineTransform(np.float32(dst), np.float32(src))
diff --git a/mmpose/structures/keypoint/transforms.py b/mmpose/structures/keypoint/transforms.py
index b4a2aff925..aa7cebda90 100644
--- a/mmpose/structures/keypoint/transforms.py
+++ b/mmpose/structures/keypoint/transforms.py
@@ -1,5 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.
-from typing import List, Optional, Tuple
+from typing import List, Optional, Tuple, Union
import numpy as np
@@ -71,7 +71,7 @@ def flip_keypoints_custom_center(keypoints: np.ndarray,
flip_indices: List[int],
center_mode: str = 'static',
center_x: float = 0.5,
- center_index: int = 0):
+ center_index: Union[int, List] = 0):
"""Flip human joints horizontally.
Note:
@@ -91,9 +91,9 @@ def flip_keypoints_custom_center(keypoints: np.ndarray,
Defaults: ``'static'``.
center_x (float): Set the x-axis location of the flip center. Only used
when ``center_mode`` is ``'static'``. Defaults: 0.5.
- center_index (int): Set the index of the root joint, whose x location
- will be used as the flip center. Only used when ``center_mode`` is
- ``'root'``. Defaults: 0.
+ center_index (Union[int, List]): Set the index of the root joint, whose
+ x location will be used as the flip center. Only used when
+ ``center_mode`` is ``'root'``. Defaults: 0.
Returns:
np.ndarray([..., K, C]): Flipped joints.
@@ -108,8 +108,10 @@ def flip_keypoints_custom_center(keypoints: np.ndarray,
if center_mode == 'static':
x_c = center_x
elif center_mode == 'root':
- assert keypoints.shape[-2] > center_index
- x_c = keypoints[..., center_index, 0]
+ center_index = [center_index] if isinstance(center_index, int) else \
+ center_index
+ assert keypoints.shape[-2] > max(center_index)
+ x_c = keypoints[..., center_index, 0].mean(axis=-1)
keypoints_flipped = keypoints.copy()
keypoints_visible_flipped = keypoints_visible.copy()
diff --git a/mmpose/structures/pose_data_sample.py b/mmpose/structures/pose_data_sample.py
index 2c1d69034e..53f6e8990e 100644
--- a/mmpose/structures/pose_data_sample.py
+++ b/mmpose/structures/pose_data_sample.py
@@ -39,7 +39,7 @@ class PoseDataSample(BaseDataElement):
... gt_fields=gt_fields,
... metainfo=pose_meta)
>>> assert 'img_shape' in data_sample
- >>> len(data_sample.gt_intances)
+ >>> len(data_sample.gt_instances)
1
"""
diff --git a/mmpose/utils/hooks.py b/mmpose/utils/hooks.py
index b68940f2b7..4a2eb8aea2 100644
--- a/mmpose/utils/hooks.py
+++ b/mmpose/utils/hooks.py
@@ -52,7 +52,37 @@ def __exit__(self, exc_type, exc_val, exc_tb):
# using wonder's beautiful simplification:
# https://stackoverflow.com/questions/31174295/getattr-and-setattr-on-nested-objects
+def rsetattr(obj, attr, val):
+ """Set the value of a nested attribute of an object.
+
+ This function splits the attribute path and sets the value of the
+ nested attribute. If the attribute path is nested (e.g., 'x.y.z'), it
+ traverses through each attribute until it reaches the last one and sets
+ its value.
+
+ Args:
+ obj (object): The object whose attribute needs to be set.
+ attr (str): The attribute path in dot notation (e.g., 'x.y.z').
+ val (any): The value to set at the specified attribute path.
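+
+    Example (illustrative sketch, not part of the original docstring):
+        >>> from types import SimpleNamespace
+        >>> cfg = SimpleNamespace(model=SimpleNamespace())
+        >>> rsetattr(cfg, 'model.depth', 50)
+        >>> cfg.model.depth
+        50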
+ """
+ pre, _, post = attr.rpartition('.')
+ return setattr(rgetattr(obj, pre) if pre else obj, post, val)
+
+
def rgetattr(obj, attr, *args):
+ """Recursively get a nested attribute of an object.
+
+ This function splits the attribute path and retrieves the value of the
+ nested attribute. If the attribute path is nested (e.g., 'x.y.z'), it
+ traverses through each attribute. If an attribute in the path does not
+ exist, it returns the value specified as the third argument.
+
+ Args:
+ obj (object): The object whose attribute needs to be retrieved.
+ attr (str): The attribute path in dot notation (e.g., 'x.y.z').
+ *args (any): Optional default value to return if the attribute
+ does not exist.
+ """
def _getattr(obj, attr):
return getattr(obj, attr, *args)
diff --git a/mmpose/version.py b/mmpose/version.py
index 8a6d7e40d5..2d6bfe1239 100644
--- a/mmpose/version.py
+++ b/mmpose/version.py
@@ -1,6 +1,6 @@
# Copyright (c) Open-MMLab. All rights reserved.
-__version__ = '1.2.0'
+__version__ = '1.3.0'
short_version = __version__
diff --git a/mmpose/visualization/local_visualizer_3d.py b/mmpose/visualization/local_visualizer_3d.py
index 1b757a84e5..09603dba80 100644
--- a/mmpose/visualization/local_visualizer_3d.py
+++ b/mmpose/visualization/local_visualizer_3d.py
@@ -280,23 +280,27 @@ def _draw_3d_instances_kpts(keypoints,
'data sample must contain '
'"lifting_target" or "keypoints_gt"')
- _draw_3d_instances_kpts(keypoints, scores, keypoints_visible, 2,
- show_kpt_idx, 'Ground Truth')
+ if scores_2d is None:
+ scores_2d = np.ones(keypoints.shape[:-1])
+
+ _draw_3d_instances_kpts(keypoints, scores, scores_2d,
+ keypoints_visible, 2, show_kpt_idx,
+ 'Ground Truth')
# convert figure to numpy array
fig.tight_layout()
fig.canvas.draw()
- pred_img_data = fig.canvas.tostring_rgb()
pred_img_data = np.frombuffer(
fig.canvas.tostring_rgb(), dtype=np.uint8)
if not pred_img_data.any():
pred_img_data = np.full((vis_height, vis_width, 3), 255)
else:
- pred_img_data = pred_img_data.reshape(vis_height,
- vis_width * num_instances,
- -1)
+ width, height = fig.get_size_inches() * fig.get_dpi()
+ pred_img_data = pred_img_data.reshape(
+ int(height),
+ int(width) * num_instances, 3)
plt.close(fig)
diff --git a/model-index.yml b/model-index.yml
index 0ed87b91af..363ab89f08 100644
--- a/model-index.yml
+++ b/model-index.yml
@@ -1,47 +1,51 @@
Import:
- configs/animal_2d_keypoint/rtmpose/ap10k/rtmpose_ap10k.yml
+- configs/animal_2d_keypoint/topdown_heatmap/ak/hrnet_animalkingdom.yml
- configs/animal_2d_keypoint/topdown_heatmap/animalpose/hrnet_animalpose.yml
- configs/animal_2d_keypoint/topdown_heatmap/animalpose/resnet_animalpose.yml
-- configs/animal_2d_keypoint/topdown_heatmap/ap10k/resnet_ap10k.yml
-- configs/animal_2d_keypoint/topdown_heatmap/ap10k/hrnet_ap10k.yml
- configs/animal_2d_keypoint/topdown_heatmap/ap10k/cspnext_udp_ap10k.yml
+- configs/animal_2d_keypoint/topdown_heatmap/ap10k/hrnet_ap10k.yml
+- configs/animal_2d_keypoint/topdown_heatmap/ap10k/resnet_ap10k.yml
- configs/animal_2d_keypoint/topdown_heatmap/locust/resnet_locust.yml
- configs/animal_2d_keypoint/topdown_heatmap/zebra/resnet_zebra.yml
+- configs/body_2d_keypoint/associative_embedding/coco/hrnet_coco.yml
- configs/body_2d_keypoint/cid/coco/hrnet_coco.yml
- configs/body_2d_keypoint/dekr/coco/hrnet_coco.yml
-- configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-coco.yml
-- configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-halpe26.yml
- configs/body_2d_keypoint/dekr/crowdpose/hrnet_crowdpose.yml
- configs/body_2d_keypoint/edpose/coco/edpose_coco.yml
-- configs/body_2d_keypoint/integral_regression/coco/resnet_ipr_coco.yml
-- configs/body_2d_keypoint/integral_regression/coco/resnet_dsnt_coco.yml
- configs/body_2d_keypoint/integral_regression/coco/resnet_debias_coco.yml
+- configs/body_2d_keypoint/integral_regression/coco/resnet_dsnt_coco.yml
+- configs/body_2d_keypoint/integral_regression/coco/resnet_ipr_coco.yml
+- configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-coco.yml
+- configs/body_2d_keypoint/rtmpose/body8/rtmpose_body8-halpe26.yml
- configs/body_2d_keypoint/rtmpose/coco/rtmpose_coco.yml
- configs/body_2d_keypoint/rtmpose/crowdpose/rtmpose_crowdpose.yml
-- configs/body_2d_keypoint/rtmpose/mpii/rtmpose_mpii.yml
- configs/body_2d_keypoint/rtmpose/humanart/rtmpose_humanart.yml
+- configs/body_2d_keypoint/rtmpose/mpii/rtmpose_mpii.yml
+- configs/body_2d_keypoint/rtmo/coco/rtmo_coco.yml
+- configs/body_2d_keypoint/rtmo/body7/rtmo_body7.yml
+- configs/body_2d_keypoint/rtmo/crowdpose/rtmo_crowdpose.yml
- configs/body_2d_keypoint/simcc/coco/mobilenetv2_coco.yml
- configs/body_2d_keypoint/simcc/coco/resnet_coco.yml
- configs/body_2d_keypoint/simcc/coco/vipnas_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/aic/hrnet_aic.yml
- configs/body_2d_keypoint/topdown_heatmap/aic/resnet_aic.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/hourglass_coco.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/hrnet_coco.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/litehrnet_coco.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/mspn_coco.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/vitpose_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/alexnet_coco.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/resnet_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/cpm_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/cspnext_udp_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/hourglass_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/hrformer_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/hrnet_augmentation_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/hrnet_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/hrnet_dark_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/hrnet_udp_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/litehrnet_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/mobilenetv2_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/mspn_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/pvt_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/resnest_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/resnet_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/resnet_dark_coco.yml
-- configs/body_2d_keypoint/topdown_heatmap/coco/cspnext_udp_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/resnetv1d_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/resnext_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/rsn_coco.yml
@@ -52,14 +56,18 @@ Import:
- configs/body_2d_keypoint/topdown_heatmap/coco/swin_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/vgg_coco.yml
- configs/body_2d_keypoint/topdown_heatmap/coco/vipnas_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/coco/vitpose_coco.yml
+- configs/body_2d_keypoint/topdown_heatmap/crowdpose/cspnext_udp_crowdpose.yml
- configs/body_2d_keypoint/topdown_heatmap/crowdpose/hrnet_crowdpose.yml
- configs/body_2d_keypoint/topdown_heatmap/crowdpose/resnet_crowdpose.yml
-- configs/body_2d_keypoint/topdown_heatmap/crowdpose/cspnext_udp_crowdpose.yml
+- configs/body_2d_keypoint/topdown_heatmap/exlpose/hrnet_exlpose.yml
+- configs/body_2d_keypoint/topdown_heatmap/humanart/hrnet_humanart.yml
+- configs/body_2d_keypoint/topdown_heatmap/humanart/vitpose_humanart.yml
- configs/body_2d_keypoint/topdown_heatmap/jhmdb/cpm_jhmdb.yml
- configs/body_2d_keypoint/topdown_heatmap/jhmdb/resnet_jhmdb.yml
- configs/body_2d_keypoint/topdown_heatmap/mpii/cpm_mpii.yml
-- configs/body_2d_keypoint/topdown_heatmap/mpii/hourglass_mpii.yml
- configs/body_2d_keypoint/topdown_heatmap/mpii/cspnext_udp_mpii.yml
+- configs/body_2d_keypoint/topdown_heatmap/mpii/hourglass_mpii.yml
- configs/body_2d_keypoint/topdown_heatmap/mpii/hrnet_dark_mpii.yml
- configs/body_2d_keypoint/topdown_heatmap/mpii/hrnet_mpii.yml
- configs/body_2d_keypoint/topdown_heatmap/mpii/litehrnet_mpii.yml
@@ -73,19 +81,21 @@ Import:
- configs/body_2d_keypoint/topdown_heatmap/mpii/shufflenetv2_mpii.yml
- configs/body_2d_keypoint/topdown_heatmap/posetrack18/hrnet_posetrack18.yml
- configs/body_2d_keypoint/topdown_heatmap/posetrack18/resnet_posetrack18.yml
+- configs/body_2d_keypoint/topdown_regression/coco/mobilenetv2_rle_coco.yml
- configs/body_2d_keypoint/topdown_regression/coco/resnet_coco.yml
- configs/body_2d_keypoint/topdown_regression/coco/resnet_rle_coco.yml
-- configs/body_2d_keypoint/topdown_regression/coco/mobilenetv2_rle_coco.yml
- configs/body_2d_keypoint/topdown_regression/mpii/resnet_mpii.yml
- configs/body_2d_keypoint/topdown_regression/mpii/resnet_rle_mpii.yml
- configs/body_2d_keypoint/yoloxpose/coco/yoloxpose_coco.yml
- configs/body_3d_keypoint/image_pose_lift/h36m/simplebaseline3d_h36m.yml
-- configs/body_3d_keypoint/video_pose_lift/h36m/videopose3d_h36m.yml
- configs/body_3d_keypoint/motionbert/h36m/motionbert_h36m.yml
+- configs/body_3d_keypoint/video_pose_lift/h36m/videopose3d_h36m.yml
- configs/face_2d_keypoint/rtmpose/coco_wholebody_face/rtmpose_coco_wholebody_face.yml
- configs/face_2d_keypoint/rtmpose/face6/rtmpose_face6.yml
+- configs/face_2d_keypoint/rtmpose/lapa/rtmpose_lapa.yml
- configs/face_2d_keypoint/rtmpose/wflw/rtmpose_wflw.yml
- configs/face_2d_keypoint/topdown_heatmap/300w/hrnetv2_300w.yml
+- configs/face_2d_keypoint/topdown_heatmap/300wlp/hrnetv2_300wlp.yml
- configs/face_2d_keypoint/topdown_heatmap/aflw/hrnetv2_aflw.yml
- configs/face_2d_keypoint/topdown_heatmap/aflw/hrnetv2_dark_aflw.yml
- configs/face_2d_keypoint/topdown_heatmap/coco_wholebody_face/hourglass_coco_wholebody_face.yml
@@ -95,9 +105,15 @@ Import:
- configs/face_2d_keypoint/topdown_heatmap/coco_wholebody_face/resnet_coco_wholebody_face.yml
- configs/face_2d_keypoint/topdown_heatmap/coco_wholebody_face/scnet_coco_wholebody_face.yml
- configs/face_2d_keypoint/topdown_heatmap/cofw/hrnetv2_cofw.yml
-- configs/face_2d_keypoint/topdown_heatmap/wflw/hrnetv2_wflw.yml
-- configs/face_2d_keypoint/topdown_heatmap/wflw/hrnetv2_dark_wflw.yml
- configs/face_2d_keypoint/topdown_heatmap/wflw/hrnetv2_awing_wflw.yml
+- configs/face_2d_keypoint/topdown_heatmap/wflw/hrnetv2_dark_wflw.yml
+- configs/face_2d_keypoint/topdown_heatmap/wflw/hrnetv2_wflw.yml
+- configs/face_2d_keypoint/topdown_regression/wflw/resnet_softwingloss_wflw.yml
+- configs/face_2d_keypoint/topdown_regression/wflw/resnet_wflw.yml
+- configs/face_2d_keypoint/topdown_regression/wflw/resnet_wingloss_wflw.yml
+- configs/fashion_2d_keypoint/topdown_heatmap/deepfashion/hrnet_deepfashion.yml
+- configs/fashion_2d_keypoint/topdown_heatmap/deepfashion/resnet_deepfashion.yml
+- configs/fashion_2d_keypoint/topdown_heatmap/deepfashion2/res50_deepfasion2.yml
- configs/hand_2d_keypoint/rtmpose/coco_wholebody_hand/rtmpose_coco_wholebody_hand.yml
- configs/hand_2d_keypoint/rtmpose/hand5/rtmpose_hand5.yml
- configs/hand_2d_keypoint/topdown_heatmap/coco_wholebody_hand/hourglass_coco_wholebody_hand.yml
@@ -108,11 +124,11 @@ Import:
- configs/hand_2d_keypoint/topdown_heatmap/coco_wholebody_hand/resnet_coco_wholebody_hand.yml
- configs/hand_2d_keypoint/topdown_heatmap/coco_wholebody_hand/scnet_coco_wholebody_hand.yml
- configs/hand_2d_keypoint/topdown_heatmap/freihand2d/resnet_freihand2d.yml
-- configs/hand_2d_keypoint/topdown_heatmap/onehand10k/resnet_onehand10k.yml
- configs/hand_2d_keypoint/topdown_heatmap/onehand10k/hrnetv2_dark_onehand10k.yml
- configs/hand_2d_keypoint/topdown_heatmap/onehand10k/hrnetv2_onehand10k.yml
- configs/hand_2d_keypoint/topdown_heatmap/onehand10k/hrnetv2_udp_onehand10k.yml
- configs/hand_2d_keypoint/topdown_heatmap/onehand10k/mobilenetv2_onehand10k.yml
+- configs/hand_2d_keypoint/topdown_heatmap/onehand10k/resnet_onehand10k.yml
- configs/hand_2d_keypoint/topdown_heatmap/rhd2d/hrnetv2_dark_rhd2d.yml
- configs/hand_2d_keypoint/topdown_heatmap/rhd2d/hrnetv2_rhd2d.yml
- configs/hand_2d_keypoint/topdown_heatmap/rhd2d/hrnetv2_udp_rhd2d.yml
@@ -122,12 +138,11 @@ Import:
- configs/hand_2d_keypoint/topdown_regression/rhd2d/resnet_rhd2d.yml
- configs/hand_3d_keypoint/internet/interhand3d/internet_interhand3d.yml
- configs/wholebody_2d_keypoint/rtmpose/coco-wholebody/rtmpose_coco-wholebody.yml
+- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/cspnext_udp_coco-wholebody.yml
- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/hrnet_coco-wholebody.yml
- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/hrnet_dark_coco-wholebody.yml
- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/resnet_coco-wholebody.yml
- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/vipnas_coco-wholebody.yml
- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/vipnas_dark_coco-wholebody.yml
-- configs/wholebody_2d_keypoint/topdown_heatmap/coco-wholebody/cspnext_udp_coco-wholebody.yml
-- configs/fashion_2d_keypoint/topdown_heatmap/deepfashion2/res50_deepfasion2.yml
-- configs/fashion_2d_keypoint/topdown_heatmap/deepfashion/hrnet_deepfashion.yml
-- configs/fashion_2d_keypoint/topdown_heatmap/deepfashion/resnet_deepfashion.yml
+- configs/wholebody_2d_keypoint/topdown_heatmap/ubody2d/hrnet_coco-wholebody.yml
+- configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw_cocktail14.yml
diff --git a/projects/README.md b/projects/README.md
index 4bdc500e48..a23696640e 100644
--- a/projects/README.md
+++ b/projects/README.md
@@ -40,6 +40,22 @@ We also provide some documentation listed below to help you get started:
+- **[🎳RTMO](./rtmo)**: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
+
+  TRY IT NOW
+
+- **[♾️PoseAnything](./pose_anything/)**: A Graph-Based Approach for Category-Agnostic Pose Estimation
+
- **[:art:MMPose4AIGC](./mmpose4aigc)**: Guide AI image generation with MMPose
diff --git a/projects/pose_anything/README.md b/projects/pose_anything/README.md
new file mode 100644
index 0000000000..c00b4a914e
--- /dev/null
+++ b/projects/pose_anything/README.md
@@ -0,0 +1,82 @@
+# Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation
+
+## [Paper](https://arxiv.org/pdf/2311.17891.pdf) | [Project Page](https://orhir.github.io/pose-anything/) | [Official Repo](https://github.com/orhir/PoseAnything)
+
+By [Or Hirschorn](https://scholar.google.co.il/citations?user=GgFuT_QAAAAJ&hl=iw&oi=ao)
+and [Shai Avidan](https://scholar.google.co.il/citations?hl=iw&user=hpItE1QAAAAJ)
+
+![Teaser Figure](https://github.com/open-mmlab/mmpose/assets/26127467/96480360-1a80-41f6-88d3-d6c747506a7e)
+
+## Introduction
+
+We present a novel approach to CAPE that leverages the inherent geometrical
+relations between keypoints through a newly designed Graph Transformer Decoder.
+By capturing and incorporating this crucial structural information, our method
+enhances the accuracy of keypoint localization, marking a significant departure
+from conventional CAPE techniques that treat keypoints as isolated entities.
+
+## Citation
+
+If you find this useful, please cite this work as follows:
+
+```bibtex
+@misc{hirschorn2023pose,
+ title={Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation},
+ author={Or Hirschorn and Shai Avidan},
+ year={2023},
+ eprint={2311.17891},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+## Getting Started
+
+📣 Pose Anything is available on OpenXLab now. [\[Try it online\]](https://openxlab.org.cn/apps/detail/orhir/Pose-Anything)
+
+### Install Dependencies
+
+We recommend using a virtual environment for running our code.
+After installing MMPose, you can install the rest of the dependencies by
+running:
+
+```shell
+pip install timm
+```
+
+### Pretrained Weights
+
+The full list of pretrained models can be found in
+the [Official Repo](https://github.com/orhir/PoseAnything).
+
+## Demo on Custom Images
+
+***A bigger and more accurate version of the model - COMING SOON!***
+
+Download
+the [pretrained model](https://drive.google.com/file/d/1RT1Q8AMEa1kj6k9ZqrtWIKyuR4Jn4Pqc/view?usp=drive_link)
+and run:
+
+```shell
+python demo.py --support [path_to_support_image] --query [path_to_query_image] --config configs/demo_b.py --checkpoint [path_to_pretrained_ckpt]
+```
+
+***Note:*** The demo code supports any config with a suitable checkpoint file.
+More pre-trained models can be found in the official repo.
+
+## Training and Testing on MP-100 Dataset
+
+**We currently only support the demo on custom images through the MMPose repo.**
+
+**For training and testing on the MP-100 dataset, please refer to
+the [Official Repo](https://github.com/orhir/PoseAnything).**
+
+## Acknowledgement
+
+Our code is based on code from:
+
+- [CapeFormer](https://github.com/flyinglynx/CapeFormer)
+
+## License
+
+This project is released under the Apache 2.0 license.
diff --git a/projects/pose_anything/configs/demo.py b/projects/pose_anything/configs/demo.py
new file mode 100644
index 0000000000..70a468a8f4
--- /dev/null
+++ b/projects/pose_anything/configs/demo.py
@@ -0,0 +1,207 @@
+log_level = 'INFO'
+load_from = None
+resume_from = None
+dist_params = dict(backend='nccl')
+workflow = [('train', 1)]
+checkpoint_config = dict(interval=20)
+evaluation = dict(
+ interval=25,
+ metric=['PCK', 'NME', 'AUC', 'EPE'],
+ key_indicator='PCK',
+ gpu_collect=True,
+ res_folder='')
+optimizer = dict(
+ type='Adam',
+ lr=1e-5,
+)
+
+optimizer_config = dict(grad_clip=None)
+# learning policy
+lr_config = dict(
+ policy='step',
+ warmup='linear',
+ warmup_iters=1000,
+ warmup_ratio=0.001,
+ step=[160, 180])
+total_epochs = 200
+log_config = dict(
+ interval=50,
+ hooks=[dict(type='TextLoggerHook'),
+ dict(type='TensorboardLoggerHook')])
+
+channel_cfg = dict(
+ num_output_channels=1,
+ dataset_joints=1,
+ dataset_channel=[
+ [
+ 0,
+ ],
+ ],
+ inference_channel=[
+ 0,
+ ],
+ max_kpt_num=100)
+
+# model settings
+model = dict(
+ type='TransformerPoseTwoStage',
+ pretrained='swinv2_large',
+ encoder_config=dict(
+ type='SwinTransformerV2',
+ embed_dim=192,
+ depths=[2, 2, 18, 2],
+ num_heads=[6, 12, 24, 48],
+ window_size=16,
+ pretrained_window_sizes=[12, 12, 12, 6],
+ drop_path_rate=0.2,
+ img_size=256,
+ ),
+ keypoint_head=dict(
+ type='TwoStageHead',
+ in_channels=1536,
+ transformer=dict(
+ type='TwoStageSupportRefineTransformer',
+ d_model=384,
+ nhead=8,
+ num_encoder_layers=3,
+ num_decoder_layers=3,
+ dim_feedforward=1536,
+ dropout=0.1,
+ similarity_proj_dim=384,
+ dynamic_proj_dim=192,
+ activation='relu',
+ normalize_before=False,
+ return_intermediate_dec=True),
+ share_kpt_branch=False,
+ num_decoder_layer=3,
+ with_heatmap_loss=True,
+ support_pos_embed=False,
+ heatmap_loss_weight=2.0,
+ skeleton_loss_weight=0.02,
+ num_samples=0,
+ support_embedding_type='fixed',
+ num_support=100,
+ support_order_dropout=-1,
+ positional_encoding=dict(
+ type='SinePositionalEncoding', num_feats=192, normalize=True)),
+ # training and testing settings
+ train_cfg=dict(),
+ test_cfg=dict(
+ flip_test=False,
+ post_process='default',
+ shift_heatmap=True,
+ modulate_kernel=11))
+
+data_cfg = dict(
+ image_size=[256, 256],
+ heatmap_size=[64, 64],
+ num_output_channels=channel_cfg['num_output_channels'],
+ num_joints=channel_cfg['dataset_joints'],
+ dataset_channel=channel_cfg['dataset_channel'],
+ inference_channel=channel_cfg['inference_channel'])
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='TopDownGetRandomScaleRotation', rot_factor=15,
+ scale_factor=0.15),
+ dict(type='TopDownAffineFewShot'),
+ dict(type='ToTensor'),
+ dict(
+ type='NormalizeTensor',
+ mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]),
+ dict(type='TopDownGenerateTargetFewShot', sigma=1),
+ dict(
+ type='Collect',
+ keys=['img', 'target', 'target_weight'],
+ meta_keys=[
+ 'image_file',
+ 'joints_3d',
+ 'joints_3d_visible',
+ 'center',
+ 'scale',
+ 'rotation',
+ 'bbox_score',
+ 'flip_pairs',
+ 'category_id',
+ 'skeleton',
+ ]),
+]
+
+valid_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='TopDownAffineFewShot'),
+ dict(type='ToTensor'),
+ dict(
+ type='NormalizeTensor',
+ mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]),
+ dict(type='TopDownGenerateTargetFewShot', sigma=1),
+ dict(
+ type='Collect',
+ keys=['img', 'target', 'target_weight'],
+ meta_keys=[
+ 'image_file',
+ 'joints_3d',
+ 'joints_3d_visible',
+ 'center',
+ 'scale',
+ 'rotation',
+ 'bbox_score',
+ 'flip_pairs',
+ 'category_id',
+ 'skeleton',
+ ]),
+]
+
+test_pipeline = valid_pipeline
+
+data_root = 'data/mp100'
+data = dict(
+ samples_per_gpu=8,
+ workers_per_gpu=8,
+ train=dict(
+ type='TransformerPoseDataset',
+ ann_file=f'{data_root}/annotations/mp100_all.json',
+ img_prefix=f'{data_root}/images/',
+ # img_prefix=f'{data_root}',
+ data_cfg=data_cfg,
+ valid_class_ids=None,
+ max_kpt_num=channel_cfg['max_kpt_num'],
+ num_shots=1,
+ pipeline=train_pipeline),
+ val=dict(
+ type='TransformerPoseDataset',
+ ann_file=f'{data_root}/annotations/mp100_split1_val.json',
+ img_prefix=f'{data_root}/images/',
+ # img_prefix=f'{data_root}',
+ data_cfg=data_cfg,
+ valid_class_ids=None,
+ max_kpt_num=channel_cfg['max_kpt_num'],
+ num_shots=1,
+ num_queries=15,
+ num_episodes=100,
+ pipeline=valid_pipeline),
+ test=dict(
+ type='TestPoseDataset',
+ ann_file=f'{data_root}/annotations/mp100_split1_test.json',
+ img_prefix=f'{data_root}/images/',
+ # img_prefix=f'{data_root}',
+ data_cfg=data_cfg,
+ valid_class_ids=None,
+ max_kpt_num=channel_cfg['max_kpt_num'],
+ num_shots=1,
+ num_queries=15,
+ num_episodes=200,
+ pck_threshold_list=[0.05, 0.10, 0.15, 0.2, 0.25],
+ pipeline=test_pipeline),
+)
+vis_backends = [
+ dict(type='LocalVisBackend'),
+ dict(type='TensorboardVisBackend'),
+]
+visualizer = dict(
+ type='PoseLocalVisualizer', vis_backends=vis_backends, name='visualizer')
+
+shuffle_cfg = dict(interval=1)
diff --git a/projects/pose_anything/configs/demo_b.py b/projects/pose_anything/configs/demo_b.py
new file mode 100644
index 0000000000..2b7d8b30ff
--- /dev/null
+++ b/projects/pose_anything/configs/demo_b.py
@@ -0,0 +1,205 @@
+custom_imports = dict(imports=['models'])
+
+log_level = 'INFO'
+load_from = None
+resume_from = None
+dist_params = dict(backend='nccl')
+workflow = [('train', 1)]
+checkpoint_config = dict(interval=20)
+evaluation = dict(
+ interval=25,
+ metric=['PCK', 'NME', 'AUC', 'EPE'],
+ key_indicator='PCK',
+ gpu_collect=True,
+ res_folder='')
+optimizer = dict(
+ type='Adam',
+ lr=1e-5,
+)
+
+optimizer_config = dict(grad_clip=None)
+# learning policy
+lr_config = dict(
+ policy='step',
+ warmup='linear',
+ warmup_iters=1000,
+ warmup_ratio=0.001,
+ step=[160, 180])
+total_epochs = 200
+log_config = dict(
+ interval=50,
+ hooks=[dict(type='TextLoggerHook'),
+ dict(type='TensorboardLoggerHook')])
+
+channel_cfg = dict(
+ num_output_channels=1,
+ dataset_joints=1,
+ dataset_channel=[
+ [
+ 0,
+ ],
+ ],
+ inference_channel=[
+ 0,
+ ],
+ max_kpt_num=100)
+
+# model settings
+model = dict(
+ type='PoseAnythingModel',
+ pretrained='swinv2_base',
+ encoder_config=dict(
+ type='SwinTransformerV2',
+ embed_dim=128,
+ depths=[2, 2, 18, 2],
+ num_heads=[4, 8, 16, 32],
+ window_size=14,
+ pretrained_window_sizes=[12, 12, 12, 6],
+ drop_path_rate=0.1,
+ img_size=224,
+ ),
+ keypoint_head=dict(
+ type='PoseHead',
+ in_channels=1024,
+ transformer=dict(
+ type='EncoderDecoder',
+ d_model=256,
+ nhead=8,
+ num_encoder_layers=3,
+ num_decoder_layers=3,
+ graph_decoder='pre',
+ dim_feedforward=1024,
+ dropout=0.1,
+ similarity_proj_dim=256,
+ dynamic_proj_dim=128,
+ activation='relu',
+ normalize_before=False,
+ return_intermediate_dec=True),
+ share_kpt_branch=False,
+ num_decoder_layer=3,
+ with_heatmap_loss=True,
+ heatmap_loss_weight=2.0,
+ support_order_dropout=-1,
+ positional_encoding=dict(
+ type='SinePositionalEncoding', num_feats=128, normalize=True)),
+ # training and testing settings
+ train_cfg=dict(),
+ test_cfg=dict(
+ flip_test=False,
+ post_process='default',
+ shift_heatmap=True,
+ modulate_kernel=11))
+
+data_cfg = dict(
+ image_size=[224, 224],
+ heatmap_size=[64, 64],
+ num_output_channels=channel_cfg['num_output_channels'],
+ num_joints=channel_cfg['dataset_joints'],
+ dataset_channel=channel_cfg['dataset_channel'],
+ inference_channel=channel_cfg['inference_channel'])
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='TopDownGetRandomScaleRotation', rot_factor=15,
+ scale_factor=0.15),
+ dict(type='TopDownAffineFewShot'),
+ dict(type='ToTensor'),
+ dict(
+ type='NormalizeTensor',
+ mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]),
+ dict(type='TopDownGenerateTargetFewShot', sigma=1),
+ dict(
+ type='Collect',
+ keys=['img', 'target', 'target_weight'],
+ meta_keys=[
+ 'image_file',
+ 'joints_3d',
+ 'joints_3d_visible',
+ 'center',
+ 'scale',
+ 'rotation',
+ 'bbox_score',
+ 'flip_pairs',
+ 'category_id',
+ 'skeleton',
+ ]),
+]
+
+valid_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='TopDownAffineFewShot'),
+ dict(type='ToTensor'),
+ dict(
+ type='NormalizeTensor',
+ mean=[0.485, 0.456, 0.406],
+ std=[0.229, 0.224, 0.225]),
+ dict(type='TopDownGenerateTargetFewShot', sigma=1),
+ dict(
+ type='Collect',
+ keys=['img', 'target', 'target_weight'],
+ meta_keys=[
+ 'image_file',
+ 'joints_3d',
+ 'joints_3d_visible',
+ 'center',
+ 'scale',
+ 'rotation',
+ 'bbox_score',
+ 'flip_pairs',
+ 'category_id',
+ 'skeleton',
+ ]),
+]
+
+test_pipeline = valid_pipeline
+
+data_root = 'data/mp100'
+data = dict(
+ samples_per_gpu=8,
+ workers_per_gpu=8,
+ train=dict(
+ type='TransformerPoseDataset',
+ ann_file=f'{data_root}/annotations/mp100_split1_train.json',
+ img_prefix=f'{data_root}/images/',
+ # img_prefix=f'{data_root}',
+ data_cfg=data_cfg,
+ valid_class_ids=None,
+ max_kpt_num=channel_cfg['max_kpt_num'],
+ num_shots=1,
+ pipeline=train_pipeline),
+ val=dict(
+ type='TransformerPoseDataset',
+ ann_file=f'{data_root}/annotations/mp100_split1_val.json',
+ img_prefix=f'{data_root}/images/',
+ # img_prefix=f'{data_root}',
+ data_cfg=data_cfg,
+ valid_class_ids=None,
+ max_kpt_num=channel_cfg['max_kpt_num'],
+ num_shots=1,
+ num_queries=15,
+ num_episodes=100,
+ pipeline=valid_pipeline),
+ test=dict(
+ type='TestPoseDataset',
+ ann_file=f'{data_root}/annotations/mp100_split1_test.json',
+ img_prefix=f'{data_root}/images/',
+ # img_prefix=f'{data_root}',
+ data_cfg=data_cfg,
+ valid_class_ids=None,
+ max_kpt_num=channel_cfg['max_kpt_num'],
+ num_shots=1,
+ num_queries=15,
+ num_episodes=200,
+ pck_threshold_list=[0.05, 0.10, 0.15, 0.2, 0.25],
+ pipeline=test_pipeline),
+)
+vis_backends = [
+ dict(type='LocalVisBackend'),
+ dict(type='TensorboardVisBackend'),
+]
+visualizer = dict(
+ type='PoseLocalVisualizer', vis_backends=vis_backends, name='visualizer')
+
+shuffle_cfg = dict(interval=1)
diff --git a/projects/pose_anything/datasets/__init__.py b/projects/pose_anything/datasets/__init__.py
new file mode 100644
index 0000000000..86bb3ce651
--- /dev/null
+++ b/projects/pose_anything/datasets/__init__.py
@@ -0,0 +1 @@
+from .pipelines import * # noqa
diff --git a/projects/pose_anything/datasets/builder.py b/projects/pose_anything/datasets/builder.py
new file mode 100644
index 0000000000..9f44687143
--- /dev/null
+++ b/projects/pose_anything/datasets/builder.py
@@ -0,0 +1,56 @@
+from mmengine.dataset import RepeatDataset
+from mmengine.registry import build_from_cfg
+from torch.utils.data.dataset import ConcatDataset
+
+from mmpose.datasets.builder import DATASETS
+
+
+def _concat_cfg(cfg):
+ replace = ['ann_file', 'img_prefix']
+ channels = ['num_joints', 'dataset_channel']
+ concat_cfg = []
+ for i in range(len(cfg['type'])):
+ cfg_tmp = cfg.deepcopy()
+ cfg_tmp['type'] = cfg['type'][i]
+ for item in replace:
+ assert item in cfg_tmp
+ assert len(cfg['type']) == len(cfg[item]), (cfg[item])
+ cfg_tmp[item] = cfg[item][i]
+ for item in channels:
+ assert item in cfg_tmp['data_cfg']
+ assert len(cfg['type']) == len(cfg['data_cfg'][item])
+ cfg_tmp['data_cfg'][item] = cfg['data_cfg'][item][i]
+ concat_cfg.append(cfg_tmp)
+ return concat_cfg
+
+
+def _check_valid(cfg):
+ replace = ['num_joints', 'dataset_channel']
+ if isinstance(cfg['data_cfg'][replace[0]], (list, tuple)):
+ for item in replace:
+ cfg['data_cfg'][item] = cfg['data_cfg'][item][0]
+ return cfg
+
+
+def build_dataset(cfg, default_args=None):
+ """Build a dataset from config dict.
+
+ Args:
+ cfg (dict): Config dict. It should at least contain the key "type".
+ default_args (dict, optional): Default initialization arguments.
+ Default: None.
+
+ Returns:
+ Dataset: The constructed dataset.
+ """
+ if isinstance(cfg['type'],
+ (list, tuple)): # In training, type=TransformerPoseDataset
+ dataset = ConcatDataset(
+ [build_dataset(c, default_args) for c in _concat_cfg(cfg)])
+ elif cfg['type'] == 'RepeatDataset':
+ dataset = RepeatDataset(
+ build_dataset(cfg['dataset'], default_args), cfg['times'])
+ else:
+        cfg = _check_valid(cfg)
+ dataset = build_from_cfg(cfg, DATASETS, default_args)
+ return dataset
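+
+
+# Hedged example of a config this builder could consume (the dataset type,
+# paths and pipeline names are illustrative, mirroring the demo configs above):
+#
+#   cfg = dict(
+#       type='TransformerPoseDataset',
+#       ann_file='data/mp100/annotations/mp100_split1_train.json',
+#       img_prefix='data/mp100/images/',
+#       data_cfg=data_cfg,
+#       valid_class_ids=None,
+#       max_kpt_num=100,
+#       num_shots=1,
+#       pipeline=train_pipeline)
+#   dataset = build_dataset(cfg)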
diff --git a/projects/pose_anything/datasets/datasets/__init__.py b/projects/pose_anything/datasets/datasets/__init__.py
new file mode 100644
index 0000000000..124dcc3b15
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/__init__.py
@@ -0,0 +1,7 @@
+from .mp100 import (FewShotBaseDataset, FewShotKeypointDataset,
+ TransformerBaseDataset, TransformerPoseDataset)
+
+__all__ = [
+ 'FewShotBaseDataset', 'FewShotKeypointDataset', 'TransformerBaseDataset',
+ 'TransformerPoseDataset'
+]
diff --git a/projects/pose_anything/datasets/datasets/mp100/__init__.py b/projects/pose_anything/datasets/datasets/mp100/__init__.py
new file mode 100644
index 0000000000..229353517b
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/__init__.py
@@ -0,0 +1,11 @@
+from .fewshot_base_dataset import FewShotBaseDataset
+from .fewshot_dataset import FewShotKeypointDataset
+from .test_base_dataset import TestBaseDataset
+from .test_dataset import TestPoseDataset
+from .transformer_base_dataset import TransformerBaseDataset
+from .transformer_dataset import TransformerPoseDataset
+
+__all__ = [
+ 'FewShotKeypointDataset', 'FewShotBaseDataset', 'TransformerPoseDataset',
+ 'TransformerBaseDataset', 'TestBaseDataset', 'TestPoseDataset'
+]
diff --git a/projects/pose_anything/datasets/datasets/mp100/fewshot_base_dataset.py b/projects/pose_anything/datasets/datasets/mp100/fewshot_base_dataset.py
new file mode 100644
index 0000000000..2746a0a35c
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/fewshot_base_dataset.py
@@ -0,0 +1,235 @@
+import copy
+from abc import ABCMeta, abstractmethod
+
+import json_tricks as json
+import numpy as np
+from mmcv.parallel import DataContainer as DC
+from torch.utils.data import Dataset
+
+from mmpose.core.evaluation.top_down_eval import keypoint_pck_accuracy
+from mmpose.datasets import DATASETS
+from mmpose.datasets.pipelines import Compose
+
+
+@DATASETS.register_module()
+class FewShotBaseDataset(Dataset, metaclass=ABCMeta):
+
+ def __init__(self,
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ test_mode=False):
+ self.image_info = {}
+ self.ann_info = {}
+
+ self.annotations_path = ann_file
+ if not img_prefix.endswith('/'):
+ img_prefix = img_prefix + '/'
+ self.img_prefix = img_prefix
+ self.pipeline = pipeline
+ self.test_mode = test_mode
+
+ self.ann_info['image_size'] = np.array(data_cfg['image_size'])
+ self.ann_info['heatmap_size'] = np.array(data_cfg['heatmap_size'])
+ self.ann_info['num_joints'] = data_cfg['num_joints']
+
+ self.ann_info['flip_pairs'] = None
+
+ self.ann_info['inference_channel'] = data_cfg['inference_channel']
+ self.ann_info['num_output_channels'] = data_cfg['num_output_channels']
+ self.ann_info['dataset_channel'] = data_cfg['dataset_channel']
+
+ self.db = []
+ self.num_shots = 1
+ self.paired_samples = []
+ self.pipeline = Compose(self.pipeline)
+
+ @abstractmethod
+ def _get_db(self):
+ """Load dataset."""
+ raise NotImplementedError
+
+ @abstractmethod
+ def _select_kpt(self, obj, kpt_id):
+ """Select kpt."""
+ raise NotImplementedError
+
+ @abstractmethod
+ def evaluate(self, cfg, preds, output_dir, *args, **kwargs):
+ """Evaluate keypoint results."""
+ raise NotImplementedError
+
+ @staticmethod
+ def _write_keypoint_results(keypoints, res_file):
+ """Write results into a json file."""
+
+ with open(res_file, 'w') as f:
+ json.dump(keypoints, f, sort_keys=True, indent=4)
+
+ def _report_metric(self,
+ res_file,
+ metrics,
+ pck_thr=0.2,
+ pckh_thr=0.7,
+ auc_nor=30):
+ """Keypoint evaluation.
+
+ Args:
+ res_file (str): Json file stored prediction results.
+ metrics (str | list[str]): Metric to be performed.
+ Options: 'PCK', 'PCKh', 'AUC', 'EPE'.
+ pck_thr (float): PCK threshold, default as 0.2.
+ pckh_thr (float): PCKh threshold, default as 0.7.
+ auc_nor (float): AUC normalization factor, default as 30 pixel.
+
+ Returns:
+ List: Evaluation results for evaluation metric.
+ """
+ info_str = []
+
+ with open(res_file, 'r') as fin:
+ preds = json.load(fin)
+ assert len(preds) == len(self.paired_samples)
+
+ outputs = []
+ gts = []
+ masks = []
+ threshold_bbox = []
+ threshold_head_box = []
+
+ for pred, pair in zip(preds, self.paired_samples):
+ item = self.db[pair[-1]]
+ outputs.append(np.array(pred['keypoints'])[:, :-1])
+ gts.append(np.array(item['joints_3d'])[:, :-1])
+
+ mask_query = ((np.array(item['joints_3d_visible'])[:, 0]) > 0)
+ mask_sample = ((np.array(
+ self.db[pair[0]]['joints_3d_visible'])[:, 0]) > 0)
+ for id_s in pair[:-1]:
+ mask_sample = np.bitwise_and(
+ mask_sample,
+ ((np.array(self.db[id_s]['joints_3d_visible'])[:, 0]) > 0))
+ masks.append(np.bitwise_and(mask_query, mask_sample))
+
+ if 'PCK' in metrics:
+ bbox = np.array(item['bbox'])
+ bbox_thr = np.max(bbox[2:])
+ threshold_bbox.append(np.array([bbox_thr, bbox_thr]))
+ if 'PCKh' in metrics:
+ head_box_thr = item['head_size']
+ threshold_head_box.append(
+ np.array([head_box_thr, head_box_thr]))
+
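+        # PCK: a predicted keypoint counts as correct when its distance to
+        # the ground truth, normalized by the longer side of the instance
+        # bbox collected above, falls below ``pck_thr``; the per-sample
+        # accuracies are then averaged.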
+ if 'PCK' in metrics:
+ pck_avg = []
+ for (output, gt, mask, thr_bbox) in zip(outputs, gts, masks,
+ threshold_bbox):
+ _, pck, _ = keypoint_pck_accuracy(
+ np.expand_dims(output, 0), np.expand_dims(gt, 0),
+ np.expand_dims(mask, 0), pck_thr,
+ np.expand_dims(thr_bbox, 0))
+ pck_avg.append(pck)
+ info_str.append(('PCK', np.mean(pck_avg)))
+
+ return info_str
+
+ def _merge_obj(self, Xs_list, Xq, idx):
+ """merge Xs_list and Xq.
+
+ :param Xs_list: N-shot samples X
+ :param Xq: query X
+ :param idx: id of paired_samples
+ :return: Xall
+ """
+ Xall = dict()
+ Xall['img_s'] = [Xs['img'] for Xs in Xs_list]
+ Xall['target_s'] = [Xs['target'] for Xs in Xs_list]
+ Xall['target_weight_s'] = [Xs['target_weight'] for Xs in Xs_list]
+ xs_img_metas = [Xs['img_metas'].data for Xs in Xs_list]
+
+ Xall['img_q'] = Xq['img']
+ Xall['target_q'] = Xq['target']
+ Xall['target_weight_q'] = Xq['target_weight']
+ xq_img_metas = Xq['img_metas'].data
+
+ img_metas = dict()
+ for key in xq_img_metas.keys():
+ img_metas['sample_' + key] = [
+ xs_img_meta[key] for xs_img_meta in xs_img_metas
+ ]
+ img_metas['query_' + key] = xq_img_metas[key]
+ img_metas['bbox_id'] = idx
+
+ Xall['img_metas'] = DC(img_metas, cpu_only=True)
+
+ return Xall
+
+ def __len__(self):
+ """Get the size of the dataset."""
+ return len(self.paired_samples)
+
+ def __getitem__(self, idx):
+ """Get the sample given index."""
+
+ pair_ids = self.paired_samples[idx]
+ assert len(pair_ids) == self.num_shots + 1
+ sample_id_list = pair_ids[:self.num_shots]
+ query_id = pair_ids[-1]
+
+ sample_obj_list = []
+ for sample_id in sample_id_list:
+ sample_obj = copy.deepcopy(self.db[sample_id])
+ sample_obj['ann_info'] = copy.deepcopy(self.ann_info)
+ sample_obj_list.append(sample_obj)
+
+ query_obj = copy.deepcopy(self.db[query_id])
+ query_obj['ann_info'] = copy.deepcopy(self.ann_info)
+
+ if not self.test_mode:
+ # randomly select "one" keypoint
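+            # prefer a keypoint visible in every support sample and in the
+            # query; otherwise fall back to support-only, then query-only,
+            # then any keypoint index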
+ sample_valid = (sample_obj_list[0]['joints_3d_visible'][:, 0] > 0)
+ for sample_obj in sample_obj_list:
+ sample_valid = sample_valid & (
+ sample_obj['joints_3d_visible'][:, 0] > 0)
+ query_valid = (query_obj['joints_3d_visible'][:, 0] > 0)
+
+ valid_s = np.where(sample_valid)[0]
+ valid_q = np.where(query_valid)[0]
+ valid_sq = np.where(sample_valid & query_valid)[0]
+ if len(valid_sq) > 0:
+ kpt_id = np.random.choice(valid_sq)
+ elif len(valid_s) > 0:
+ kpt_id = np.random.choice(valid_s)
+ elif len(valid_q) > 0:
+ kpt_id = np.random.choice(valid_q)
+ else:
+ kpt_id = np.random.choice(np.array(range(len(query_valid))))
+
+ for i in range(self.num_shots):
+ sample_obj_list[i] = self._select_kpt(sample_obj_list[i],
+ kpt_id)
+ query_obj = self._select_kpt(query_obj, kpt_id)
+
+        # at test time, all keypoints are preserved.
+
+ Xs_list = []
+ for sample_obj in sample_obj_list:
+ Xs = self.pipeline(sample_obj)
+ Xs_list.append(Xs)
+ Xq = self.pipeline(query_obj)
+
+ Xall = self._merge_obj(Xs_list, Xq, idx)
+ Xall['skeleton'] = self.db[query_id]['skeleton']
+
+ return Xall
+
+ def _sort_and_unique_bboxes(self, kpts, key='bbox_id'):
+ """sort kpts and remove the repeated ones."""
+ kpts = sorted(kpts, key=lambda x: x[key])
+ num = len(kpts)
+ for i in range(num - 1, 0, -1):
+ if kpts[i][key] == kpts[i - 1][key]:
+ del kpts[i]
+
+ return kpts
diff --git a/projects/pose_anything/datasets/datasets/mp100/fewshot_dataset.py b/projects/pose_anything/datasets/datasets/mp100/fewshot_dataset.py
new file mode 100644
index 0000000000..bd8287966c
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/fewshot_dataset.py
@@ -0,0 +1,334 @@
+import os
+import random
+from collections import OrderedDict
+
+import numpy as np
+from xtcocotools.coco import COCO
+
+from mmpose.datasets import DATASETS
+from .fewshot_base_dataset import FewShotBaseDataset
+
+
+@DATASETS.register_module()
+class FewShotKeypointDataset(FewShotBaseDataset):
+
+ def __init__(self,
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ valid_class_ids,
+ num_shots=1,
+ num_queries=100,
+ num_episodes=1,
+ test_mode=False):
+ super().__init__(
+ ann_file, img_prefix, data_cfg, pipeline, test_mode=test_mode)
+
+ self.ann_info['flip_pairs'] = []
+
+ self.ann_info['upper_body_ids'] = []
+ self.ann_info['lower_body_ids'] = []
+
+ self.ann_info['use_different_joint_weights'] = False
+ self.ann_info['joint_weights'] = np.array([
+ 1.,
+ ], dtype=np.float32).reshape((self.ann_info['num_joints'], 1))
+
+ self.coco = COCO(ann_file)
+
+ self.id2name, self.name2id = self._get_mapping_id_name(self.coco.imgs)
+ self.img_ids = self.coco.getImgIds()
+ self.classes = [
+ cat['name'] for cat in self.coco.loadCats(self.coco.getCatIds())
+ ]
+
+ self.num_classes = len(self.classes)
+ self._class_to_ind = dict(zip(self.classes, self.coco.getCatIds()))
+ self._ind_to_class = dict(zip(self.coco.getCatIds(), self.classes))
+
+ if valid_class_ids is not None:
+ self.valid_class_ids = valid_class_ids
+ else:
+ self.valid_class_ids = self.coco.getCatIds()
+ self.valid_classes = [
+ self._ind_to_class[ind] for ind in self.valid_class_ids
+ ]
+
+ self.cats = self.coco.cats
+
+ # Also update self.cat2obj
+ self.db = self._get_db()
+
+ self.num_shots = num_shots
+
+ if not test_mode:
+ # Update every training epoch
+ self.random_paired_samples()
+ else:
+ self.num_queries = num_queries
+ self.num_episodes = num_episodes
+ self.make_paired_samples()
+
+ def random_paired_samples(self):
+ num_datas = [
+ len(self.cat2obj[self._class_to_ind[cls]])
+ for cls in self.valid_classes
+ ]
+
+ # balance the dataset
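+        # each valid class contributes `max_num_data` sampled tuples below,
+        # so rarer classes are effectively oversampled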
+ max_num_data = max(num_datas)
+
+ all_samples = []
+ for cls in self.valid_class_ids:
+ for i in range(max_num_data):
+ shot = random.sample(self.cat2obj[cls], self.num_shots + 1)
+ all_samples.append(shot)
+
+ self.paired_samples = np.array(all_samples)
+ np.random.shuffle(self.paired_samples)
+
+ def make_paired_samples(self):
+ random.seed(1)
+ np.random.seed(0)
+
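+        # The fixed seeds above keep test-time episodes reproducible. For
+        # each valid category, `num_shots` support ids plus `num_queries`
+        # query ids are drawn per episode, and every query id yields one
+        # (support ids..., query id) tuple in `paired_samples`.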
+ all_samples = []
+ for cls in self.valid_class_ids:
+ for _ in range(self.num_episodes):
+ shots = random.sample(self.cat2obj[cls],
+ self.num_shots + self.num_queries)
+ sample_ids = shots[:self.num_shots]
+ query_ids = shots[self.num_shots:]
+ for query_id in query_ids:
+ all_samples.append(sample_ids + [query_id])
+
+ self.paired_samples = np.array(all_samples)
+
+ def _select_kpt(self, obj, kpt_id):
+ obj['joints_3d'] = obj['joints_3d'][kpt_id:kpt_id + 1]
+ obj['joints_3d_visible'] = obj['joints_3d_visible'][kpt_id:kpt_id + 1]
+ obj['kpt_id'] = kpt_id
+
+ return obj
+
+ @staticmethod
+ def _get_mapping_id_name(imgs):
+ """
+ Args:
+ imgs (dict): dict of image info.
+
+ Returns:
+ tuple: Image name & id mapping dicts.
+
+ - id2name (dict): Mapping image id to name.
+ - name2id (dict): Mapping image name to id.
+ """
+ id2name = {}
+ name2id = {}
+ for image_id, image in imgs.items():
+ file_name = image['file_name']
+ id2name[image_id] = file_name
+ name2id[file_name] = image_id
+
+ return id2name, name2id
+
+ def _get_db(self):
+ """Ground truth bbox and keypoints."""
+ self.obj_id = 0
+
+ self.cat2obj = {}
+ for i in self.coco.getCatIds():
+ self.cat2obj.update({i: []})
+
+ gt_db = []
+ for img_id in self.img_ids:
+ gt_db.extend(self._load_coco_keypoint_annotation_kernel(img_id))
+ return gt_db
+
+ def _load_coco_keypoint_annotation_kernel(self, img_id):
+ """load annotation from COCOAPI.
+
+ Note:
+ bbox:[x1, y1, w, h]
+ Args:
+ img_id: coco image id
+ Returns:
+ dict: db entry
+ """
+ img_ann = self.coco.loadImgs(img_id)[0]
+ width = img_ann['width']
+ height = img_ann['height']
+
+ ann_ids = self.coco.getAnnIds(imgIds=img_id, iscrowd=False)
+ objs = self.coco.loadAnns(ann_ids)
+
+ # sanitize bboxes
+ valid_objs = []
+ for obj in objs:
+ if 'bbox' not in obj:
+ continue
+ x, y, w, h = obj['bbox']
+ x1 = max(0, x)
+ y1 = max(0, y)
+ x2 = min(width - 1, x1 + max(0, w - 1))
+ y2 = min(height - 1, y1 + max(0, h - 1))
+ if ('area' not in obj or obj['area'] > 0) and x2 > x1 and y2 > y1:
+ obj['clean_bbox'] = [x1, y1, x2 - x1, y2 - y1]
+ valid_objs.append(obj)
+ objs = valid_objs
+
+ bbox_id = 0
+ rec = []
+ for obj in objs:
+ if 'keypoints' not in obj:
+ continue
+ if max(obj['keypoints']) == 0:
+ continue
+ if 'num_keypoints' in obj and obj['num_keypoints'] == 0:
+ continue
+
+ category_id = obj['category_id']
+            # the number of keypoints for this specific category
+ cat_kpt_num = int(len(obj['keypoints']) / 3)
+
+ joints_3d = np.zeros((cat_kpt_num, 3), dtype=np.float32)
+ joints_3d_visible = np.zeros((cat_kpt_num, 3), dtype=np.float32)
+
+ keypoints = np.array(obj['keypoints']).reshape(-1, 3)
+ joints_3d[:, :2] = keypoints[:, :2]
+ joints_3d_visible[:, :2] = np.minimum(1, keypoints[:, 2:3])
+
+ center, scale = self._xywh2cs(*obj['clean_bbox'][:4])
+
+ image_file = os.path.join(self.img_prefix, self.id2name[img_id])
+
+ self.cat2obj[category_id].append(self.obj_id)
+
+ rec.append({
+ 'image_file':
+ image_file,
+ 'center':
+ center,
+ 'scale':
+ scale,
+ 'rotation':
+ 0,
+ 'bbox':
+ obj['clean_bbox'][:4],
+ 'bbox_score':
+ 1,
+ 'joints_3d':
+ joints_3d,
+ 'joints_3d_visible':
+ joints_3d_visible,
+ 'category_id':
+ category_id,
+ 'cat_kpt_num':
+ cat_kpt_num,
+ 'bbox_id':
+ self.obj_id,
+ 'skeleton':
+ self.coco.cats[obj['category_id']]['skeleton'],
+ })
+ bbox_id = bbox_id + 1
+ self.obj_id += 1
+
+ return rec
+
+ def _xywh2cs(self, x, y, w, h):
+ """This encodes bbox(x,y,w,w) into (center, scale)
+
+ Args:
+ x, y, w, h
+
+ Returns:
+ tuple: A tuple containing center and scale.
+
+ - center (np.ndarray[float32](2,)): center of the bbox (x, y).
+ - scale (np.ndarray[float32](2,)): scale of the bbox w & h.
+ """
+ aspect_ratio = self.ann_info['image_size'][0] / self.ann_info[
+ 'image_size'][1]
+ center = np.array([x + w * 0.5, y + h * 0.5], dtype=np.float32)
+ #
+ # if (not self.test_mode) and np.random.rand() < 0.3:
+ # center += 0.4 * (np.random.rand(2) - 0.5) * [w, h]
+
+ if w > aspect_ratio * h:
+ h = w * 1.0 / aspect_ratio
+ elif w < aspect_ratio * h:
+ w = h * aspect_ratio
+
+ # pixel std is 200.0
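+        # e.g. an aspect-corrected 200x200 box maps to scale [1., 1.] here
+        # and to [1.25, 1.25] after the padding factor below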
+ scale = np.array([w / 200.0, h / 200.0], dtype=np.float32)
+ # padding to include proper amount of context
+ scale = scale * 1.25
+
+ return center, scale
+
+ def evaluate(self, outputs, res_folder, metric='PCK', **kwargs):
+ """Evaluate interhand2d keypoint results. The pose prediction results
+ will be saved in `${res_folder}/result_keypoints.json`.
+
+ Note:
+ batch_size: N
+ num_keypoints: K
+ heatmap height: H
+ heatmap width: W
+
+ Args:
+ outputs (list(preds, boxes, image_path, output_heatmap))
+ :preds (np.ndarray[N,K,3]): The first two dimensions are
+ coordinates, score is the third dimension of the array.
+ :boxes (np.ndarray[N,6]): [center[0], center[1], scale[0]
+ , scale[1],area, score]
+            :image_paths (list[str]): For example,
+                ['Capture12/0390_dh_touchROM/cam410209/image62434.jpg']
+ :output_heatmap (np.ndarray[N, K, H, W]): model outputs.
+
+ res_folder (str): Path of directory to save the results.
+ metric (str | list[str]): Metric to be performed.
+ Options: 'PCK', 'AUC', 'EPE'.
+
+ Returns:
+ dict: Evaluation results for evaluation metric.
+ """
+ metrics = metric if isinstance(metric, list) else [metric]
+ allowed_metrics = ['PCK', 'AUC', 'EPE']
+ for metric in metrics:
+ if metric not in allowed_metrics:
+ raise KeyError(f'metric {metric} is not supported')
+
+ res_file = os.path.join(res_folder, 'result_keypoints.json')
+
+ kpts = []
+ for output in outputs:
+ preds = output['preds']
+ boxes = output['boxes']
+ image_paths = output['image_paths']
+ bbox_ids = output['bbox_ids']
+
+ batch_size = len(image_paths)
+ for i in range(batch_size):
+ image_id = self.name2id[image_paths[i][len(self.img_prefix):]]
+
+ kpts.append({
+ 'keypoints': preds[i].tolist(),
+ 'center': boxes[i][0:2].tolist(),
+ 'scale': boxes[i][2:4].tolist(),
+ 'area': float(boxes[i][4]),
+ 'score': float(boxes[i][5]),
+ 'image_id': image_id,
+ 'bbox_id': bbox_ids[i]
+ })
+ kpts = self._sort_and_unique_bboxes(kpts)
+
+ self._write_keypoint_results(kpts, res_file)
+ info_str = self._report_metric(res_file, metrics)
+ name_value = OrderedDict(info_str)
+
+ return name_value
diff --git a/projects/pose_anything/datasets/datasets/mp100/test_base_dataset.py b/projects/pose_anything/datasets/datasets/mp100/test_base_dataset.py
new file mode 100644
index 0000000000..8fffc84239
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/test_base_dataset.py
@@ -0,0 +1,248 @@
+import copy
+from abc import ABCMeta, abstractmethod
+
+import json_tricks as json
+import numpy as np
+from mmcv.parallel import DataContainer as DC
+from torch.utils.data import Dataset
+
+from mmpose.core.evaluation.top_down_eval import (keypoint_auc, keypoint_epe,
+ keypoint_nme,
+ keypoint_pck_accuracy)
+from mmpose.datasets import DATASETS
+from mmpose.datasets.pipelines import Compose
+
+
+@DATASETS.register_module()
+class TestBaseDataset(Dataset, metaclass=ABCMeta):
+
+ def __init__(self,
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ test_mode=True,
+ PCK_threshold_list=[0.05, 0.1, 0.15, 0.2, 0.25]):
+ self.image_info = {}
+ self.ann_info = {}
+
+ self.annotations_path = ann_file
+ if not img_prefix.endswith('/'):
+ img_prefix = img_prefix + '/'
+ self.img_prefix = img_prefix
+ self.pipeline = pipeline
+ self.test_mode = test_mode
+ self.PCK_threshold_list = PCK_threshold_list
+
+ self.ann_info['image_size'] = np.array(data_cfg['image_size'])
+ self.ann_info['heatmap_size'] = np.array(data_cfg['heatmap_size'])
+ self.ann_info['num_joints'] = data_cfg['num_joints']
+
+ self.ann_info['flip_pairs'] = None
+
+ self.ann_info['inference_channel'] = data_cfg['inference_channel']
+ self.ann_info['num_output_channels'] = data_cfg['num_output_channels']
+ self.ann_info['dataset_channel'] = data_cfg['dataset_channel']
+
+ self.db = []
+ self.num_shots = 1
+ self.paired_samples = []
+ self.pipeline = Compose(self.pipeline)
+
+ @abstractmethod
+ def _get_db(self):
+ """Load dataset."""
+ raise NotImplementedError
+
+ @abstractmethod
+ def _select_kpt(self, obj, kpt_id):
+ """Select kpt."""
+ raise NotImplementedError
+
+ @abstractmethod
+ def evaluate(self, cfg, preds, output_dir, *args, **kwargs):
+ """Evaluate keypoint results."""
+ raise NotImplementedError
+
+ @staticmethod
+ def _write_keypoint_results(keypoints, res_file):
+ """Write results into a json file."""
+
+ with open(res_file, 'w') as f:
+ json.dump(keypoints, f, sort_keys=True, indent=4)
+
+ def _report_metric(self, res_file, metrics):
+ """Keypoint evaluation.
+
+ Args:
+ res_file (str): Json file stored prediction results.
+            metrics (str | list[str]): Metric to be performed.
+                Options: 'PCK', 'NME', 'AUC', 'EPE'. PCK is evaluated at
+                every threshold in ``self.PCK_threshold_list``.
+
+ Returns:
+ List: Evaluation results for evaluation metric.
+ """
+ info_str = []
+
+ with open(res_file, 'r') as fin:
+ preds = json.load(fin)
+ assert len(preds) == len(self.paired_samples)
+
+ outputs = []
+ gts = []
+ masks = []
+ threshold_bbox = []
+ threshold_head_box = []
+
+ for pred, pair in zip(preds, self.paired_samples):
+ item = self.db[pair[-1]]
+ outputs.append(np.array(pred['keypoints'])[:, :-1])
+ gts.append(np.array(item['joints_3d'])[:, :-1])
+
+ mask_query = ((np.array(item['joints_3d_visible'])[:, 0]) > 0)
+ mask_sample = ((np.array(
+ self.db[pair[0]]['joints_3d_visible'])[:, 0]) > 0)
+ for id_s in pair[:-1]:
+ mask_sample = np.bitwise_and(
+ mask_sample,
+ ((np.array(self.db[id_s]['joints_3d_visible'])[:, 0]) > 0))
+ masks.append(np.bitwise_and(mask_query, mask_sample))
+
+ if 'PCK' in metrics or 'NME' in metrics or 'AUC' in metrics:
+ bbox = np.array(item['bbox'])
+ bbox_thr = np.max(bbox[2:])
+ threshold_bbox.append(np.array([bbox_thr, bbox_thr]))
+ if 'PCKh' in metrics:
+ head_box_thr = item['head_size']
+ threshold_head_box.append(
+ np.array([head_box_thr, head_box_thr]))
+
+ if 'PCK' in metrics:
+ pck_results = dict()
+ for pck_thr in self.PCK_threshold_list:
+ pck_results[pck_thr] = []
+
+ for (output, gt, mask, thr_bbox) in zip(outputs, gts, masks,
+ threshold_bbox):
+ for pck_thr in self.PCK_threshold_list:
+ _, pck, _ = keypoint_pck_accuracy(
+ np.expand_dims(output, 0), np.expand_dims(gt, 0),
+ np.expand_dims(mask, 0), pck_thr,
+ np.expand_dims(thr_bbox, 0))
+ pck_results[pck_thr].append(pck)
+
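+            # mPCK averages PCK over all thresholds in PCK_threshold_list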
+ mPCK = 0
+ for pck_thr in self.PCK_threshold_list:
+ info_str.append(
+ ['PCK@' + str(pck_thr),
+ np.mean(pck_results[pck_thr])])
+ mPCK += np.mean(pck_results[pck_thr])
+ info_str.append(['mPCK', mPCK / len(self.PCK_threshold_list)])
+
+ if 'NME' in metrics:
+ nme_results = []
+ for (output, gt, mask, thr_bbox) in zip(outputs, gts, masks,
+ threshold_bbox):
+ nme = keypoint_nme(
+ np.expand_dims(output, 0), np.expand_dims(gt, 0),
+ np.expand_dims(mask, 0), np.expand_dims(thr_bbox, 0))
+ nme_results.append(nme)
+ info_str.append(['NME', np.mean(nme_results)])
+
+ if 'AUC' in metrics:
+ auc_results = []
+ for (output, gt, mask, thr_bbox) in zip(outputs, gts, masks,
+ threshold_bbox):
+ auc = keypoint_auc(
+ np.expand_dims(output, 0), np.expand_dims(gt, 0),
+ np.expand_dims(mask, 0), thr_bbox[0])
+ auc_results.append(auc)
+ info_str.append(['AUC', np.mean(auc_results)])
+
+ if 'EPE' in metrics:
+ epe_results = []
+ for (output, gt, mask) in zip(outputs, gts, masks):
+ epe = keypoint_epe(
+ np.expand_dims(output, 0), np.expand_dims(gt, 0),
+ np.expand_dims(mask, 0))
+ epe_results.append(epe)
+ info_str.append(['EPE', np.mean(epe_results)])
+ return info_str
+
+ def _merge_obj(self, Xs_list, Xq, idx):
+ """merge Xs_list and Xq.
+
+ :param Xs_list: N-shot samples X
+ :param Xq: query X
+ :param idx: id of paired_samples
+ :return: Xall
+ """
+ Xall = dict()
+ Xall['img_s'] = [Xs['img'] for Xs in Xs_list]
+ Xall['target_s'] = [Xs['target'] for Xs in Xs_list]
+ Xall['target_weight_s'] = [Xs['target_weight'] for Xs in Xs_list]
+ xs_img_metas = [Xs['img_metas'].data for Xs in Xs_list]
+
+ Xall['img_q'] = Xq['img']
+ Xall['target_q'] = Xq['target']
+ Xall['target_weight_q'] = Xq['target_weight']
+ xq_img_metas = Xq['img_metas'].data
+
+ img_metas = dict()
+ for key in xq_img_metas.keys():
+ img_metas['sample_' + key] = [
+ xs_img_meta[key] for xs_img_meta in xs_img_metas
+ ]
+ img_metas['query_' + key] = xq_img_metas[key]
+ img_metas['bbox_id'] = idx
+
+ Xall['img_metas'] = DC(img_metas, cpu_only=True)
+
+ return Xall
+
+ def __len__(self):
+ """Get the size of the dataset."""
+ return len(self.paired_samples)
+
+ def __getitem__(self, idx):
+ """Get the sample given index."""
+
+        pair_ids = self.paired_samples[idx]  # [support ids..., query id]
+ assert len(pair_ids) == self.num_shots + 1
+ sample_id_list = pair_ids[:self.num_shots]
+ query_id = pair_ids[-1]
+
+ sample_obj_list = []
+ for sample_id in sample_id_list:
+ sample_obj = copy.deepcopy(self.db[sample_id])
+ sample_obj['ann_info'] = copy.deepcopy(self.ann_info)
+ sample_obj_list.append(sample_obj)
+
+ query_obj = copy.deepcopy(self.db[query_id])
+ query_obj['ann_info'] = copy.deepcopy(self.ann_info)
+
+ Xs_list = []
+ for sample_obj in sample_obj_list:
+ Xs = self.pipeline(
+ sample_obj
+ ) # dict with ['img', 'target', 'target_weight', 'img_metas'],
+ Xs_list.append(Xs) # Xs['target'] is of shape [100, map_h, map_w]
+ Xq = self.pipeline(query_obj)
+
+ Xall = self._merge_obj(Xs_list, Xq, idx)
+ Xall['skeleton'] = self.db[query_id]['skeleton']
+
+ return Xall
+
+ def _sort_and_unique_bboxes(self, kpts, key='bbox_id'):
+ """sort kpts and remove the repeated ones."""
+ kpts = sorted(kpts, key=lambda x: x[key])
+ num = len(kpts)
+ for i in range(num - 1, 0, -1):
+ if kpts[i][key] == kpts[i - 1][key]:
+ del kpts[i]
+
+ return kpts
diff --git a/projects/pose_anything/datasets/datasets/mp100/test_dataset.py b/projects/pose_anything/datasets/datasets/mp100/test_dataset.py
new file mode 100644
index 0000000000..ca0dc7ac6a
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/test_dataset.py
@@ -0,0 +1,347 @@
+import os
+import random
+from collections import OrderedDict
+
+import numpy as np
+from xtcocotools.coco import COCO
+
+from mmpose.datasets import DATASETS
+from .test_base_dataset import TestBaseDataset
+
+
+@DATASETS.register_module()
+class TestPoseDataset(TestBaseDataset):
+
+ def __init__(self,
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ valid_class_ids,
+ max_kpt_num=None,
+ num_shots=1,
+ num_queries=100,
+ num_episodes=1,
+ pck_threshold_list=[0.05, 0.1, 0.15, 0.20, 0.25],
+ test_mode=True):
+ super().__init__(
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ test_mode=test_mode,
+ PCK_threshold_list=pck_threshold_list)
+
+ self.ann_info['flip_pairs'] = []
+
+ self.ann_info['upper_body_ids'] = []
+ self.ann_info['lower_body_ids'] = []
+
+ self.ann_info['use_different_joint_weights'] = False
+ self.ann_info['joint_weights'] = np.array([
+ 1.,
+ ], dtype=np.float32).reshape((self.ann_info['num_joints'], 1))
+
+ self.coco = COCO(ann_file)
+
+ self.id2name, self.name2id = self._get_mapping_id_name(self.coco.imgs)
+ self.img_ids = self.coco.getImgIds()
+ self.classes = [
+ cat['name'] for cat in self.coco.loadCats(self.coco.getCatIds())
+ ]
+
+ self.num_classes = len(self.classes)
+ self._class_to_ind = dict(zip(self.classes, self.coco.getCatIds()))
+ self._ind_to_class = dict(zip(self.coco.getCatIds(), self.classes))
+
+ if valid_class_ids is not None: # None by default
+ self.valid_class_ids = valid_class_ids
+ else:
+ self.valid_class_ids = self.coco.getCatIds()
+ self.valid_classes = [
+ self._ind_to_class[ind] for ind in self.valid_class_ids
+ ]
+
+ self.cats = self.coco.cats
+ self.max_kpt_num = max_kpt_num
+
+ # Also update self.cat2obj
+ self.db = self._get_db()
+
+ self.num_shots = num_shots
+
+ if not test_mode:
+ # Update every training epoch
+ self.random_paired_samples()
+ else:
+ self.num_queries = num_queries
+ self.num_episodes = num_episodes
+ self.make_paired_samples()
+
+ def random_paired_samples(self):
+ num_datas = [
+ len(self.cat2obj[self._class_to_ind[cls]])
+ for cls in self.valid_classes
+ ]
+
+ # balance the dataset
+ max_num_data = max(num_datas)
+
+ all_samples = []
+ for cls in self.valid_class_ids:
+ for i in range(max_num_data):
+ shot = random.sample(self.cat2obj[cls], self.num_shots + 1)
+ all_samples.append(shot)
+
+ self.paired_samples = np.array(all_samples)
+ np.random.shuffle(self.paired_samples)
+
+ def make_paired_samples(self):
+ random.seed(1)
+ np.random.seed(0)
+
+ all_samples = []
+ for cls in self.valid_class_ids:
+ for _ in range(self.num_episodes):
+ shots = random.sample(self.cat2obj[cls],
+ self.num_shots + self.num_queries)
+ sample_ids = shots[:self.num_shots]
+ query_ids = shots[self.num_shots:]
+ for query_id in query_ids:
+ all_samples.append(sample_ids + [query_id])
+
+ self.paired_samples = np.array(all_samples)
+
+ def _select_kpt(self, obj, kpt_id):
+ obj['joints_3d'] = obj['joints_3d'][kpt_id:kpt_id + 1]
+ obj['joints_3d_visible'] = obj['joints_3d_visible'][kpt_id:kpt_id + 1]
+ obj['kpt_id'] = kpt_id
+
+ return obj
+
+ @staticmethod
+ def _get_mapping_id_name(imgs):
+ """
+ Args:
+ imgs (dict): dict of image info.
+
+ Returns:
+ tuple: Image name & id mapping dicts.
+
+ - id2name (dict): Mapping image id to name.
+ - name2id (dict): Mapping image name to id.
+ """
+ id2name = {}
+ name2id = {}
+ for image_id, image in imgs.items():
+ file_name = image['file_name']
+ id2name[image_id] = file_name
+ name2id[file_name] = image_id
+
+ return id2name, name2id
+
+ def _get_db(self):
+ """Ground truth bbox and keypoints."""
+ self.obj_id = 0
+
+ self.cat2obj = {}
+ for i in self.coco.getCatIds():
+ self.cat2obj.update({i: []})
+
+ gt_db = []
+ for img_id in self.img_ids:
+ gt_db.extend(self._load_coco_keypoint_annotation_kernel(img_id))
+ return gt_db
+
+ def _load_coco_keypoint_annotation_kernel(self, img_id):
+ """load annotation from COCOAPI.
+
+ Note:
+ bbox:[x1, y1, w, h]
+ Args:
+ img_id: coco image id
+ Returns:
+ dict: db entry
+ """
+ img_ann = self.coco.loadImgs(img_id)[0]
+ width = img_ann['width']
+ height = img_ann['height']
+
+ ann_ids = self.coco.getAnnIds(imgIds=img_id, iscrowd=False)
+ objs = self.coco.loadAnns(ann_ids)
+
+ # sanitize bboxes
+ valid_objs = []
+ for obj in objs:
+ if 'bbox' not in obj:
+ continue
+ x, y, w, h = obj['bbox']
+ x1 = max(0, x)
+ y1 = max(0, y)
+ x2 = min(width - 1, x1 + max(0, w - 1))
+ y2 = min(height - 1, y1 + max(0, h - 1))
+ if ('area' not in obj or obj['area'] > 0) and x2 > x1 and y2 > y1:
+ obj['clean_bbox'] = [x1, y1, x2 - x1, y2 - y1]
+ valid_objs.append(obj)
+ objs = valid_objs
+
+ bbox_id = 0
+ rec = []
+ for obj in objs:
+ if 'keypoints' not in obj:
+ continue
+ if max(obj['keypoints']) == 0:
+ continue
+ if 'num_keypoints' in obj and obj['num_keypoints'] == 0:
+ continue
+
+ category_id = obj['category_id']
+            # the number of keypoints for this specific category
+ cat_kpt_num = int(len(obj['keypoints']) / 3)
+ if self.max_kpt_num is None:
+ kpt_num = cat_kpt_num
+ else:
+ kpt_num = self.max_kpt_num
+
+ joints_3d = np.zeros((kpt_num, 3), dtype=np.float32)
+ joints_3d_visible = np.zeros((kpt_num, 3), dtype=np.float32)
+
+ keypoints = np.array(obj['keypoints']).reshape(-1, 3)
+ joints_3d[:cat_kpt_num, :2] = keypoints[:, :2]
+ joints_3d_visible[:cat_kpt_num, :2] = np.minimum(
+ 1, keypoints[:, 2:3])
+
+ center, scale = self._xywh2cs(*obj['clean_bbox'][:4])
+
+ image_file = os.path.join(self.img_prefix, self.id2name[img_id])
+
+ self.cat2obj[category_id].append(self.obj_id)
+
+ rec.append({
+ 'image_file':
+ image_file,
+ 'center':
+ center,
+ 'scale':
+ scale,
+ 'rotation':
+ 0,
+ 'bbox':
+ obj['clean_bbox'][:4],
+ 'bbox_score':
+ 1,
+ 'joints_3d':
+ joints_3d,
+ 'joints_3d_visible':
+ joints_3d_visible,
+ 'category_id':
+ category_id,
+ 'cat_kpt_num':
+ cat_kpt_num,
+ 'bbox_id':
+ self.obj_id,
+ 'skeleton':
+ self.coco.cats[obj['category_id']]['skeleton'],
+ })
+ bbox_id = bbox_id + 1
+ self.obj_id += 1
+
+ return rec
+
+ def _xywh2cs(self, x, y, w, h):
+ """This encodes bbox(x,y,w,w) into (center, scale)
+
+ Args:
+ x, y, w, h
+
+ Returns:
+ tuple: A tuple containing center and scale.
+
+ - center (np.ndarray[float32](2,)): center of the bbox (x, y).
+ - scale (np.ndarray[float32](2,)): scale of the bbox w & h.
+ """
+ aspect_ratio = self.ann_info['image_size'][0] / self.ann_info[
+ 'image_size'][1]
+ center = np.array([x + w * 0.5, y + h * 0.5], dtype=np.float32)
+ #
+ # if (not self.test_mode) and np.random.rand() < 0.3:
+ # center += 0.4 * (np.random.rand(2) - 0.5) * [w, h]
+
+ if w > aspect_ratio * h:
+ h = w * 1.0 / aspect_ratio
+ elif w < aspect_ratio * h:
+ w = h * aspect_ratio
+
+ # pixel std is 200.0
+ scale = np.array([w / 200.0, h / 200.0], dtype=np.float32)
+ # padding to include proper amount of context
+ scale = scale * 1.25
+
+ return center, scale
+
+ def evaluate(self, outputs, res_folder, metric='PCK', **kwargs):
+ """Evaluate interhand2d keypoint results. The pose prediction results
+ will be saved in `${res_folder}/result_keypoints.json`.
+
+ Note:
+ batch_size: N
+ num_keypoints: K
+ heatmap height: H
+ heatmap width: W
+
+ Args:
+ outputs (list(preds, boxes, image_path, output_heatmap))
+ :preds (np.ndarray[N,K,3]): The first two dimensions are
+ coordinates, score is the third dimension of the array.
+ :boxes (np.ndarray[N,6]): [center[0], center[1], scale[0]
+ , scale[1],area, score]
+            :image_paths (list[str]): For example,
+                ['Capture12/0390_dh_touchROM/cam410209/image62434.jpg']
+ :output_heatmap (np.ndarray[N, K, H, W]): model outputs.
+
+ res_folder (str): Path of directory to save the results.
+ metric (str | list[str]): Metric to be performed.
+                Options: 'PCK', 'AUC', 'EPE', 'NME'.
+
+ Returns:
+ dict: Evaluation results for evaluation metric.
+ """
+ metrics = metric if isinstance(metric, list) else [metric]
+ allowed_metrics = ['PCK', 'AUC', 'EPE', 'NME']
+ for metric in metrics:
+ if metric not in allowed_metrics:
+ raise KeyError(f'metric {metric} is not supported')
+
+ res_file = os.path.join(res_folder, 'result_keypoints.json')
+
+ kpts = []
+ for output in outputs:
+ preds = output['preds']
+ boxes = output['boxes']
+ image_paths = output['image_paths']
+ bbox_ids = output['bbox_ids']
+
+ batch_size = len(image_paths)
+ for i in range(batch_size):
+ image_id = self.name2id[image_paths[i][len(self.img_prefix):]]
+
+ kpts.append({
+ 'keypoints': preds[i].tolist(),
+ 'center': boxes[i][0:2].tolist(),
+ 'scale': boxes[i][2:4].tolist(),
+ 'area': float(boxes[i][4]),
+ 'score': float(boxes[i][5]),
+ 'image_id': image_id,
+ 'bbox_id': bbox_ids[i]
+ })
+ kpts = self._sort_and_unique_bboxes(kpts)
+
+ self._write_keypoint_results(kpts, res_file)
+ info_str = self._report_metric(res_file, metrics)
+ name_value = OrderedDict(info_str)
+
+ return name_value
diff --git a/projects/pose_anything/datasets/datasets/mp100/transformer_base_dataset.py b/projects/pose_anything/datasets/datasets/mp100/transformer_base_dataset.py
new file mode 100644
index 0000000000..9433596a0e
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/transformer_base_dataset.py
@@ -0,0 +1,209 @@
+import copy
+from abc import ABCMeta, abstractmethod
+
+import json_tricks as json
+import numpy as np
+from mmcv.parallel import DataContainer as DC
+from mmengine.dataset import Compose
+from torch.utils.data import Dataset
+
+from mmpose.core.evaluation.top_down_eval import keypoint_pck_accuracy
+from mmpose.datasets import DATASETS
+
+
+@DATASETS.register_module()
+class TransformerBaseDataset(Dataset, metaclass=ABCMeta):
+
+ def __init__(self,
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ test_mode=False):
+ self.image_info = {}
+ self.ann_info = {}
+
+ self.annotations_path = ann_file
+ if not img_prefix.endswith('/'):
+ img_prefix = img_prefix + '/'
+ self.img_prefix = img_prefix
+ self.pipeline = pipeline
+ self.test_mode = test_mode
+
+ self.ann_info['image_size'] = np.array(data_cfg['image_size'])
+ self.ann_info['heatmap_size'] = np.array(data_cfg['heatmap_size'])
+ self.ann_info['num_joints'] = data_cfg['num_joints']
+
+ self.ann_info['flip_pairs'] = None
+
+ self.ann_info['inference_channel'] = data_cfg['inference_channel']
+ self.ann_info['num_output_channels'] = data_cfg['num_output_channels']
+ self.ann_info['dataset_channel'] = data_cfg['dataset_channel']
+
+ self.db = []
+ self.num_shots = 1
+ self.paired_samples = []
+ self.pipeline = Compose(self.pipeline)
+
+ @abstractmethod
+ def _get_db(self):
+ """Load dataset."""
+ raise NotImplementedError
+
+ @abstractmethod
+ def _select_kpt(self, obj, kpt_id):
+ """Select kpt."""
+ raise NotImplementedError
+
+ @abstractmethod
+ def evaluate(self, cfg, preds, output_dir, *args, **kwargs):
+ """Evaluate keypoint results."""
+ raise NotImplementedError
+
+ @staticmethod
+ def _write_keypoint_results(keypoints, res_file):
+ """Write results into a json file."""
+
+ with open(res_file, 'w') as f:
+ json.dump(keypoints, f, sort_keys=True, indent=4)
+
+ def _report_metric(self,
+ res_file,
+ metrics,
+ pck_thr=0.2,
+ pckh_thr=0.7,
+ auc_nor=30):
+ """Keypoint evaluation.
+
+ Args:
+ res_file (str): Json file stored prediction results.
+ metrics (str | list[str]): Metric to be performed.
+                Options: 'PCK'. Other metric names are accepted but
+                only 'PCK' is computed by this base class.
+            pck_thr (float): PCK threshold. Default: 0.2.
+            pckh_thr (float): PCKh threshold. Default: 0.7.
+            auc_nor (float): AUC normalization factor. Default: 30 (pixels).
+
+ Returns:
+ List: Evaluation results for evaluation metric.
+ """
+ info_str = []
+
+ with open(res_file, 'r') as fin:
+ preds = json.load(fin)
+ assert len(preds) == len(self.paired_samples)
+
+ outputs = []
+ gts = []
+ masks = []
+ threshold_bbox = []
+ threshold_head_box = []
+
+ for pred, pair in zip(preds, self.paired_samples):
+ item = self.db[pair[-1]]
+ outputs.append(np.array(pred['keypoints'])[:, :-1])
+ gts.append(np.array(item['joints_3d'])[:, :-1])
+
+ mask_query = ((np.array(item['joints_3d_visible'])[:, 0]) > 0)
+ mask_sample = ((np.array(
+ self.db[pair[0]]['joints_3d_visible'])[:, 0]) > 0)
+ for id_s in pair[:-1]:
+ mask_sample = np.bitwise_and(
+ mask_sample,
+ ((np.array(self.db[id_s]['joints_3d_visible'])[:, 0]) > 0))
+ masks.append(np.bitwise_and(mask_query, mask_sample))
+
+ if 'PCK' in metrics:
+ bbox = np.array(item['bbox'])
+ bbox_thr = np.max(bbox[2:])
+ threshold_bbox.append(np.array([bbox_thr, bbox_thr]))
+ if 'PCKh' in metrics:
+ head_box_thr = item['head_size']
+ threshold_head_box.append(
+ np.array([head_box_thr, head_box_thr]))
+
+ if 'PCK' in metrics:
+ pck_avg = []
+ for (output, gt, mask, thr_bbox) in zip(outputs, gts, masks,
+ threshold_bbox):
+ _, pck, _ = keypoint_pck_accuracy(
+ np.expand_dims(output, 0), np.expand_dims(gt, 0),
+ np.expand_dims(mask, 0), pck_thr,
+ np.expand_dims(thr_bbox, 0))
+ pck_avg.append(pck)
+ info_str.append(('PCK', np.mean(pck_avg)))
+
+ return info_str
+
+ def _merge_obj(self, Xs_list, Xq, idx):
+ """merge Xs_list and Xq.
+
+ :param Xs_list: N-shot samples X
+ :param Xq: query X
+ :param idx: id of paired_samples
+ :return: Xall
+ """
+ Xall = dict()
+ Xall['img_s'] = [Xs['img'] for Xs in Xs_list]
+ Xall['target_s'] = [Xs['target'] for Xs in Xs_list]
+ Xall['target_weight_s'] = [Xs['target_weight'] for Xs in Xs_list]
+ xs_img_metas = [Xs['img_metas'].data for Xs in Xs_list]
+
+ Xall['img_q'] = Xq['img']
+ Xall['target_q'] = Xq['target']
+ Xall['target_weight_q'] = Xq['target_weight']
+ xq_img_metas = Xq['img_metas'].data
+
+ img_metas = dict()
+ for key in xq_img_metas.keys():
+ img_metas['sample_' + key] = [
+ xs_img_meta[key] for xs_img_meta in xs_img_metas
+ ]
+ img_metas['query_' + key] = xq_img_metas[key]
+ img_metas['bbox_id'] = idx
+
+ Xall['img_metas'] = DC(img_metas, cpu_only=True)
+
+ return Xall
+
+ def __len__(self):
+ """Get the size of the dataset."""
+ return len(self.paired_samples)
+
+ def __getitem__(self, idx):
+ """Get the sample given index."""
+
+        pair_ids = self.paired_samples[idx]  # [support ids..., query id]
+ assert len(pair_ids) == self.num_shots + 1
+ sample_id_list = pair_ids[:self.num_shots]
+ query_id = pair_ids[-1]
+
+ sample_obj_list = []
+ for sample_id in sample_id_list:
+ sample_obj = copy.deepcopy(self.db[sample_id])
+ sample_obj['ann_info'] = copy.deepcopy(self.ann_info)
+ sample_obj_list.append(sample_obj)
+
+ query_obj = copy.deepcopy(self.db[query_id])
+ query_obj['ann_info'] = copy.deepcopy(self.ann_info)
+
+ Xs_list = []
+ for sample_obj in sample_obj_list:
+ Xs = self.pipeline(
+ sample_obj
+ ) # dict with ['img', 'target', 'target_weight', 'img_metas'],
+ Xs_list.append(Xs) # Xs['target'] is of shape [100, map_h, map_w]
+ Xq = self.pipeline(query_obj)
+
+ Xall = self._merge_obj(Xs_list, Xq, idx)
+ Xall['skeleton'] = self.db[query_id]['skeleton']
+ return Xall
+
+ def _sort_and_unique_bboxes(self, kpts, key='bbox_id'):
+ """sort kpts and remove the repeated ones."""
+ kpts = sorted(kpts, key=lambda x: x[key])
+ num = len(kpts)
+ for i in range(num - 1, 0, -1):
+ if kpts[i][key] == kpts[i - 1][key]:
+ del kpts[i]
+
+ return kpts
diff --git a/projects/pose_anything/datasets/datasets/mp100/transformer_dataset.py b/projects/pose_anything/datasets/datasets/mp100/transformer_dataset.py
new file mode 100644
index 0000000000..3244f2a0d7
--- /dev/null
+++ b/projects/pose_anything/datasets/datasets/mp100/transformer_dataset.py
@@ -0,0 +1,342 @@
+import os
+import random
+from collections import OrderedDict
+
+import numpy as np
+from xtcocotools.coco import COCO
+
+from mmpose.datasets import DATASETS
+from .transformer_base_dataset import TransformerBaseDataset
+
+
+@DATASETS.register_module()
+class TransformerPoseDataset(TransformerBaseDataset):
+
+ def __init__(self,
+ ann_file,
+ img_prefix,
+ data_cfg,
+ pipeline,
+ valid_class_ids,
+ max_kpt_num=None,
+ num_shots=1,
+ num_queries=100,
+ num_episodes=1,
+ test_mode=False):
+ super().__init__(
+ ann_file, img_prefix, data_cfg, pipeline, test_mode=test_mode)
+
+ self.ann_info['flip_pairs'] = []
+
+ self.ann_info['upper_body_ids'] = []
+ self.ann_info['lower_body_ids'] = []
+
+ self.ann_info['use_different_joint_weights'] = False
+ self.ann_info['joint_weights'] = np.array([
+ 1.,
+ ], dtype=np.float32).reshape((self.ann_info['num_joints'], 1))
+
+ self.coco = COCO(ann_file)
+
+ self.id2name, self.name2id = self._get_mapping_id_name(self.coco.imgs)
+ self.img_ids = self.coco.getImgIds()
+ self.classes = [
+ cat['name'] for cat in self.coco.loadCats(self.coco.getCatIds())
+ ]
+
+ self.num_classes = len(self.classes)
+ self._class_to_ind = dict(zip(self.classes, self.coco.getCatIds()))
+ self._ind_to_class = dict(zip(self.coco.getCatIds(), self.classes))
+
+ if valid_class_ids is not None: # None by default
+ self.valid_class_ids = valid_class_ids
+ else:
+ self.valid_class_ids = self.coco.getCatIds()
+ self.valid_classes = [
+ self._ind_to_class[ind] for ind in self.valid_class_ids
+ ]
+
+ self.cats = self.coco.cats
+ self.max_kpt_num = max_kpt_num
+
+ # Also update self.cat2obj
+ self.db = self._get_db()
+
+ self.num_shots = num_shots
+
+ if not test_mode:
+ # Update every training epoch
+ self.random_paired_samples()
+ else:
+ self.num_queries = num_queries
+ self.num_episodes = num_episodes
+ self.make_paired_samples()
+
+ def random_paired_samples(self):
+ num_datas = [
+ len(self.cat2obj[self._class_to_ind[cls]])
+ for cls in self.valid_classes
+ ]
+
+ # balance the dataset
+ max_num_data = max(num_datas)
+
+ all_samples = []
+ for cls in self.valid_class_ids:
+ for i in range(max_num_data):
+ shot = random.sample(self.cat2obj[cls], self.num_shots + 1)
+ all_samples.append(shot)
+
+ self.paired_samples = np.array(all_samples)
+ np.random.shuffle(self.paired_samples)
+
+ def make_paired_samples(self):
+ random.seed(1)
+ np.random.seed(0)
+
+ all_samples = []
+ for cls in self.valid_class_ids:
+ for _ in range(self.num_episodes):
+ shots = random.sample(self.cat2obj[cls],
+ self.num_shots + self.num_queries)
+ sample_ids = shots[:self.num_shots]
+ query_ids = shots[self.num_shots:]
+ for query_id in query_ids:
+ all_samples.append(sample_ids + [query_id])
+
+ self.paired_samples = np.array(all_samples)
+
+ def _select_kpt(self, obj, kpt_id):
+ obj['joints_3d'] = obj['joints_3d'][kpt_id:kpt_id + 1]
+ obj['joints_3d_visible'] = obj['joints_3d_visible'][kpt_id:kpt_id + 1]
+ obj['kpt_id'] = kpt_id
+
+ return obj
+
+ @staticmethod
+ def _get_mapping_id_name(imgs):
+ """
+ Args:
+ imgs (dict): dict of image info.
+
+ Returns:
+ tuple: Image name & id mapping dicts.
+
+ - id2name (dict): Mapping image id to name.
+ - name2id (dict): Mapping image name to id.
+ """
+ id2name = {}
+ name2id = {}
+ for image_id, image in imgs.items():
+ file_name = image['file_name']
+ id2name[image_id] = file_name
+ name2id[file_name] = image_id
+
+ return id2name, name2id
+
+ def _get_db(self):
+ """Ground truth bbox and keypoints."""
+ self.obj_id = 0
+
+ self.cat2obj = {}
+ for i in self.coco.getCatIds():
+ self.cat2obj.update({i: []})
+
+ gt_db = []
+ for img_id in self.img_ids:
+ gt_db.extend(self._load_coco_keypoint_annotation_kernel(img_id))
+
+ return gt_db
+
+ def _load_coco_keypoint_annotation_kernel(self, img_id):
+ """load annotation from COCOAPI.
+
+ Note:
+ bbox:[x1, y1, w, h]
+ Args:
+ img_id: coco image id
+ Returns:
+ dict: db entry
+ """
+ img_ann = self.coco.loadImgs(img_id)[0]
+ width = img_ann['width']
+ height = img_ann['height']
+
+ ann_ids = self.coco.getAnnIds(imgIds=img_id, iscrowd=False)
+ objs = self.coco.loadAnns(ann_ids)
+
+ # sanitize bboxes
+ valid_objs = []
+ for obj in objs:
+ if 'bbox' not in obj:
+ continue
+ x, y, w, h = obj['bbox']
+ x1 = max(0, x)
+ y1 = max(0, y)
+ x2 = min(width - 1, x1 + max(0, w - 1))
+ y2 = min(height - 1, y1 + max(0, h - 1))
+ if ('area' not in obj or obj['area'] > 0) and x2 > x1 and y2 > y1:
+ obj['clean_bbox'] = [x1, y1, x2 - x1, y2 - y1]
+ valid_objs.append(obj)
+ objs = valid_objs
+
+ bbox_id = 0
+ rec = []
+ for obj in objs:
+ if 'keypoints' not in obj:
+ continue
+ if max(obj['keypoints']) == 0:
+ continue
+ if 'num_keypoints' in obj and obj['num_keypoints'] == 0:
+ continue
+
+ category_id = obj['category_id']
+            # the number of keypoints for this specific category
+ cat_kpt_num = int(len(obj['keypoints']) / 3)
+ if self.max_kpt_num is None:
+ kpt_num = cat_kpt_num
+ else:
+ kpt_num = self.max_kpt_num
+
+ joints_3d = np.zeros((kpt_num, 3), dtype=np.float32)
+ joints_3d_visible = np.zeros((kpt_num, 3), dtype=np.float32)
+
+ keypoints = np.array(obj['keypoints']).reshape(-1, 3)
+ joints_3d[:cat_kpt_num, :2] = keypoints[:, :2]
+ joints_3d_visible[:cat_kpt_num, :2] = np.minimum(
+ 1, keypoints[:, 2:3])
+
+ center, scale = self._xywh2cs(*obj['clean_bbox'][:4])
+
+ image_file = os.path.join(self.img_prefix, self.id2name[img_id])
+ if os.path.exists(image_file):
+ self.cat2obj[category_id].append(self.obj_id)
+
+ rec.append({
+ 'image_file':
+ image_file,
+ 'center':
+ center,
+ 'scale':
+ scale,
+ 'rotation':
+ 0,
+ 'bbox':
+ obj['clean_bbox'][:4],
+ 'bbox_score':
+ 1,
+ 'joints_3d':
+ joints_3d,
+ 'joints_3d_visible':
+ joints_3d_visible,
+ 'category_id':
+ category_id,
+ 'cat_kpt_num':
+ cat_kpt_num,
+ 'bbox_id':
+ self.obj_id,
+ 'skeleton':
+ self.coco.cats[obj['category_id']]['skeleton'],
+ })
+ bbox_id = bbox_id + 1
+ self.obj_id += 1
+
+ return rec
+
+ def _xywh2cs(self, x, y, w, h):
+ """This encodes bbox(x,y,w,w) into (center, scale)
+
+ Args:
+ x, y, w, h
+
+ Returns:
+ tuple: A tuple containing center and scale.
+
+ - center (np.ndarray[float32](2,)): center of the bbox (x, y).
+ - scale (np.ndarray[float32](2,)): scale of the bbox w & h.
+ """
+ aspect_ratio = self.ann_info['image_size'][0] / self.ann_info[
+ 'image_size'][1]
+ center = np.array([x + w * 0.5, y + h * 0.5], dtype=np.float32)
+ #
+ # if (not self.test_mode) and np.random.rand() < 0.3:
+ # center += 0.4 * (np.random.rand(2) - 0.5) * [w, h]
+
+ if w > aspect_ratio * h:
+ h = w * 1.0 / aspect_ratio
+ elif w < aspect_ratio * h:
+ w = h * aspect_ratio
+
+ # pixel std is 200.0
+ scale = np.array([w / 200.0, h / 200.0], dtype=np.float32)
+ # padding to include proper amount of context
+ scale = scale * 1.25
+
+ return center, scale
+
+ def evaluate(self, outputs, res_folder, metric='PCK', **kwargs):
+ """Evaluate interhand2d keypoint results. The pose prediction results
+ will be saved in `${res_folder}/result_keypoints.json`.
+
+ Note:
+ batch_size: N
+ num_keypoints: K
+ heatmap height: H
+ heatmap width: W
+
+ Args:
+ outputs (list(preds, boxes, image_path, output_heatmap))
+ :preds (np.ndarray[N,K,3]): The first two dimensions are
+ coordinates, score is the third dimension of the array.
+ :boxes (np.ndarray[N,6]): [center[0], center[1], scale[0]
+ , scale[1],area, score]
+            :image_paths (list[str]): For example,
+                ['Capture12/0390_dh_touchROM/cam410209/image62434.jpg']
+ :output_heatmap (np.ndarray[N, K, H, W]): model outputs.
+
+ res_folder (str): Path of directory to save the results.
+ metric (str | list[str]): Metric to be performed.
+                Options: 'PCK', 'AUC', 'EPE', 'NME'.
+
+ Returns:
+ dict: Evaluation results for evaluation metric.
+ """
+ metrics = metric if isinstance(metric, list) else [metric]
+ allowed_metrics = ['PCK', 'AUC', 'EPE', 'NME']
+ for metric in metrics:
+ if metric not in allowed_metrics:
+ raise KeyError(f'metric {metric} is not supported')
+
+ res_file = os.path.join(res_folder, 'result_keypoints.json')
+
+ kpts = []
+ for output in outputs:
+ preds = output['preds']
+ boxes = output['boxes']
+ image_paths = output['image_paths']
+ bbox_ids = output['bbox_ids']
+
+ batch_size = len(image_paths)
+ for i in range(batch_size):
+ image_id = self.name2id[image_paths[i][len(self.img_prefix):]]
+
+ kpts.append({
+ 'keypoints': preds[i].tolist(),
+ 'center': boxes[i][0:2].tolist(),
+ 'scale': boxes[i][2:4].tolist(),
+ 'area': float(boxes[i][4]),
+ 'score': float(boxes[i][5]),
+ 'image_id': image_id,
+ 'bbox_id': bbox_ids[i]
+ })
+ kpts = self._sort_and_unique_bboxes(kpts)
+
+ self._write_keypoint_results(kpts, res_file)
+ info_str = self._report_metric(res_file, metrics)
+ name_value = OrderedDict(info_str)
+
+ return name_value
diff --git a/projects/pose_anything/datasets/pipelines/__init__.py b/projects/pose_anything/datasets/pipelines/__init__.py
new file mode 100644
index 0000000000..762593d15b
--- /dev/null
+++ b/projects/pose_anything/datasets/pipelines/__init__.py
@@ -0,0 +1,3 @@
+from .top_down_transform import TopDownGenerateTargetFewShot
+
+__all__ = ['TopDownGenerateTargetFewShot']
diff --git a/projects/pose_anything/datasets/pipelines/post_transforms.py b/projects/pose_anything/datasets/pipelines/post_transforms.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/projects/pose_anything/datasets/pipelines/top_down_transform.py b/projects/pose_anything/datasets/pipelines/top_down_transform.py
new file mode 100644
index 0000000000..a0c0839307
--- /dev/null
+++ b/projects/pose_anything/datasets/pipelines/top_down_transform.py
@@ -0,0 +1,317 @@
+import numpy as np
+
+
+class TopDownGenerateTargetFewShot:
+ """Generate the target heatmap.
+
+ Required keys: 'joints_3d', 'joints_3d_visible', 'ann_info'.
+ Modified keys: 'target', and 'target_weight'.
+
+ Args:
+ sigma: Sigma of heatmap gaussian for 'MSRA' approach.
+ kernel: Kernel of heatmap gaussian for 'Megvii' approach.
+ encoding (str): Approach to generate target heatmaps.
+ Currently supported approaches: 'MSRA', 'Megvii', 'UDP'.
+ Default:'MSRA'
+
+ unbiased_encoding (bool): Option to use unbiased
+ encoding methods.
+ Paper ref: Zhang et al. Distribution-Aware Coordinate
+ Representation for Human Pose Estimation (CVPR 2020).
+ keypoint_pose_distance: Keypoint pose distance for UDP.
+ Paper ref: Huang et al. The Devil is in the Details: Delving into
+ Unbiased Data Processing for Human Pose Estimation (CVPR 2020).
+ target_type (str): supported targets: 'GaussianHeatMap',
+ 'CombinedTarget'. Default:'GaussianHeatMap'
+ CombinedTarget: The combination of classification target
+ (response map) and regression target (offset map).
+ Paper ref: Huang et al. The Devil is in the Details: Delving into
+ Unbiased Data Processing for Human Pose Estimation (CVPR 2020).
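+
+    Example (a minimal, illustrative sketch; the dummy keypoint and the
+    ``ann_info`` values below are assumptions, not taken from a real
+    MP-100 config):
+        >>> import numpy as np
+        >>> gen = TopDownGenerateTargetFewShot(sigma=2, encoding='MSRA')
+        >>> results = dict(
+        ...     joints_3d=np.array([[64., 64., 0.]], dtype=np.float32),
+        ...     joints_3d_visible=np.array([[1., 1., 0.]],
+        ...                                dtype=np.float32),
+        ...     ann_info=dict(
+        ...         image_size=np.array([256, 256]),
+        ...         heatmap_size=np.array([64, 64]),
+        ...         joint_weights=np.ones((1, 1), dtype=np.float32),
+        ...         use_different_joint_weights=False))
+        >>> results = gen(results)
+        >>> results['target'].shape  # one joint -> (K, H, W)
+        (1, 64, 64)
+        >>> results['target_weight'].shape
+        (1, 1)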
+ """
+
+ def __init__(self,
+ sigma=2,
+ kernel=(11, 11),
+ valid_radius_factor=0.0546875,
+ target_type='GaussianHeatMap',
+ encoding='MSRA',
+ unbiased_encoding=False):
+ self.sigma = sigma
+ self.unbiased_encoding = unbiased_encoding
+ self.kernel = kernel
+ self.valid_radius_factor = valid_radius_factor
+ self.target_type = target_type
+ self.encoding = encoding
+
+ def _msra_generate_target(self, cfg, joints_3d, joints_3d_visible, sigma):
+ """Generate the target heatmap via "MSRA" approach.
+
+ Args:
+ cfg (dict): data config
+ joints_3d: np.ndarray ([num_joints, 3])
+ joints_3d_visible: np.ndarray ([num_joints, 3])
+ sigma: Sigma of heatmap gaussian
+ Returns:
+ tuple: A tuple containing targets.
+
+ - target: Target heatmaps.
+ - target_weight: (1: visible, 0: invisible)
+ """
+ num_joints = len(joints_3d)
+ image_size = cfg['image_size']
+ W, H = cfg['heatmap_size']
+ joint_weights = cfg['joint_weights']
+ use_different_joint_weights = cfg['use_different_joint_weights']
+ assert not use_different_joint_weights
+
+ target_weight = np.zeros((num_joints, 1), dtype=np.float32)
+ target = np.zeros((num_joints, H, W), dtype=np.float32)
+
+ # 3-sigma rule
+ tmp_size = sigma * 3
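+        # tmp_size bounds the rendered gaussian: keypoints whose 3-sigma
+        # window falls entirely outside the heatmap get zero target weight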
+
+ if self.unbiased_encoding:
+ for joint_id in range(num_joints):
+ target_weight[joint_id] = joints_3d_visible[joint_id, 0]
+
+ feat_stride = image_size / [W, H]
+ mu_x = joints_3d[joint_id][0] / feat_stride[0]
+ mu_y = joints_3d[joint_id][1] / feat_stride[1]
+ # Check that any part of the gaussian is in-bounds
+ ul = [mu_x - tmp_size, mu_y - tmp_size]
+ br = [mu_x + tmp_size + 1, mu_y + tmp_size + 1]
+ if ul[0] >= W or ul[1] >= H or br[0] < 0 or br[1] < 0:
+ target_weight[joint_id] = 0
+
+ if target_weight[joint_id] == 0:
+ continue
+
+ x = np.arange(0, W, 1, np.float32)
+ y = np.arange(0, H, 1, np.float32)
+ y = y[:, None]
+
+ if target_weight[joint_id] > 0.5:
+ target[joint_id] = np.exp(-((x - mu_x)**2 +
+ (y - mu_y)**2) /
+ (2 * sigma**2))
+ else:
+ for joint_id in range(num_joints):
+ target_weight[joint_id] = joints_3d_visible[joint_id, 0]
+
+ feat_stride = image_size / [W, H]
+ mu_x = int(joints_3d[joint_id][0] / feat_stride[0] + 0.5)
+ mu_y = int(joints_3d[joint_id][1] / feat_stride[1] + 0.5)
+ # Check that any part of the gaussian is in-bounds
+ ul = [int(mu_x - tmp_size), int(mu_y - tmp_size)]
+ br = [int(mu_x + tmp_size + 1), int(mu_y + tmp_size + 1)]
+ if ul[0] >= W or ul[1] >= H or br[0] < 0 or br[1] < 0:
+ target_weight[joint_id] = 0
+
+ if target_weight[joint_id] > 0.5:
+ size = 2 * tmp_size + 1
+ x = np.arange(0, size, 1, np.float32)
+ y = x[:, None]
+ x0 = y0 = size // 2
+ # The gaussian is not normalized,
+ # we want the center value to equal 1
+ g = np.exp(-((x - x0)**2 + (y - y0)**2) / (2 * sigma**2))
+
+ # Usable gaussian range
+ g_x = max(0, -ul[0]), min(br[0], W) - ul[0]
+ g_y = max(0, -ul[1]), min(br[1], H) - ul[1]
+ # Image range
+ img_x = max(0, ul[0]), min(br[0], W)
+ img_y = max(0, ul[1]), min(br[1], H)
+
+ target[joint_id][img_y[0]:img_y[1], img_x[0]:img_x[1]] = \
+ g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
+
+ if use_different_joint_weights:
+ target_weight = np.multiply(target_weight, joint_weights)
+
+ return target, target_weight
+
+ def _udp_generate_target(self, cfg, joints_3d, joints_3d_visible, factor,
+ target_type):
+ """Generate the target heatmap via 'UDP' approach. Paper ref: Huang et
+ al. The Devil is in the Details: Delving into Unbiased Data Processing
+ for Human Pose Estimation (CVPR 2020).
+
+ Note:
+ num keypoints: K
+ heatmap height: H
+ heatmap width: W
+ num target channels: C
+ C = K if target_type=='GaussianHeatMap'
+ C = 3*K if target_type=='CombinedTarget'
+
+ Args:
+ cfg (dict): data config
+ joints_3d (np.ndarray[K, 3]): Annotated keypoints.
+ joints_3d_visible (np.ndarray[K, 3]): Visibility of keypoints.
+ factor (float): kernel factor for GaussianHeatMap target or
+ valid radius factor for CombinedTarget.
+ target_type (str): 'GaussianHeatMap' or 'CombinedTarget'.
+ GaussianHeatMap: Heatmap target with gaussian distribution.
+ CombinedTarget: The combination of classification target
+ (response map) and regression target (offset map).
+
+ Returns:
+ tuple: A tuple containing targets.
+
+ - target (np.ndarray[C, H, W]): Target heatmaps.
+ - target_weight (np.ndarray[K, 1]): (1: visible, 0: invisible)
+ """
+ num_joints = len(joints_3d)
+ image_size = cfg['image_size']
+ heatmap_size = cfg['heatmap_size']
+ joint_weights = cfg['joint_weights']
+ use_different_joint_weights = cfg['use_different_joint_weights']
+ assert not use_different_joint_weights
+
+ target_weight = np.ones((num_joints, 1), dtype=np.float32)
+ target_weight[:, 0] = joints_3d_visible[:, 0]
+
+ assert target_type in ['GaussianHeatMap', 'CombinedTarget']
+
+ if target_type == 'GaussianHeatMap':
+ target = np.zeros((num_joints, heatmap_size[1], heatmap_size[0]),
+ dtype=np.float32)
+
+ tmp_size = factor * 3
+
+ # prepare for gaussian
+ size = 2 * tmp_size + 1
+ x = np.arange(0, size, 1, np.float32)
+ y = x[:, None]
+
+ for joint_id in range(num_joints):
+ feat_stride = (image_size - 1.0) / (heatmap_size - 1.0)
+ mu_x = int(joints_3d[joint_id][0] / feat_stride[0] + 0.5)
+ mu_y = int(joints_3d[joint_id][1] / feat_stride[1] + 0.5)
+ # Check that any part of the gaussian is in-bounds
+ ul = [int(mu_x - tmp_size), int(mu_y - tmp_size)]
+ br = [int(mu_x + tmp_size + 1), int(mu_y + tmp_size + 1)]
+ if ul[0] >= heatmap_size[0] or ul[1] >= heatmap_size[1] \
+ or br[0] < 0 or br[1] < 0:
+                    # the gaussian lies entirely outside the heatmap;
+                    # skip this joint
+ target_weight[joint_id] = 0
+ continue
+
+                # Generate gaussian
+ mu_x_ac = joints_3d[joint_id][0] / feat_stride[0]
+ mu_y_ac = joints_3d[joint_id][1] / feat_stride[1]
+ x0 = y0 = size // 2
+ x0 += mu_x_ac - mu_x
+ y0 += mu_y_ac - mu_y
+ g = np.exp(-((x - x0)**2 + (y - y0)**2) / (2 * factor**2))
+
+ # Usable gaussian range
+ g_x = max(0, -ul[0]), min(br[0], heatmap_size[0]) - ul[0]
+ g_y = max(0, -ul[1]), min(br[1], heatmap_size[1]) - ul[1]
+ # Image range
+ img_x = max(0, ul[0]), min(br[0], heatmap_size[0])
+ img_y = max(0, ul[1]), min(br[1], heatmap_size[1])
+
+ v = target_weight[joint_id]
+ if v > 0.5:
+ target[joint_id][img_y[0]:img_y[1], img_x[0]:img_x[1]] = \
+ g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
+ elif target_type == 'CombinedTarget':
+ target = np.zeros(
+ (num_joints, 3, heatmap_size[1] * heatmap_size[0]),
+ dtype=np.float32)
+ feat_width = heatmap_size[0]
+ feat_height = heatmap_size[1]
+ feat_x_int = np.arange(0, feat_width)
+ feat_y_int = np.arange(0, feat_height)
+ feat_x_int, feat_y_int = np.meshgrid(feat_x_int, feat_y_int)
+ feat_x_int = feat_x_int.flatten()
+ feat_y_int = feat_y_int.flatten()
+ # Calculate the radius of the positive area in classification
+ # heatmap.
+ valid_radius = factor * heatmap_size[1]
+ feat_stride = (image_size - 1.0) / (heatmap_size - 1.0)
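+            # per joint, channel 0 is a binary response map (1 inside
+            # `valid_radius` of the keypoint) and channels 1/2 store the
+            # x/y offsets to the keypoint, normalized by `valid_radius`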
+ for joint_id in range(num_joints):
+ mu_x = joints_3d[joint_id][0] / feat_stride[0]
+ mu_y = joints_3d[joint_id][1] / feat_stride[1]
+ x_offset = (mu_x - feat_x_int) / valid_radius
+ y_offset = (mu_y - feat_y_int) / valid_radius
+ dis = x_offset**2 + y_offset**2
+ keep_pos = np.where(dis <= 1)[0]
+ v = target_weight[joint_id]
+ if v > 0.5:
+ target[joint_id, 0, keep_pos] = 1
+ target[joint_id, 1, keep_pos] = x_offset[keep_pos]
+ target[joint_id, 2, keep_pos] = y_offset[keep_pos]
+ target = target.reshape(num_joints * 3, heatmap_size[1],
+ heatmap_size[0])
+
+ if use_different_joint_weights:
+ target_weight = np.multiply(target_weight, joint_weights)
+
+ return target, target_weight
+
+ def __call__(self, results):
+ """Generate the target heatmap."""
+ joints_3d = results['joints_3d']
+ joints_3d_visible = results['joints_3d_visible']
+
+ assert self.encoding in ['MSRA', 'UDP']
+
+ if self.encoding == 'MSRA':
+ if isinstance(self.sigma, list):
+ num_sigmas = len(self.sigma)
+ cfg = results['ann_info']
+ num_joints = len(joints_3d)
+ heatmap_size = cfg['heatmap_size']
+
+ target = np.empty(
+ (0, num_joints, heatmap_size[1], heatmap_size[0]),
+ dtype=np.float32)
+ target_weight = np.empty((0, num_joints, 1), dtype=np.float32)
+ for i in range(num_sigmas):
+ target_i, target_weight_i = self._msra_generate_target(
+ cfg, joints_3d, joints_3d_visible, self.sigma[i])
+ target = np.concatenate([target, target_i[None]], axis=0)
+ target_weight = np.concatenate(
+ [target_weight, target_weight_i[None]], axis=0)
+ else:
+ target, target_weight = self._msra_generate_target(
+ results['ann_info'], joints_3d, joints_3d_visible,
+ self.sigma)
+ elif self.encoding == 'UDP':
+ if self.target_type == 'CombinedTarget':
+ factors = self.valid_radius_factor
+ channel_factor = 3
+ elif self.target_type == 'GaussianHeatMap':
+ factors = self.sigma
+ channel_factor = 1
+ if isinstance(factors, list):
+ num_factors = len(factors)
+ cfg = results['ann_info']
+ num_joints = len(joints_3d)
+ W, H = cfg['heatmap_size']
+
+ target = np.empty((0, channel_factor * num_joints, H, W),
+ dtype=np.float32)
+ target_weight = np.empty((0, num_joints, 1), dtype=np.float32)
+ for i in range(num_factors):
+ target_i, target_weight_i = self._udp_generate_target(
+ cfg, joints_3d, joints_3d_visible, factors[i],
+ self.target_type)
+ target = np.concatenate([target, target_i[None]], axis=0)
+ target_weight = np.concatenate(
+ [target_weight, target_weight_i[None]], axis=0)
+ else:
+ target, target_weight = self._udp_generate_target(
+ results['ann_info'], joints_3d, joints_3d_visible, factors,
+ self.target_type)
+ else:
+ raise ValueError(
+ f'Encoding approach {self.encoding} is not supported!')
+
+ results['target'] = target
+ results['target_weight'] = target_weight
+
+ return results
diff --git a/projects/pose_anything/demo.py b/projects/pose_anything/demo.py
new file mode 100644
index 0000000000..64490a163b
--- /dev/null
+++ b/projects/pose_anything/demo.py
@@ -0,0 +1,291 @@
+import argparse
+import copy
+import os
+import random
+
+import cv2
+import numpy as np
+import torch
+import torchvision.transforms.functional as F
+from datasets.pipelines import TopDownGenerateTargetFewShot
+from mmcv.cnn import fuse_conv_bn
+from mmengine.config import Config, DictAction
+from mmengine.runner import load_checkpoint
+from torchvision import transforms
+
+from mmpose.models import build_pose_estimator
+from tools.visualization import COLORS, plot_results
+
+
+class ResizePad:
+
+ def __init__(self, w=256, h=256):
+ self.w = w
+ self.h = h
+
+ def __call__(self, image):
+        # image is a CHW tensor, so the trailing dims are (height, width)
+        _, img_h, img_w = image.shape
+        ratio_1 = img_h / img_w
+        # pad the shorter side (approximately) to a square before resizing,
+        # so the original aspect ratio is preserved
+        if round(ratio_1, 2) != 1:
+            if ratio_1 > 1:  # taller than wide: pad left and right
+                wp = (img_h - img_w) // 2
+                image = F.pad(image, (wp, 0, wp, 0), 0, 'constant')
+                return F.resize(image, [self.h, self.w])
+            else:  # wider than tall: pad top and bottom
+                hp = (img_w - img_h) // 2
+                image = F.pad(image, (0, hp, 0, hp), 0, 'constant')
+                return F.resize(image, [self.h, self.w])
+        else:
+            return F.resize(image, [self.h, self.w])
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Pose Anything Demo')
+    parser.add_argument('--support', help='Support image file')
+    parser.add_argument('--query', help='Query image file')
+ parser.add_argument(
+ '--config', default='configs/demo.py', help='test config file path')
+ parser.add_argument(
+ '--checkpoint', default='pretrained', help='checkpoint file')
+    parser.add_argument('--outdir', default='output', help='output directory')
+
+ parser.add_argument(
+ '--fuse-conv-bn',
+ action='store_true',
+        help='Whether to fuse conv and bn, this will slightly increase '
+        'the inference speed')
+ parser.add_argument(
+ '--cfg-options',
+ nargs='+',
+ action=DictAction,
+ default={},
+ help='override some settings in the used config, the key-value pair '
+ 'in xxx=yyy format will be merged into config file. For example, '
+ "'--cfg-options model.backbone.depth=18 "
+ "model.backbone.with_cp=True'")
+ args = parser.parse_args()
+ return args
+
+
+def merge_configs(cfg1, cfg2):
+ # Merge cfg2 into cfg1
+ # Overwrite cfg1 if repeated, ignore if value is None.
+ cfg1 = {} if cfg1 is None else cfg1.copy()
+ cfg2 = {} if cfg2 is None else cfg2
+ for k, v in cfg2.items():
+ if v:
+ cfg1[k] = v
+ return cfg1
+
+
+def main():
+ random.seed(0)
+ np.random.seed(0)
+ torch.manual_seed(0)
+
+ args = parse_args()
+ cfg = Config.fromfile(args.config)
+
+ if args.cfg_options is not None:
+ cfg.merge_from_dict(args.cfg_options)
+ # set cudnn_benchmark
+ if cfg.get('cudnn_benchmark', False):
+ torch.backends.cudnn.benchmark = True
+ cfg.data.test.test_mode = True
+
+ os.makedirs(args.outdir, exist_ok=True)
+
+ # Load data
+ support_img = cv2.imread(args.support)
+ query_img = cv2.imread(args.query)
+ if support_img is None or query_img is None:
+        raise ValueError('Failed to read the support/query images')
+
+ preprocess = transforms.Compose([
+ transforms.ToTensor(),
+ ResizePad(cfg.model.encoder_config.img_size,
+ cfg.model.encoder_config.img_size)
+ ])
+
+ padded_support_img = preprocess(support_img).cpu().numpy().transpose(
+ 1, 2, 0) * 255
+ frame = copy.deepcopy(padded_support_img.astype(np.uint8).copy())
+ kp_src = []
+ skeleton = []
+ count = 0
+ prev_pt = None
+ prev_pt_idx = None
+ color_idx = 0
+
+ def selectKP(event, x, y, flags, param):
+ nonlocal kp_src, frame
+        # on a left click, record the (x, y) location of the click in the
+        # keypoint list and draw a circle at that position;
+        # a right click resets the selection
+
+ if event == cv2.EVENT_LBUTTONDOWN:
+ kp_src.append((x, y))
+ cv2.circle(frame, (x, y), 2, (0, 0, 255), 1)
+ cv2.imshow('Source', frame)
+
+ if event == cv2.EVENT_RBUTTONDOWN:
+ kp_src = []
+ frame = copy.deepcopy(support_img)
+ cv2.imshow('Source', frame)
+
+ def draw_line(event, x, y, flags, param):
+ nonlocal skeleton, kp_src, frame, count, prev_pt, prev_pt_idx, \
+ marked_frame, color_idx
+ if event == cv2.EVENT_LBUTTONDOWN:
+ closest_point = min(
+ kp_src, key=lambda p: (p[0] - x)**2 + (p[1] - y)**2)
+ closest_point_index = kp_src.index(closest_point)
+ if color_idx < len(COLORS):
+ c = COLORS[color_idx]
+ else:
+ c = random.choices(range(256), k=3)
+
+ cv2.circle(frame, closest_point, 2, c, 1)
+ if count == 0:
+ prev_pt = closest_point
+ prev_pt_idx = closest_point_index
+ count = count + 1
+ cv2.imshow('Source', frame)
+ else:
+ cv2.line(frame, prev_pt, closest_point, c, 2)
+ cv2.imshow('Source', frame)
+ count = 0
+ skeleton.append((prev_pt_idx, closest_point_index))
+ color_idx = color_idx + 1
+ elif event == cv2.EVENT_RBUTTONDOWN:
+ frame = copy.deepcopy(marked_frame)
+ cv2.imshow('Source', frame)
+ count = 0
+ color_idx = 0
+ skeleton = []
+ prev_pt = None
+
+ cv2.namedWindow('Source', cv2.WINDOW_NORMAL)
+ cv2.resizeWindow('Source', 800, 600)
+ cv2.setMouseCallback('Source', selectKP)
+ cv2.imshow('Source', frame)
+
+ # keep looping until points have been selected
+ print('Press any key when finished marking the points!! ')
+ while True:
+ if cv2.waitKey(1) > 0:
+ break
+
+ marked_frame = copy.deepcopy(frame)
+ cv2.setMouseCallback('Source', draw_line)
+ print('Press any key when finished creating skeleton!!')
+ while True:
+ if cv2.waitKey(1) > 0:
+ break
+
+ cv2.destroyAllWindows()
+ kp_src = torch.tensor(kp_src).float()
+ preprocess = transforms.Compose([
+ transforms.ToTensor(),
+ transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
+ ResizePad(cfg.model.encoder_config.img_size,
+ cfg.model.encoder_config.img_size)
+ ])
+
+ if len(skeleton) == 0:
+ skeleton = [(0, 0)]
+
+ support_img = preprocess(support_img).flip(0)[None]
+ query_img = preprocess(query_img).flip(0)[None]
+ # Create heatmap from keypoints
+ genHeatMap = TopDownGenerateTargetFewShot()
+ data_cfg = cfg.data_cfg
+ data_cfg['image_size'] = np.array(
+ [cfg.model.encoder_config.img_size, cfg.model.encoder_config.img_size])
+ data_cfg['joint_weights'] = None
+ data_cfg['use_different_joint_weights'] = False
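+    # the MSRA target generator expects (x, y, z) joints and per-joint
+    # visibility weights; append a zero z-coordinate and weight the x/y
+    # channels of every selected support keypoint as visible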
+ kp_src_3d = torch.cat((kp_src, torch.zeros(kp_src.shape[0], 1)), dim=-1)
+ kp_src_3d_weight = torch.cat(
+ (torch.ones_like(kp_src), torch.zeros(kp_src.shape[0], 1)), dim=-1)
+ target_s, target_weight_s = genHeatMap._msra_generate_target(
+ data_cfg, kp_src_3d, kp_src_3d_weight, sigma=2)
+ target_s = torch.tensor(target_s).float()[None]
+ target_weight_s = torch.tensor(target_weight_s).float()[None]
+
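+    # pack support/query images, the support heatmap targets and metadata in
+    # the format expected by the few-shot estimator; the query-side entries
+    # simply mirror the support keypoints since no query annotation exists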
+ data = {
+ 'img_s': [support_img],
+ 'img_q':
+ query_img,
+ 'target_s': [target_s],
+ 'target_weight_s': [target_weight_s],
+ 'target_q':
+ None,
+ 'target_weight_q':
+ None,
+ 'return_loss':
+ False,
+ 'img_metas': [{
+ 'sample_skeleton': [skeleton],
+ 'query_skeleton':
+ skeleton,
+ 'sample_joints_3d': [kp_src_3d],
+ 'query_joints_3d':
+ kp_src_3d,
+ 'sample_center': [kp_src.mean(dim=0)],
+ 'query_center':
+ kp_src.mean(dim=0),
+ 'sample_scale': [kp_src.max(dim=0)[0] - kp_src.min(dim=0)[0]],
+ 'query_scale':
+ kp_src.max(dim=0)[0] - kp_src.min(dim=0)[0],
+ 'sample_rotation': [0],
+ 'query_rotation':
+ 0,
+ 'sample_bbox_score': [1],
+ 'query_bbox_score':
+ 1,
+ 'query_image_file':
+ '',
+ 'sample_image_file': [''],
+ }]
+ }
+
+ # Load model
+ model = build_pose_estimator(cfg.model)
+ load_checkpoint(model, args.checkpoint, map_location='cpu')
+ if args.fuse_conv_bn:
+ model = fuse_conv_bn(model)
+ model.eval()
+
+ with torch.no_grad():
+ outputs = model(**data)
+
+ # visualize results
+ vis_s_weight = target_weight_s[0]
+ vis_q_weight = target_weight_s[0]
+ vis_s_image = support_img[0].detach().cpu().numpy().transpose(1, 2, 0)
+ vis_q_image = query_img[0].detach().cpu().numpy().transpose(1, 2, 0)
+ support_kp = kp_src_3d
+
+ plot_results(
+ vis_s_image,
+ vis_q_image,
+ support_kp,
+ vis_s_weight,
+ None,
+ vis_q_weight,
+ skeleton,
+ None,
+ torch.tensor(outputs['points']).squeeze(0),
+ out_dir=args.outdir)
+
+ print('Output saved to output dir: {}'.format(args.outdir))
+
+
+if __name__ == '__main__':
+ main()
diff --git a/projects/pose_anything/models/__init__.py b/projects/pose_anything/models/__init__.py
new file mode 100644
index 0000000000..4c2fedaea2
--- /dev/null
+++ b/projects/pose_anything/models/__init__.py
@@ -0,0 +1,4 @@
+from .backbones import * # noqa
+from .detectors import * # noqa
+from .keypoint_heads import * # noqa
+from .utils import * # noqa
diff --git a/projects/pose_anything/models/backbones/__init__.py b/projects/pose_anything/models/backbones/__init__.py
new file mode 100644
index 0000000000..c23ac1d15c
--- /dev/null
+++ b/projects/pose_anything/models/backbones/__init__.py
@@ -0,0 +1 @@
+from .swin_transformer_v2 import SwinTransformerV2 # noqa
diff --git a/projects/pose_anything/models/backbones/simmim.py b/projects/pose_anything/models/backbones/simmim.py
new file mode 100644
index 0000000000..189011891a
--- /dev/null
+++ b/projects/pose_anything/models/backbones/simmim.py
@@ -0,0 +1,235 @@
+# --------------------------------------------------------
+# SimMIM
+# Copyright (c) 2021 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Zhenda Xie
+# --------------------------------------------------------
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from timm.models.layers import trunc_normal_
+
+from .swin_transformer import SwinTransformer
+from .swin_transformer_v2 import SwinTransformerV2
+
+
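+# Patch-wise target normalization used by SimMIM: every pixel of the
+# reconstruction target is standardized by the mean and (Bessel-corrected)
+# variance of its (patch_size x patch_size) neighborhood, computed with
+# average pooling.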
+def norm_targets(targets, patch_size):
+ assert patch_size % 2 == 1
+
+ targets_ = targets
+ targets_count = torch.ones_like(targets)
+
+ targets_square = targets**2.
+
+ targets_mean = F.avg_pool2d(
+ targets,
+ kernel_size=patch_size,
+ stride=1,
+ padding=patch_size // 2,
+ count_include_pad=False)
+ targets_square_mean = F.avg_pool2d(
+ targets_square,
+ kernel_size=patch_size,
+ stride=1,
+ padding=patch_size // 2,
+ count_include_pad=False)
+ targets_count = F.avg_pool2d(
+ targets_count,
+ kernel_size=patch_size,
+ stride=1,
+ padding=patch_size // 2,
+ count_include_pad=True) * (
+ patch_size**2)
+
+ targets_var = (targets_square_mean - targets_mean**2.) * (
+ targets_count / (targets_count - 1))
+ targets_var = torch.clamp(targets_var, min=0.)
+
+ targets_ = (targets_ - targets_mean) / (targets_var + 1.e-6)**0.5
+
+ return targets_
+
+
+class SwinTransformerForSimMIM(SwinTransformer):
+
+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
+
+ assert self.num_classes == 0
+
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
+ trunc_normal_(self.mask_token, mean=0., std=.02)
+
+ def forward(self, x, mask):
+ x = self.patch_embed(x)
+
+ assert mask is not None
+ B, L, _ = x.shape
+
+ mask_tokens = self.mask_token.expand(B, L, -1)
+ w = mask.flatten(1).unsqueeze(-1).type_as(mask_tokens)
+ x = x * (1. - w) + mask_tokens * w
+
+ if self.ape:
+ x = x + self.absolute_pos_embed
+ x = self.pos_drop(x)
+
+ for layer in self.layers:
+ x = layer(x)
+ x = self.norm(x)
+
+ x = x.transpose(1, 2)
+ B, C, L = x.shape
+ H = W = int(L**0.5)
+ x = x.reshape(B, C, H, W)
+ return x
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ return super().no_weight_decay() | {'mask_token'}
+
+
+class SwinTransformerV2ForSimMIM(SwinTransformerV2):
+
+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
+
+ assert self.num_classes == 0
+
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
+ trunc_normal_(self.mask_token, mean=0., std=.02)
+
+ def forward(self, x, mask):
+ x = self.patch_embed(x)
+
+ assert mask is not None
+ B, L, _ = x.shape
+
+ mask_tokens = self.mask_token.expand(B, L, -1)
+ w = mask.flatten(1).unsqueeze(-1).type_as(mask_tokens)
+ x = x * (1. - w) + mask_tokens * w
+
+ if self.ape:
+ x = x + self.absolute_pos_embed
+ x = self.pos_drop(x)
+
+ for layer in self.layers:
+ x = layer(x)
+ x = self.norm(x)
+
+ x = x.transpose(1, 2)
+ B, C, L = x.shape
+ H = W = int(L**0.5)
+ x = x.reshape(B, C, H, W)
+ return x
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ return super().no_weight_decay() | {'mask_token'}
+
+
+class SimMIM(nn.Module):
+
+ def __init__(self, config, encoder, encoder_stride, in_chans, patch_size):
+ super().__init__()
+ self.config = config
+ self.encoder = encoder
+ self.encoder_stride = encoder_stride
+
+ self.decoder = nn.Sequential(
+ nn.Conv2d(
+ in_channels=self.encoder.num_features,
+ out_channels=self.encoder_stride**2 * 3,
+ kernel_size=1),
+ nn.PixelShuffle(self.encoder_stride),
+ )
+
+ self.in_chans = in_chans
+ self.patch_size = patch_size
+
+ def forward(self, x, mask):
+ z = self.encoder(x, mask)
+ x_rec = self.decoder(z)
+
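+        # upsample the patch-level mask to pixel resolution so it aligns with
+        # the input and the reconstruction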
+ mask = mask.repeat_interleave(self.patch_size, 1).repeat_interleave(
+ self.patch_size, 2).unsqueeze(1).contiguous()
+
+ # norm target as prompted
+ if self.config.NORM_TARGET.ENABLE:
+ x = norm_targets(x, self.config.NORM_TARGET.PATCH_SIZE)
+
+ loss_recon = F.l1_loss(x, x_rec, reduction='none')
+ loss = (loss_recon * mask).sum() / (mask.sum() + 1e-5) / self.in_chans
+ return loss
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ if hasattr(self.encoder, 'no_weight_decay'):
+ return {'encoder.' + i for i in self.encoder.no_weight_decay()}
+ return {}
+
+ @torch.jit.ignore
+ def no_weight_decay_keywords(self):
+ if hasattr(self.encoder, 'no_weight_decay_keywords'):
+ return {
+ 'encoder.' + i
+ for i in self.encoder.no_weight_decay_keywords()
+ }
+ return {}
+
+
+def build_simmim(config):
+ model_type = config.MODEL.TYPE
+ if model_type == 'swin':
+ encoder = SwinTransformerForSimMIM(
+ img_size=config.DATA.IMG_SIZE,
+ patch_size=config.MODEL.SWIN.PATCH_SIZE,
+ in_chans=config.MODEL.SWIN.IN_CHANS,
+ num_classes=0,
+ embed_dim=config.MODEL.SWIN.EMBED_DIM,
+ depths=config.MODEL.SWIN.DEPTHS,
+ num_heads=config.MODEL.SWIN.NUM_HEADS,
+ window_size=config.MODEL.SWIN.WINDOW_SIZE,
+ mlp_ratio=config.MODEL.SWIN.MLP_RATIO,
+ qkv_bias=config.MODEL.SWIN.QKV_BIAS,
+ qk_scale=config.MODEL.SWIN.QK_SCALE,
+ drop_rate=config.MODEL.DROP_RATE,
+ drop_path_rate=config.MODEL.DROP_PATH_RATE,
+ ape=config.MODEL.SWIN.APE,
+ patch_norm=config.MODEL.SWIN.PATCH_NORM,
+ use_checkpoint=config.TRAIN.USE_CHECKPOINT)
+ encoder_stride = 32
+ in_chans = config.MODEL.SWIN.IN_CHANS
+ patch_size = config.MODEL.SWIN.PATCH_SIZE
+ elif model_type == 'swinv2':
+ encoder = SwinTransformerV2ForSimMIM(
+ img_size=config.DATA.IMG_SIZE,
+ patch_size=config.MODEL.SWINV2.PATCH_SIZE,
+ in_chans=config.MODEL.SWINV2.IN_CHANS,
+ num_classes=0,
+ embed_dim=config.MODEL.SWINV2.EMBED_DIM,
+ depths=config.MODEL.SWINV2.DEPTHS,
+ num_heads=config.MODEL.SWINV2.NUM_HEADS,
+ window_size=config.MODEL.SWINV2.WINDOW_SIZE,
+ mlp_ratio=config.MODEL.SWINV2.MLP_RATIO,
+ qkv_bias=config.MODEL.SWINV2.QKV_BIAS,
+ drop_rate=config.MODEL.DROP_RATE,
+ drop_path_rate=config.MODEL.DROP_PATH_RATE,
+ ape=config.MODEL.SWINV2.APE,
+ patch_norm=config.MODEL.SWINV2.PATCH_NORM,
+ use_checkpoint=config.TRAIN.USE_CHECKPOINT)
+ encoder_stride = 32
+ in_chans = config.MODEL.SWINV2.IN_CHANS
+ patch_size = config.MODEL.SWINV2.PATCH_SIZE
+ else:
+ raise NotImplementedError(f'Unknown pre-train model: {model_type}')
+
+ model = SimMIM(
+ config=config.MODEL.SIMMIM,
+ encoder=encoder,
+ encoder_stride=encoder_stride,
+ in_chans=in_chans,
+ patch_size=patch_size)
+
+ return model
diff --git a/projects/pose_anything/models/backbones/swin_mlp.py b/projects/pose_anything/models/backbones/swin_mlp.py
new file mode 100644
index 0000000000..0442c2fab7
--- /dev/null
+++ b/projects/pose_anything/models/backbones/swin_mlp.py
@@ -0,0 +1,565 @@
+# --------------------------------------------------------
+# Swin Transformer
+# Copyright (c) 2021 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ze Liu
+# --------------------------------------------------------
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.utils.checkpoint as checkpoint
+from timm.models.layers import DropPath, to_2tuple, trunc_normal_
+
+
+class Mlp(nn.Module):
+
+ def __init__(self,
+ in_features,
+ hidden_features=None,
+ out_features=None,
+ act_layer=nn.GELU,
+ drop=0.):
+ super().__init__()
+ out_features = out_features or in_features
+ hidden_features = hidden_features or in_features
+ self.fc1 = nn.Linear(in_features, hidden_features)
+ self.act = act_layer()
+ self.fc2 = nn.Linear(hidden_features, out_features)
+ self.drop = nn.Dropout(drop)
+
+ def forward(self, x):
+ x = self.fc1(x)
+ x = self.act(x)
+ x = self.drop(x)
+ x = self.fc2(x)
+ x = self.drop(x)
+ return x
+
+
+def window_partition(x, window_size):
+ """
+ Args:
+ x: (B, H, W, C)
+ window_size (int): window size
+
+ Returns:
+ windows: (num_windows*B, window_size, window_size, C)
+ """
+ B, H, W, C = x.shape
+ x = x.view(B, H // window_size, window_size, W // window_size, window_size,
+ C)
+ windows = x.permute(0, 1, 3, 2, 4,
+ 5).contiguous().view(-1, window_size, window_size, C)
+ return windows
+
+
+def window_reverse(windows, window_size, H, W):
+ """
+ Args:
+ windows: (num_windows*B, window_size, window_size, C)
+ window_size (int): Window size
+ H (int): Height of image
+ W (int): Width of image
+
+ Returns:
+ x: (B, H, W, C)
+ """
+ B = int(windows.shape[0] / (H * W / window_size / window_size))
+ x = windows.view(B, H // window_size, W // window_size, window_size,
+ window_size, -1)
+ x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
+ return x
+
+
+class SwinMLPBlock(nn.Module):
+ r""" Swin MLP Block.
+
+    Args:
+        dim (int): Number of input channels.
+        input_resolution (tuple[int]): Input resolution.
+        num_heads (int): Number of attention heads.
+        window_size (int): Window size.
+        shift_size (int): Shift size for SW-MSA.
+        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+        drop (float, optional): Dropout rate. Default: 0.0
+        drop_path (float, optional): Stochastic depth rate. Default: 0.0
+        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
+        norm_layer (nn.Module, optional): Normalization layer.
+            Default: nn.LayerNorm
+    """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ num_heads,
+ window_size=7,
+ shift_size=0,
+ mlp_ratio=4.,
+ drop=0.,
+ drop_path=0.,
+ act_layer=nn.GELU,
+ norm_layer=nn.LayerNorm):
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.num_heads = num_heads
+ self.window_size = window_size
+ self.shift_size = shift_size
+ self.mlp_ratio = mlp_ratio
+ if min(self.input_resolution) <= self.window_size:
+ # if window size is larger than input resolution, we don't
+ # partition windows
+ self.shift_size = 0
+ self.window_size = min(self.input_resolution)
+        assert 0 <= self.shift_size < self.window_size, (
+            'shift_size must be in [0, window_size)')
+
+ self.padding = [
+ self.window_size - self.shift_size, self.shift_size,
+ self.window_size - self.shift_size, self.shift_size
+ ] # P_l,P_r,P_t,P_b
+
+ self.norm1 = norm_layer(dim)
+ # use group convolution to implement multi-head MLP
+ self.spatial_mlp = nn.Conv1d(
+ self.num_heads * self.window_size**2,
+ self.num_heads * self.window_size**2,
+ kernel_size=1,
+ groups=self.num_heads)
+
+ self.drop_path = DropPath(
+ drop_path) if drop_path > 0. else nn.Identity()
+ self.norm2 = norm_layer(dim)
+ mlp_hidden_dim = int(dim * mlp_ratio)
+ self.mlp = Mlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ act_layer=act_layer,
+ drop=drop)
+
+ def forward(self, x):
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+
+ shortcut = x
+ x = self.norm1(x)
+ x = x.view(B, H, W, C)
+
+ # shift
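+        # Swin MLP realizes the window shift with zero padding rather than
+        # the cyclic torch.roll used in Swin Transformer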
+ if self.shift_size > 0:
+ P_l, P_r, P_t, P_b = self.padding
+ shifted_x = F.pad(x, [0, 0, P_l, P_r, P_t, P_b], 'constant', 0)
+ else:
+ shifted_x = x
+ _, _H, _W, _ = shifted_x.shape
+
+ # partition windows
+ x_windows = window_partition(
+ shifted_x, self.window_size) # nW*B, window_size, window_size, C
+ x_windows = x_windows.view(-1, self.window_size * self.window_size,
+ C) # nW*B, window_size*window_size, C
+
+ # Window/Shifted-Window Spatial MLP
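+        # the grouped 1x1 Conv1d below mixes the window_size**2 spatial
+        # positions within each head independently (a per-head token-mixing
+        # MLP over the window)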
+ x_windows_heads = x_windows.view(-1,
+ self.window_size * self.window_size,
+ self.num_heads, C // self.num_heads)
+ x_windows_heads = x_windows_heads.transpose(
+ 1, 2) # nW*B, nH, window_size*window_size, C//nH
+ x_windows_heads = x_windows_heads.reshape(
+ -1, self.num_heads * self.window_size * self.window_size,
+ C // self.num_heads)
+ spatial_mlp_windows = self.spatial_mlp(
+ x_windows_heads) # nW*B, nH*window_size*window_size, C//nH
+ spatial_mlp_windows = spatial_mlp_windows.view(
+ -1, self.num_heads, self.window_size * self.window_size,
+ C // self.num_heads).transpose(1, 2)
+ spatial_mlp_windows = spatial_mlp_windows.reshape(
+ -1, self.window_size * self.window_size, C)
+
+ # merge windows
+ spatial_mlp_windows = spatial_mlp_windows.reshape(
+ -1, self.window_size, self.window_size, C)
+ shifted_x = window_reverse(spatial_mlp_windows, self.window_size, _H,
+ _W) # B H' W' C
+
+ # reverse shift
+ if self.shift_size > 0:
+ P_l, P_r, P_t, P_b = self.padding
+ x = shifted_x[:, P_t:-P_b, P_l:-P_r, :].contiguous()
+ else:
+ x = shifted_x
+ x = x.view(B, H * W, C)
+
+ # FFN
+ x = shortcut + self.drop_path(x)
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
+
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, input_resolution={self.input_resolution}, '
+ f'num_heads={self.num_heads}, '
+ f'window_size={self.window_size}, '
+ f'shift_size={self.shift_size}, '
+ f'mlp_ratio={self.mlp_ratio}')
+
+ def flops(self):
+ flops = 0
+ H, W = self.input_resolution
+ # norm1
+ flops += self.dim * H * W
+
+ # Window/Shifted-Window Spatial MLP
+ if self.shift_size > 0:
+ nW = (H / self.window_size + 1) * (W / self.window_size + 1)
+ else:
+ nW = H * W / self.window_size / self.window_size
+ flops += nW * self.dim * (self.window_size * self.window_size) * (
+ self.window_size * self.window_size)
+ # mlp
+ flops += 2 * H * W * self.dim * self.dim * self.mlp_ratio
+ # norm2
+ flops += self.dim * H * W
+ return flops
+
+
+class PatchMerging(nn.Module):
+ r""" Patch Merging Layer.
+
+ Args:
+ input_resolution (tuple[int]): Resolution of input feature.
+ dim (int): Number of input channels.
+ norm_layer (nn.Module, optional): Normalization layer. Default:
+ nn.LayerNorm
+ """
+
+ def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
+ super().__init__()
+ self.input_resolution = input_resolution
+ self.dim = dim
+ self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
+ self.norm = norm_layer(4 * dim)
+
+ def forward(self, x):
+ """
+ x: B, H*W, C
+ """
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+ assert H % 2 == 0 and W % 2 == 0, f'x size ({H}*{W}) are not even.'
+
+ x = x.view(B, H, W, C)
+
+ x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C
+ x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C
+ x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C
+ x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C
+ x = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C
+ x = x.view(B, -1, 4 * C) # B H/2*W/2 4*C
+
+ x = self.norm(x)
+ x = self.reduction(x)
+
+ return x
+
+ def extra_repr(self) -> str:
+ return f'input_resolution={self.input_resolution}, dim={self.dim}'
+
+ def flops(self):
+ H, W = self.input_resolution
+ flops = H * W * self.dim
+ flops += (H // 2) * (W // 2) * 4 * self.dim * 2 * self.dim
+ return flops
+
+
+class BasicLayer(nn.Module):
+ """A basic Swin MLP layer for one stage.
+
+ Args:
+ dim (int): Number of input channels.
+ input_resolution (tuple[int]): Input resolution.
+ depth (int): Number of blocks.
+ num_heads (int): Number of attention heads.
+ window_size (int): Local window size.
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+ drop (float, optional): Dropout rate. Default: 0.0
+ drop_path (float | tuple[float], optional): Stochastic depth rate.
+ Default: 0.0
+ norm_layer (nn.Module, optional): Normalization layer. Default:
+ nn.LayerNorm
+ downsample (nn.Module | None, optional): Downsample layer at the end
+ of the layer. Default: None
+ use_checkpoint (bool): Whether to use checkpointing to save memory.
+ Default: False.
+ """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ depth,
+ num_heads,
+ window_size,
+ mlp_ratio=4.,
+ drop=0.,
+ drop_path=0.,
+ norm_layer=nn.LayerNorm,
+ downsample=None,
+ use_checkpoint=False):
+
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.depth = depth
+ self.use_checkpoint = use_checkpoint
+
+ # build blocks
+ self.blocks = nn.ModuleList([
+ SwinMLPBlock(
+ dim=dim,
+ input_resolution=input_resolution,
+ num_heads=num_heads,
+ window_size=window_size,
+ shift_size=0 if (i % 2 == 0) else window_size // 2,
+ mlp_ratio=mlp_ratio,
+ drop=drop,
+ drop_path=drop_path[i]
+ if isinstance(drop_path, list) else drop_path,
+ norm_layer=norm_layer) for i in range(depth)
+ ])
+
+ # patch merging layer
+ if downsample is not None:
+ self.downsample = downsample(
+ input_resolution, dim=dim, norm_layer=norm_layer)
+ else:
+ self.downsample = None
+
+ def forward(self, x):
+ for blk in self.blocks:
+ if self.use_checkpoint:
+ x = checkpoint.checkpoint(blk, x)
+ else:
+ x = blk(x)
+ if self.downsample is not None:
+ x = self.downsample(x)
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, input_resolution={self.input_resolution}, '
+ f'depth={self.depth}')
+
+ def flops(self):
+ flops = 0
+ for blk in self.blocks:
+ flops += blk.flops()
+ if self.downsample is not None:
+ flops += self.downsample.flops()
+ return flops
+
+
+class PatchEmbed(nn.Module):
+ r""" Image to Patch Embedding
+
+ Args:
+ img_size (int): Image size. Default: 224.
+ patch_size (int): Patch token size. Default: 4.
+ in_chans (int): Number of input image channels. Default: 3.
+ embed_dim (int): Number of linear projection output channels.
+ Default: 96.
+ norm_layer (nn.Module, optional): Normalization layer. Default: None
+ """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ embed_dim=96,
+ norm_layer=None):
+ super().__init__()
+ img_size = to_2tuple(img_size)
+ patch_size = to_2tuple(patch_size)
+ patches_resolution = [
+ img_size[0] // patch_size[0], img_size[1] // patch_size[1]
+ ]
+ self.img_size = img_size
+ self.patch_size = patch_size
+ self.patches_resolution = patches_resolution
+ self.num_patches = patches_resolution[0] * patches_resolution[1]
+
+ self.in_chans = in_chans
+ self.embed_dim = embed_dim
+
+ self.proj = nn.Conv2d(
+ in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
+ if norm_layer is not None:
+ self.norm = norm_layer(embed_dim)
+ else:
+ self.norm = None
+
+ def forward(self, x):
+ B, C, H, W = x.shape
+ # FIXME look at relaxing size constraints
+ assert H == self.img_size[0] and W == self.img_size[1], \
+ (f"Input image size ({H}*{W}) doesn't match model ("
+ f'{self.img_size[0]}*{self.img_size[1]}).')
+ x = self.proj(x).flatten(2).transpose(1, 2) # B Ph*Pw C
+ if self.norm is not None:
+ x = self.norm(x)
+ return x
+
+ def flops(self):
+ Ho, Wo = self.patches_resolution
+ flops = Ho * Wo * self.embed_dim * self.in_chans * (
+ self.patch_size[0] * self.patch_size[1])
+ if self.norm is not None:
+ flops += Ho * Wo * self.embed_dim
+ return flops
+
+
+class SwinMLP(nn.Module):
+ r""" Swin MLP
+
+ Args:
+ img_size (int | tuple(int)): Input image size. Default 224
+ patch_size (int | tuple(int)): Patch size. Default: 4
+ in_chans (int): Number of input image channels. Default: 3
+ num_classes (int): Number of classes for classification head.
+ Default: 1000
+ embed_dim (int): Patch embedding dimension. Default: 96
+ depths (tuple(int)): Depth of each Swin MLP layer.
+ num_heads (tuple(int)): Number of attention heads in different layers.
+ window_size (int): Window size. Default: 7
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4
+ drop_rate (float): Dropout rate. Default: 0
+ drop_path_rate (float): Stochastic depth rate. Default: 0.1
+ norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
+ ape (bool): If True, add absolute position embedding to the patch
+ embedding. Default: False
+ patch_norm (bool): If True, add normalization after patch embedding.
+ Default: True
+ use_checkpoint (bool): Whether to use checkpointing to save memory.
+ Default: False
+ """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ num_classes=1000,
+ embed_dim=96,
+ depths=[2, 2, 6, 2],
+ num_heads=[3, 6, 12, 24],
+ window_size=7,
+ mlp_ratio=4.,
+ drop_rate=0.,
+ drop_path_rate=0.1,
+ norm_layer=nn.LayerNorm,
+ ape=False,
+ patch_norm=True,
+ use_checkpoint=False,
+ **kwargs):
+ super().__init__()
+
+ self.num_classes = num_classes
+ self.num_layers = len(depths)
+ self.embed_dim = embed_dim
+ self.ape = ape
+ self.patch_norm = patch_norm
+ self.num_features = int(embed_dim * 2**(self.num_layers - 1))
+ self.mlp_ratio = mlp_ratio
+
+ # split image into non-overlapping patches
+ self.patch_embed = PatchEmbed(
+ img_size=img_size,
+ patch_size=patch_size,
+ in_chans=in_chans,
+ embed_dim=embed_dim,
+ norm_layer=norm_layer if self.patch_norm else None)
+ num_patches = self.patch_embed.num_patches
+ patches_resolution = self.patch_embed.patches_resolution
+ self.patches_resolution = patches_resolution
+
+ # absolute position embedding
+ if self.ape:
+ self.absolute_pos_embed = nn.Parameter(
+ torch.zeros(1, num_patches, embed_dim))
+ trunc_normal_(self.absolute_pos_embed, std=.02)
+
+ self.pos_drop = nn.Dropout(p=drop_rate)
+
+ # stochastic depth
+ dpr = [
+ x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))
+ ] # stochastic depth decay rule
+
+ # build layers
+ self.layers = nn.ModuleList()
+ for i_layer in range(self.num_layers):
+ layer = BasicLayer(
+ dim=int(embed_dim * 2**i_layer),
+ input_resolution=(patches_resolution[0] // (2**i_layer),
+ patches_resolution[1] // (2**i_layer)),
+ depth=depths[i_layer],
+ num_heads=num_heads[i_layer],
+ window_size=window_size,
+ mlp_ratio=self.mlp_ratio,
+ drop=drop_rate,
+ drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])],
+ norm_layer=norm_layer,
+ downsample=PatchMerging if
+ (i_layer < self.num_layers - 1) else None,
+ use_checkpoint=use_checkpoint)
+ self.layers.append(layer)
+
+ self.norm = norm_layer(self.num_features)
+ self.avgpool = nn.AdaptiveAvgPool1d(1)
+ self.head = nn.Linear(
+ self.num_features,
+ num_classes) if num_classes > 0 else nn.Identity()
+
+ self.apply(self._init_weights)
+
+ def _init_weights(self, m):
+ if isinstance(m, (nn.Linear, nn.Conv1d)):
+ trunc_normal_(m.weight, std=.02)
+ if m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ return {'absolute_pos_embed'}
+
+ @torch.jit.ignore
+ def no_weight_decay_keywords(self):
+ return {'relative_position_bias_table'}
+
+ def forward_features(self, x):
+ x = self.patch_embed(x)
+ if self.ape:
+ x = x + self.absolute_pos_embed
+ x = self.pos_drop(x)
+
+ for layer in self.layers:
+ x = layer(x)
+
+ x = self.norm(x) # B L C
+ x = self.avgpool(x.transpose(1, 2)) # B C 1
+ x = torch.flatten(x, 1)
+ return x
+
+ def forward(self, x):
+ x = self.forward_features(x)
+ x = self.head(x)
+ return x
+
+ def flops(self):
+ flops = 0
+ flops += self.patch_embed.flops()
+ for i, layer in enumerate(self.layers):
+ flops += layer.flops()
+ flops += self.num_features * self.patches_resolution[
+ 0] * self.patches_resolution[1] // (2**self.num_layers)
+ flops += self.num_features * self.num_classes
+ return flops
diff --git a/projects/pose_anything/models/backbones/swin_transformer.py b/projects/pose_anything/models/backbones/swin_transformer.py
new file mode 100644
index 0000000000..17ba4d2c1a
--- /dev/null
+++ b/projects/pose_anything/models/backbones/swin_transformer.py
@@ -0,0 +1,751 @@
+# --------------------------------------------------------
+# Swin Transformer
+# Copyright (c) 2021 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ze Liu
+# --------------------------------------------------------
+
+import torch
+import torch.nn as nn
+import torch.utils.checkpoint as checkpoint
+from timm.models.layers import DropPath, to_2tuple, trunc_normal_
+
+try:
+ import os
+ import sys
+
+ kernel_path = os.path.abspath(os.path.join('..'))
+ sys.path.append(kernel_path)
+ from kernels.window_process.window_process import (WindowProcess,
+ WindowProcessReverse)
+
+except ImportError:
+ WindowProcess = None
+ WindowProcessReverse = None
+ print(
+        '[Warning] Fused window process has not been installed. Please refer '
+ 'to get_started.md for installation.')
+
+
+class Mlp(nn.Module):
+
+ def __init__(self,
+ in_features,
+ hidden_features=None,
+ out_features=None,
+ act_layer=nn.GELU,
+ drop=0.):
+ super().__init__()
+ out_features = out_features or in_features
+ hidden_features = hidden_features or in_features
+ self.fc1 = nn.Linear(in_features, hidden_features)
+ self.act = act_layer()
+ self.fc2 = nn.Linear(hidden_features, out_features)
+ self.drop = nn.Dropout(drop)
+
+ def forward(self, x):
+ x = self.fc1(x)
+ x = self.act(x)
+ x = self.drop(x)
+ x = self.fc2(x)
+ x = self.drop(x)
+ return x
+
+
+def window_partition(x, window_size):
+ """
+ Args:
+ x: (B, H, W, C)
+ window_size (int): window size
+
+ Returns:
+ windows: (num_windows*B, window_size, window_size, C)
+ """
+ B, H, W, C = x.shape
+ x = x.view(B, H // window_size, window_size, W // window_size, window_size,
+ C)
+ windows = x.permute(0, 1, 3, 2, 4,
+ 5).contiguous().view(-1, window_size, window_size, C)
+ return windows
+
+
+def window_reverse(windows, window_size, H, W):
+ """
+ Args:
+ windows: (num_windows*B, window_size, window_size, C)
+ window_size (int): Window size
+ H (int): Height of image
+ W (int): Width of image
+
+ Returns:
+ x: (B, H, W, C)
+ """
+ B = int(windows.shape[0] / (H * W / window_size / window_size))
+ x = windows.view(B, H // window_size, W // window_size, window_size,
+ window_size, -1)
+ x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
+ return x
+
+
+class WindowAttention(nn.Module):
+ r""" Window based multi-head self attention (W-MSA) module with relative
+ position bias. It supports both of shifted and non-shifted window.
+
+ Args: dim (int): Number of input channels. window_size (tuple[int]): The
+ height and width of the window. num_heads (int): Number of attention
+ heads. qkv_bias (bool, optional): If True, add a learnable bias to
+ query, key, value. Default: True qk_scale (float | None, optional):
+ Override default qk scale of head_dim ** -0.5 if set attn_drop (float,
+ optional): Dropout ratio of attention weight. Default: 0.0 proj_drop (
+ float, optional): Dropout ratio of output. Default: 0.0
+ """
+
+ def __init__(self,
+ dim,
+ window_size,
+ num_heads,
+ qkv_bias=True,
+ qk_scale=None,
+ attn_drop=0.,
+ proj_drop=0.):
+
+ super().__init__()
+ self.dim = dim
+ self.window_size = window_size # Wh, Ww
+ self.num_heads = num_heads
+ head_dim = dim // num_heads
+ self.scale = qk_scale or head_dim**-0.5
+
+ # define a parameter table of relative position bias
+ self.relative_position_bias_table = nn.Parameter(
+ torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1),
+ num_heads)) # 2*Wh-1 * 2*Ww-1, nH
+
+ # get pair-wise relative position index for each token inside the
+ # window
+ coords_h = torch.arange(self.window_size[0])
+ coords_w = torch.arange(self.window_size[1])
+ coords = torch.stack(torch.meshgrid([coords_h, coords_w])) # 2, Wh, Ww
+ coords_flatten = torch.flatten(coords, 1) # 2, Wh*Ww
+ relative_coords = (coords_flatten[:, :, None] -
+ coords_flatten[:, None, :]) # 2, Wh*Ww, Wh*Ww
+ relative_coords = relative_coords.permute(
+ 1, 2, 0).contiguous() # Wh*Ww, Wh*Ww, 2
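+        # shift the relative coordinates to start from 0 and scale the row
+        # offset so every (dy, dx) pair maps to a unique index into the bias
+        # table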
+ relative_coords[:, :, 0] += self.window_size[0] - 1
+ relative_coords[:, :, 1] += self.window_size[1] - 1
+ relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
+ relative_position_index = relative_coords.sum(-1) # Wh*Ww, Wh*Ww
+ self.register_buffer('relative_position_index',
+ relative_position_index)
+
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
+ self.attn_drop = nn.Dropout(attn_drop)
+ self.proj = nn.Linear(dim, dim)
+ self.proj_drop = nn.Dropout(proj_drop)
+
+ trunc_normal_(self.relative_position_bias_table, std=.02)
+ self.softmax = nn.Softmax(dim=-1)
+
+ def forward(self, x, mask=None):
+ """
+        Args:
+            x: input features with shape of (num_windows*B, N, C)
+            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww)
+                or None
+        """
+ B_, N, C = x.shape
+ qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads,
+ C // self.num_heads).permute(2, 0, 3, 1, 4)
+ q, k, v = qkv[0], qkv[1], qkv[
+ 2] # make torch script happy (cannot use tensor as tuple)
+
+ q = q * self.scale
+ attn = (q @ k.transpose(-2, -1))
+
+ relative_position_bias = self.relative_position_bias_table[
+ self.relative_position_index.view(-1)].view(
+ self.window_size[0] * self.window_size[1],
+ self.window_size[0] * self.window_size[1],
+ -1) # Wh*Ww,Wh*Ww,nH
+ relative_position_bias = relative_position_bias.permute(
+ 2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
+ attn = attn + relative_position_bias.unsqueeze(0)
+
+ if mask is not None:
+ nW = mask.shape[0]
+ attn = attn.view(B_ // nW, nW, self.num_heads, N,
+ N) + mask.unsqueeze(1).unsqueeze(0)
+ attn = attn.view(-1, self.num_heads, N, N)
+ attn = self.softmax(attn)
+ else:
+ attn = self.softmax(attn)
+
+ attn = self.attn_drop(attn)
+
+ x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
+ x = self.proj(x)
+ x = self.proj_drop(x)
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, window_size={self.window_size}, num_heads='
+ f'{self.num_heads}')
+
+ def flops(self, N):
+ # calculate flops for 1 window with token length of N
+ flops = 0
+ # qkv = self.qkv(x)
+ flops += N * self.dim * 3 * self.dim
+ # attn = (q @ k.transpose(-2, -1))
+ flops += self.num_heads * N * (self.dim // self.num_heads) * N
+ # x = (attn @ v)
+ flops += self.num_heads * N * N * (self.dim // self.num_heads)
+ # x = self.proj(x)
+ flops += N * self.dim * self.dim
+ return flops
+
+
+class SwinTransformerBlock(nn.Module):
+ r""" Swin Transformer Block.
+
+    Args:
+        dim (int): Number of input channels.
+        input_resolution (tuple[int]): Input resolution.
+        num_heads (int): Number of attention heads.
+        window_size (int): Window size.
+        shift_size (int): Shift size for SW-MSA.
+        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+        qkv_bias (bool, optional): If True, add a learnable bias to query,
+            key, value. Default: True
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set.
+        drop (float, optional): Dropout rate. Default: 0.0
+        attn_drop (float, optional): Attention dropout rate. Default: 0.0
+        drop_path (float, optional): Stochastic depth rate. Default: 0.0
+        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
+        norm_layer (nn.Module, optional): Normalization layer.
+            Default: nn.LayerNorm
+        fused_window_process (bool, optional): If True, use one kernel to
+            fuse window shift & window partition for acceleration, similar
+            for the reversed part. Default: False
+    """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ num_heads,
+ window_size=7,
+ shift_size=0,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ qk_scale=None,
+ drop=0.,
+ attn_drop=0.,
+ drop_path=0.,
+ act_layer=nn.GELU,
+ norm_layer=nn.LayerNorm,
+ fused_window_process=False):
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.num_heads = num_heads
+ self.window_size = window_size
+ self.shift_size = shift_size
+ self.mlp_ratio = mlp_ratio
+ if min(self.input_resolution) <= self.window_size:
+ # if window size is larger than input resolution, we don't
+ # partition windows
+ self.shift_size = 0
+ self.window_size = min(self.input_resolution)
+        assert 0 <= self.shift_size < self.window_size, (
+            'shift_size must be in [0, window_size)')
+
+ self.norm1 = norm_layer(dim)
+ self.attn = WindowAttention(
+ dim,
+ window_size=to_2tuple(self.window_size),
+ num_heads=num_heads,
+ qkv_bias=qkv_bias,
+ qk_scale=qk_scale,
+ attn_drop=attn_drop,
+ proj_drop=drop)
+
+ self.drop_path = DropPath(
+ drop_path) if drop_path > 0. else nn.Identity()
+ self.norm2 = norm_layer(dim)
+ mlp_hidden_dim = int(dim * mlp_ratio)
+ self.mlp = Mlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ act_layer=act_layer,
+ drop=drop)
+
+ if self.shift_size > 0:
+ # calculate attention mask for SW-MSA
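+            # after the cyclic shift, regions originating from different
+            # image locations get different ids; token pairs with unequal ids
+            # receive a -100 bias so attention between them is suppressed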
+ H, W = self.input_resolution
+ img_mask = torch.zeros((1, H, W, 1)) # 1 H W 1
+ h_slices = (slice(0, -self.window_size),
+ slice(-self.window_size,
+ -self.shift_size), slice(-self.shift_size, None))
+ w_slices = (slice(0, -self.window_size),
+ slice(-self.window_size,
+ -self.shift_size), slice(-self.shift_size, None))
+ cnt = 0
+ for h in h_slices:
+ for w in w_slices:
+ img_mask[:, h, w, :] = cnt
+ cnt += 1
+
+ mask_windows = window_partition(
+ img_mask, self.window_size) # nW, window_size, window_size, 1
+ mask_windows = mask_windows.view(
+ -1, self.window_size * self.window_size)
+ attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
+ attn_mask = attn_mask.masked_fill(attn_mask != 0,
+ float(-100.0)).masked_fill(
+ attn_mask == 0, float(0.0))
+ else:
+ attn_mask = None
+
+ self.register_buffer('attn_mask', attn_mask)
+ self.fused_window_process = fused_window_process
+
+ def forward(self, x):
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+
+ shortcut = x
+ x = self.norm1(x)
+ x = x.view(B, H, W, C)
+
+ # cyclic shift
+ if self.shift_size > 0:
+ if not self.fused_window_process:
+ shifted_x = torch.roll(
+ x,
+ shifts=(-self.shift_size, -self.shift_size),
+ dims=(1, 2))
+ # partition windows
+ x_windows = window_partition(
+ shifted_x,
+ self.window_size) # nW*B, window_size, window_size, C
+ else:
+ x_windows = WindowProcess.apply(x, B, H, W, C,
+ -self.shift_size,
+ self.window_size)
+ else:
+ shifted_x = x
+ # partition windows
+ x_windows = window_partition(
+ shifted_x,
+ self.window_size) # nW*B, window_size, window_size, C
+
+ x_windows = x_windows.view(-1, self.window_size * self.window_size,
+ C) # nW*B, window_size*window_size, C
+
+ # W-MSA/SW-MSA
+ attn_windows = self.attn(
+ x_windows, mask=self.attn_mask) # nW*B, window_size*window_size, C
+
+ # merge windows
+ attn_windows = attn_windows.view(-1, self.window_size,
+ self.window_size, C)
+
+ # reverse cyclic shift
+ if self.shift_size > 0:
+ if not self.fused_window_process:
+ shifted_x = window_reverse(attn_windows, self.window_size, H,
+ W) # B H' W' C
+ x = torch.roll(
+ shifted_x,
+ shifts=(self.shift_size, self.shift_size),
+ dims=(1, 2))
+ else:
+ x = WindowProcessReverse.apply(attn_windows, B, H, W, C,
+ self.shift_size,
+ self.window_size)
+ else:
+ shifted_x = window_reverse(attn_windows, self.window_size, H,
+ W) # B H' W' C
+ x = shifted_x
+ x = x.view(B, H * W, C)
+ x = shortcut + self.drop_path(x)
+
+ # FFN
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
+
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, '
+ f'input_resolution={self.input_resolution}, '
+ f'num_heads={self.num_heads}, '
+ f'window_size={self.window_size}, '
+ f'shift_size={self.shift_size}, '
+ f'mlp_ratio={self.mlp_ratio}')
+
+ def flops(self):
+ flops = 0
+ H, W = self.input_resolution
+ # norm1
+ flops += self.dim * H * W
+ # W-MSA/SW-MSA
+ nW = H * W / self.window_size / self.window_size
+ flops += nW * self.attn.flops(self.window_size * self.window_size)
+ # mlp
+ flops += 2 * H * W * self.dim * self.dim * self.mlp_ratio
+ # norm2
+ flops += self.dim * H * W
+ return flops
+
+
+class PatchMerging(nn.Module):
+ r""" Patch Merging Layer.
+
+    Args:
+        input_resolution (tuple[int]): Resolution of input feature.
+        dim (int): Number of input channels.
+        norm_layer (nn.Module, optional): Normalization layer.
+            Default: nn.LayerNorm
+    """
+
+ def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
+ super().__init__()
+ self.input_resolution = input_resolution
+ self.dim = dim
+ self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
+ self.norm = norm_layer(4 * dim)
+
+ def forward(self, x):
+ """
+ x: B, H*W, C
+ """
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+ assert H % 2 == 0 and W % 2 == 0, f'x size ({H}*{W}) are not even.'
+
+ x = x.view(B, H, W, C)
+
+ x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C
+ x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C
+ x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C
+ x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C
+ x = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C
+ x = x.view(B, -1, 4 * C) # B H/2*W/2 4*C
+
+ x = self.norm(x)
+ x = self.reduction(x)
+
+ return x
+
+ def extra_repr(self) -> str:
+ return f'input_resolution={self.input_resolution}, dim={self.dim}'
+
+ def flops(self):
+ H, W = self.input_resolution
+ flops = H * W * self.dim
+ flops += (H // 2) * (W // 2) * 4 * self.dim * 2 * self.dim
+ return flops
+
+
+class BasicLayer(nn.Module):
+ """A basic Swin Transformer layer for one stage.
+
+    Args:
+        dim (int): Number of input channels.
+        input_resolution (tuple[int]): Input resolution.
+        depth (int): Number of blocks.
+        num_heads (int): Number of attention heads.
+        window_size (int): Local window size.
+        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+        qkv_bias (bool, optional): If True, add a learnable bias to query,
+            key, value. Default: True
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set.
+        drop (float, optional): Dropout rate. Default: 0.0
+        attn_drop (float, optional): Attention dropout rate. Default: 0.0
+        drop_path (float | tuple[float], optional): Stochastic depth rate.
+            Default: 0.0
+        norm_layer (nn.Module, optional): Normalization layer.
+            Default: nn.LayerNorm
+        downsample (nn.Module | None, optional): Downsample layer at the end
+            of the layer. Default: None
+        use_checkpoint (bool): Whether to use checkpointing to save memory.
+            Default: False.
+        fused_window_process (bool, optional): If True, use one kernel to
+            fuse window shift & window partition for acceleration, similar
+            for the reversed part. Default: False
+    """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ depth,
+ num_heads,
+ window_size,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ qk_scale=None,
+ drop=0.,
+ attn_drop=0.,
+ drop_path=0.,
+ norm_layer=nn.LayerNorm,
+ downsample=None,
+ use_checkpoint=False,
+ fused_window_process=False):
+
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.depth = depth
+ self.use_checkpoint = use_checkpoint
+
+ # build blocks
+ self.blocks = nn.ModuleList([
+ SwinTransformerBlock(
+ dim=dim,
+ input_resolution=input_resolution,
+ num_heads=num_heads,
+ window_size=window_size,
+ shift_size=0 if (i % 2 == 0) else window_size // 2,
+ mlp_ratio=mlp_ratio,
+ qkv_bias=qkv_bias,
+ qk_scale=qk_scale,
+ drop=drop,
+ attn_drop=attn_drop,
+ drop_path=drop_path[i]
+ if isinstance(drop_path, list) else drop_path,
+ norm_layer=norm_layer,
+ fused_window_process=fused_window_process)
+ for i in range(depth)
+ ])
+
+ # patch merging layer
+ if downsample is not None:
+ self.downsample = downsample(
+ input_resolution, dim=dim, norm_layer=norm_layer)
+ else:
+ self.downsample = None
+
+ def forward(self, x):
+ for blk in self.blocks:
+ if self.use_checkpoint:
+ x = checkpoint.checkpoint(blk, x)
+ else:
+ x = blk(x)
+ if self.downsample is not None:
+ x = self.downsample(x)
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, input_resolution={self.input_resolution}, '
+ f'depth={self.depth}')
+
+ def flops(self):
+ flops = 0
+ for blk in self.blocks:
+ flops += blk.flops()
+ if self.downsample is not None:
+ flops += self.downsample.flops()
+ return flops
+
+
+class PatchEmbed(nn.Module):
+ r""" Image to Patch Embedding
+
+    Args:
+        img_size (int): Image size. Default: 224.
+        patch_size (int): Patch token size. Default: 4.
+        in_chans (int): Number of input image channels. Default: 3.
+        embed_dim (int): Number of linear projection output channels.
+            Default: 96.
+        norm_layer (nn.Module, optional): Normalization layer. Default: None
+    """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ embed_dim=96,
+ norm_layer=None):
+ super().__init__()
+ img_size = to_2tuple(img_size)
+ patch_size = to_2tuple(patch_size)
+ patches_resolution = [
+ img_size[0] // patch_size[0], img_size[1] // patch_size[1]
+ ]
+ self.img_size = img_size
+ self.patch_size = patch_size
+ self.patches_resolution = patches_resolution
+ self.num_patches = patches_resolution[0] * patches_resolution[1]
+
+ self.in_chans = in_chans
+ self.embed_dim = embed_dim
+
+ self.proj = nn.Conv2d(
+ in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
+ if norm_layer is not None:
+ self.norm = norm_layer(embed_dim)
+ else:
+ self.norm = None
+
+ def forward(self, x):
+ B, C, H, W = x.shape
+ # FIXME look at relaxing size constraints
+ assert H == self.img_size[0] and W == self.img_size[1], \
+ (f"Input image size ({H}*{W}) doesn't match model "
+ f'({self.img_size[0]}*{self.img_size[1]}).')
+ x = self.proj(x).flatten(2).transpose(1, 2) # B Ph*Pw C
+ if self.norm is not None:
+ x = self.norm(x)
+ return x
+
+ def flops(self):
+ Ho, Wo = self.patches_resolution
+ flops = Ho * Wo * self.embed_dim * self.in_chans * (
+ self.patch_size[0] * self.patch_size[1])
+ if self.norm is not None:
+ flops += Ho * Wo * self.embed_dim
+ return flops
+
+
+class SwinTransformer(nn.Module):
+ r""" Swin Transformer A PyTorch impl of : `Swin Transformer: Hierarchical
+ Vision Transformer using Shifted Windows` -
+ https://arxiv.org/pdf/2103.14030
+
+ Args: img_size (int | tuple(int)): Input image size. Default 224
+ patch_size (int | tuple(int)): Patch size. Default: 4 in_chans (int):
+ Number of input image channels. Default: 3 num_classes (int): Number of
+ classes for classification head. Default: 1000 embed_dim (int): Patch
+ embedding dimension. Default: 96 depths (tuple(int)): Depth of each Swin
+ Transformer layer. num_heads (tuple(int)): Number of attention heads in
+ different layers. window_size (int): Window size. Default: 7 mlp_ratio (
+ float): Ratio of mlp hidden dim to embedding dim. Default: 4 qkv_bias (
+ bool): If True, add a learnable bias to query, key, value. Default: True
+ qk_scale (float): Override default qk scale of head_dim ** -0.5 if set.
+ Default: None drop_rate (float): Dropout rate. Default: 0 attn_drop_rate
+ (float): Attention dropout rate. Default: 0 drop_path_rate (float):
+ Stochastic depth rate. Default: 0.1 norm_layer (nn.Module): Normalization
+ layer. Default: nn.LayerNorm. ape (bool): If True, add absolute position
+ embedding to the patch embedding. Default: False patch_norm (bool): If
+ True, add normalization after patch embedding. Default: True
+ use_checkpoint (bool): Whether to use checkpointing to save memory.
+ Default: False fused_window_process (bool, optional): If True, use one
+ kernel to fused window shift & window partition for acceleration, similar
+ for the reversed part. Default: False
+ """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ num_classes=1000,
+ embed_dim=96,
+ depths=[2, 2, 6, 2],
+ num_heads=[3, 6, 12, 24],
+ window_size=7,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ qk_scale=None,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.1,
+ norm_layer=nn.LayerNorm,
+ ape=False,
+ patch_norm=True,
+ use_checkpoint=False,
+ fused_window_process=False,
+ **kwargs):
+ super().__init__()
+
+ self.num_classes = num_classes
+ self.num_layers = len(depths)
+ self.embed_dim = embed_dim
+ self.ape = ape
+ self.patch_norm = patch_norm
+ self.num_features = int(embed_dim * 2**(self.num_layers - 1))
+ self.mlp_ratio = mlp_ratio
+
+ # split image into non-overlapping patches
+ self.patch_embed = PatchEmbed(
+ img_size=img_size,
+ patch_size=patch_size,
+ in_chans=in_chans,
+ embed_dim=embed_dim,
+ norm_layer=norm_layer if self.patch_norm else None)
+ num_patches = self.patch_embed.num_patches
+ patches_resolution = self.patch_embed.patches_resolution
+ self.patches_resolution = patches_resolution
+
+ # absolute position embedding
+ if self.ape:
+ self.absolute_pos_embed = nn.Parameter(
+ torch.zeros(1, num_patches, embed_dim))
+ trunc_normal_(self.absolute_pos_embed, std=.02)
+
+ self.pos_drop = nn.Dropout(p=drop_rate)
+
+ # stochastic depth
+ dpr = [
+ x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))
+ ] # stochastic depth decay rule
+
+ # build layers
+ self.layers = nn.ModuleList()
+ for i_layer in range(self.num_layers):
+ layer = BasicLayer(
+ dim=int(embed_dim * 2**i_layer),
+ input_resolution=(patches_resolution[0] // (2**i_layer),
+ patches_resolution[1] // (2**i_layer)),
+ depth=depths[i_layer],
+ num_heads=num_heads[i_layer],
+ window_size=window_size,
+ mlp_ratio=self.mlp_ratio,
+ qkv_bias=qkv_bias,
+ qk_scale=qk_scale,
+ drop=drop_rate,
+ attn_drop=attn_drop_rate,
+ drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])],
+ norm_layer=norm_layer,
+ downsample=PatchMerging if
+ (i_layer < self.num_layers - 1) else None,
+ use_checkpoint=use_checkpoint,
+ fused_window_process=fused_window_process)
+ self.layers.append(layer)
+
+ self.norm = norm_layer(self.num_features)
+ self.avgpool = nn.AdaptiveAvgPool1d(1)
+ self.head = nn.Linear(
+ self.num_features,
+ num_classes) if num_classes > 0 else nn.Identity()
+
+ self.apply(self._init_weights)
+
+ def _init_weights(self, m):
+ if isinstance(m, nn.Linear):
+ trunc_normal_(m.weight, std=.02)
+ if isinstance(m, nn.Linear) and m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ return {'absolute_pos_embed'}
+
+ @torch.jit.ignore
+ def no_weight_decay_keywords(self):
+ return {'relative_position_bias_table'}
+
+ def forward_features(self, x):
+ x = self.patch_embed(x)
+ if self.ape:
+ x = x + self.absolute_pos_embed
+ x = self.pos_drop(x)
+
+ for layer in self.layers:
+ x = layer(x)
+
+ x = self.norm(x) # B L C
+ x = self.avgpool(x.transpose(1, 2)) # B C 1
+ x = torch.flatten(x, 1)
+ return x
+
+ def forward(self, x):
+ x = self.forward_features(x)
+ x = self.head(x)
+ return x
+
+ def flops(self):
+ flops = 0
+ flops += self.patch_embed.flops()
+ for i, layer in enumerate(self.layers):
+ flops += layer.flops()
+ flops += self.num_features * self.patches_resolution[
+ 0] * self.patches_resolution[1] // (2**self.num_layers)
+ flops += self.num_features * self.num_classes
+ return flops
diff --git a/projects/pose_anything/models/backbones/swin_transformer_moe.py b/projects/pose_anything/models/backbones/swin_transformer_moe.py
new file mode 100644
index 0000000000..f7e07d540f
--- /dev/null
+++ b/projects/pose_anything/models/backbones/swin_transformer_moe.py
@@ -0,0 +1,1066 @@
+# --------------------------------------------------------
+# Swin Transformer MoE
+# Copyright (c) 2022 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ze Liu
+# --------------------------------------------------------
+
+import numpy as np
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.utils.checkpoint as checkpoint
+from timm.models.layers import DropPath, to_2tuple, trunc_normal_
+
+try:
+ from tutel import moe as tutel_moe
+except ImportError:
+ tutel_moe = None
+ print(
+        'Tutel has not been installed. To use Swin-MoE, please install Tutel; '
+ 'otherwise, just ignore this.')
+
+
+class Mlp(nn.Module):
+
+ def __init__(self,
+ in_features,
+ hidden_features=None,
+ out_features=None,
+ act_layer=nn.GELU,
+ drop=0.,
+ mlp_fc2_bias=True):
+ super().__init__()
+ out_features = out_features or in_features
+ hidden_features = hidden_features or in_features
+ self.fc1 = nn.Linear(in_features, hidden_features)
+ self.act = act_layer()
+ self.fc2 = nn.Linear(hidden_features, out_features, bias=mlp_fc2_bias)
+ self.drop = nn.Dropout(drop)
+
+ def forward(self, x):
+ x = self.fc1(x)
+ x = self.act(x)
+ x = self.drop(x)
+ x = self.fc2(x)
+ x = self.drop(x)
+ return x
+
+
+class MoEMlp(nn.Module):
+
+ def __init__(self,
+ in_features,
+ hidden_features,
+ num_local_experts,
+ top_value,
+ capacity_factor=1.25,
+ cosine_router=False,
+ normalize_gate=False,
+ use_bpr=True,
+ is_gshard_loss=True,
+ gate_noise=1.0,
+ cosine_router_dim=256,
+ cosine_router_init_t=0.5,
+ moe_drop=0.0,
+ init_std=0.02,
+ mlp_fc2_bias=True):
+ super().__init__()
+
+ self.in_features = in_features
+ self.hidden_features = hidden_features
+ self.num_local_experts = num_local_experts
+ self.top_value = top_value
+ self.capacity_factor = capacity_factor
+ self.cosine_router = cosine_router
+ self.normalize_gate = normalize_gate
+ self.use_bpr = use_bpr
+ self.init_std = init_std
+ self.mlp_fc2_bias = mlp_fc2_bias
+
+ self.dist_rank = dist.get_rank()
+
+ self._dropout = nn.Dropout(p=moe_drop)
+
+ _gate_type = {
+ 'type': 'cosine_top' if cosine_router else 'top',
+ 'k': top_value,
+ 'capacity_factor': capacity_factor,
+ 'gate_noise': gate_noise,
+ 'fp32_gate': True
+ }
+ if cosine_router:
+ _gate_type['proj_dim'] = cosine_router_dim
+ _gate_type['init_t'] = cosine_router_init_t
+ self._moe_layer = tutel_moe.moe_layer(
+ gate_type=_gate_type,
+ model_dim=in_features,
+ experts={
+ 'type': 'ffn',
+ 'count_per_node': num_local_experts,
+ 'hidden_size_per_expert': hidden_features,
+ 'activation_fn': lambda x: self._dropout(F.gelu(x))
+ },
+ scan_expert_func=lambda name, param: setattr(
+ param, 'skip_allreduce', True),
+ seeds=(1, self.dist_rank + 1, self.dist_rank + 1),
+ batch_prioritized_routing=use_bpr,
+ normalize_gate=normalize_gate,
+ is_gshard_loss=is_gshard_loss,
+ )
+ if not self.mlp_fc2_bias:
+ self._moe_layer.experts.batched_fc2_bias.requires_grad = False
+
+ def forward(self, x):
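+        # the tensor returned by tutel's moe_layer carries the auxiliary
+        # load-balancing loss as `l_aux`; return it alongside the features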
+ x = self._moe_layer(x)
+ return x, x.l_aux
+
+ def extra_repr(self) -> str:
+ return (f'[Statistics-{self.dist_rank}] param count for MoE, '
+ f'in_features = {self.in_features}, '
+ f'hidden_features = {self.hidden_features}, '
+ f'num_local_experts = {self.num_local_experts}, '
+ f'top_value = {self.top_value}, '
+ f'cosine_router={self.cosine_router} '
+ f'normalize_gate={self.normalize_gate}, '
+ f'use_bpr = {self.use_bpr}')
+
+ def _init_weights(self):
+ if hasattr(self._moe_layer, 'experts'):
+ trunc_normal_(
+ self._moe_layer.experts.batched_fc1_w, std=self.init_std)
+ trunc_normal_(
+ self._moe_layer.experts.batched_fc2_w, std=self.init_std)
+ nn.init.constant_(self._moe_layer.experts.batched_fc1_bias, 0)
+ nn.init.constant_(self._moe_layer.experts.batched_fc2_bias, 0)
+
+
+def window_partition(x, window_size):
+ """
+ Args:
+ x: (B, H, W, C)
+ window_size (int): window size
+
+ Returns:
+ windows: (num_windows*B, window_size, window_size, C)
+ """
+ B, H, W, C = x.shape
+ x = x.view(B, H // window_size, window_size, W // window_size, window_size,
+ C)
+ windows = x.permute(0, 1, 3, 2, 4,
+ 5).contiguous().view(-1, window_size, window_size, C)
+ return windows
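+
+# Shape sketch (illustrative sizes only): for x of shape (1, 8, 8, 96) and
+# window_size=4, window_partition(x, 4) returns a (4, 4, 4, 96) tensor, i.e.
+# 2 x 2 = 4 windows stacked along the batch axis; window_reverse(windows, 4,
+# 8, 8) below recovers the original (1, 8, 8, 96) feature map.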
+
+
+def window_reverse(windows, window_size, H, W):
+ """
+ Args:
+ windows: (num_windows*B, window_size, window_size, C)
+ window_size (int): Window size
+ H (int): Height of image
+ W (int): Width of image
+
+ Returns:
+ x: (B, H, W, C)
+ """
+ B = int(windows.shape[0] / (H * W / window_size / window_size))
+ x = windows.view(B, H // window_size, W // window_size, window_size,
+ window_size, -1)
+ x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
+ return x
+
+
+class WindowAttention(nn.Module):
+ r""" Window based multi-head self attention (W-MSA) module with relative
+ position bias.
+    It supports both shifted and non-shifted windows.
+
+ Args:
+ dim (int): Number of input channels.
+ window_size (tuple[int]): The height and width of the window.
+ num_heads (int): Number of attention heads.
+ qkv_bias (bool, optional): If True, add a learnable bias to query,
+ key, value. Default: True
+ qk_scale (float | None, optional): Override default qk scale of
+ head_dim ** -0.5 if set
+ attn_drop (float, optional): Dropout ratio of attention weight.
+ Default: 0.0
+ proj_drop (float, optional): Dropout ratio of output. Default: 0.0
+ pretrained_window_size (tuple[int]): The height and width of the
+ window in pretraining.
+ """
+
+ def __init__(self,
+ dim,
+ window_size,
+ num_heads,
+ qkv_bias=True,
+ qk_scale=None,
+ attn_drop=0.,
+ proj_drop=0.,
+ pretrained_window_size=[0, 0]):
+
+ super().__init__()
+ self.dim = dim
+ self.window_size = window_size # Wh, Ww
+ self.pretrained_window_size = pretrained_window_size
+ self.num_heads = num_heads
+
+ head_dim = dim // num_heads
+ self.scale = qk_scale or head_dim**-0.5
+
+ # mlp to generate continuous relative position bias
+ self.cpb_mlp = nn.Sequential(
+ nn.Linear(2, 512, bias=True), nn.ReLU(inplace=True),
+ nn.Linear(512, num_heads, bias=False))
+
+ # get relative_coords_table
+ relative_coords_h = torch.arange(
+ -(self.window_size[0] - 1),
+ self.window_size[0],
+ dtype=torch.float32)
+ relative_coords_w = torch.arange(
+ -(self.window_size[1] - 1),
+ self.window_size[1],
+ dtype=torch.float32)
+ relative_coords_table = torch.stack(
+ torch.meshgrid([relative_coords_h, relative_coords_w])).permute(
+ 1, 2, 0).contiguous().unsqueeze(0) # 1, 2*Wh-1, 2*Ww-1, 2
+ if pretrained_window_size[0] > 0:
+ relative_coords_table[:, :, :, 0] /= (
+ pretrained_window_size[0] - 1)
+ relative_coords_table[:, :, :, 1] /= (
+ pretrained_window_size[1] - 1)
+ else:
+ relative_coords_table[:, :, :, 0] /= (self.window_size[0] - 1)
+ relative_coords_table[:, :, :, 1] /= (self.window_size[1] - 1)
+ relative_coords_table *= 8 # normalize to -8, 8
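+        # Continuous position bias: compress the linear offsets in [-8, 8]
+        # with sign(x) * log2(|x| + 1) / log2(8) to roughly [-1, 1]; the small
+        # cpb_mlp then predicts a per-head bias from these coordinates.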
+ relative_coords_table = torch.sign(relative_coords_table) * torch.log2(
+ torch.abs(relative_coords_table) + 1.0) / np.log2(8)
+
+ self.register_buffer('relative_coords_table', relative_coords_table)
+
+ # get pair-wise relative position index for each token inside the
+ # window
+ coords_h = torch.arange(self.window_size[0])
+ coords_w = torch.arange(self.window_size[1])
+ coords = torch.stack(torch.meshgrid([coords_h, coords_w])) # 2, Wh, Ww
+ coords_flatten = torch.flatten(coords, 1) # 2, Wh*Ww
+ relative_coords = (coords_flatten[:, :, None] -
+ coords_flatten[:, None, :]) # 2, Wh*Ww, Wh*Ww
+ relative_coords = relative_coords.permute(
+ 1, 2, 0).contiguous() # Wh*Ww, Wh*Ww, 2
+ relative_coords[:, :, 0] += self.window_size[0] - 1
+ relative_coords[:, :, 1] += self.window_size[1] - 1
+ relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
+ relative_position_index = relative_coords.sum(-1) # Wh*Ww, Wh*Ww
+ self.register_buffer('relative_position_index',
+ relative_position_index)
+
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
+ self.attn_drop = nn.Dropout(attn_drop)
+ self.proj = nn.Linear(dim, dim)
+ self.proj_drop = nn.Dropout(proj_drop)
+ self.softmax = nn.Softmax(dim=-1)
+
+ def forward(self, x, mask=None):
+ """
+ Args:
+ x: input features with shape of (num_windows*B, N, C)
+ mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or
+ None
+ """
+ B_, N, C = x.shape
+ qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads,
+ C // self.num_heads).permute(2, 0, 3, 1, 4)
+ q, k, v = qkv[0], qkv[1], qkv[
+ 2] # make torchscript happy (cannot use tensor as tuple)
+
+ q = q * self.scale
+ attn = (q @ k.transpose(-2, -1))
+
+ relative_position_bias_table = self.cpb_mlp(
+ self.relative_coords_table).view(-1, self.num_heads)
+ relative_position_bias = relative_position_bias_table[
+ self.relative_position_index.view(-1)].view(
+ self.window_size[0] * self.window_size[1],
+ self.window_size[0] * self.window_size[1],
+ -1) # Wh*Ww,Wh*Ww,nH
+ relative_position_bias = relative_position_bias.permute(
+ 2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
+ attn = attn + relative_position_bias.unsqueeze(0)
+
+ if mask is not None:
+ nW = mask.shape[0]
+ attn = attn.view(B_ // nW, nW, self.num_heads, N,
+ N) + mask.unsqueeze(1).unsqueeze(0)
+ attn = attn.view(-1, self.num_heads, N, N)
+ attn = self.softmax(attn)
+ else:
+ attn = self.softmax(attn)
+
+ attn = self.attn_drop(attn)
+
+ x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
+ x = self.proj(x)
+ x = self.proj_drop(x)
+ return x
+
+ def extra_repr(self) -> str:
+        return (f'dim={self.dim}, window_size={self.window_size}, '
+                f'pretrained_window_size={self.pretrained_window_size}, '
+                f'num_heads={self.num_heads}')
+
+ def flops(self, N):
+ # calculate flops for 1 window with token length of N
+ flops = 0
+ # qkv = self.qkv(x)
+ flops += N * self.dim * 3 * self.dim
+ # attn = (q @ k.transpose(-2, -1))
+ flops += self.num_heads * N * (self.dim // self.num_heads) * N
+ # x = (attn @ v)
+ flops += self.num_heads * N * N * (self.dim // self.num_heads)
+ # x = self.proj(x)
+ flops += N * self.dim * self.dim
+ return flops
+
+
+class SwinTransformerBlock(nn.Module):
+ r""" Swin Transformer Block.
+
+ Args:
+ dim (int): Number of input channels.
+ input_resolution (tuple[int]): Input resolution.
+ num_heads (int): Number of attention heads.
+ window_size (int): Window size.
+ shift_size (int): Shift size for SW-MSA.
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+ qkv_bias (bool, optional): If True, add a learnable bias to query,
+ key, value. Default: True
+ qk_scale (float | None, optional): Override default qk scale of
+ head_dim ** -0.5 if set.
+ drop (float, optional): Dropout rate. Default: 0.0
+ attn_drop (float, optional): Attention dropout rate. Default: 0.0
+ drop_path (float, optional): Stochastic depth rate. Default: 0.0
+ act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
+ norm_layer (nn.Module, optional): Normalization layer. Default:
+ nn.LayerNorm
+ mlp_fc2_bias (bool): Whether to add bias in fc2 of Mlp. Default: True
+ init_std: Initialization std. Default: 0.02
+ pretrained_window_size (int): Window size in pretraining.
+ is_moe (bool): If True, this block is a MoE block.
+ num_local_experts (int): number of local experts in each device (
+ GPU). Default: 1
+ top_value (int): the value of k in top-k gating. Default: 1
+ capacity_factor (float): the capacity factor in MoE. Default: 1.25
+ cosine_router (bool): Whether to use cosine router. Default: False
+ normalize_gate (bool): Whether to normalize the gating score in top-k
+ gating. Default: False
+ use_bpr (bool): Whether to use batch-prioritized-routing. Default: True
+        is_gshard_loss (bool): If True, use the GShard balance loss;
+            if False, use the load loss and importance loss in
+            "arXiv:1701.06538". Default: True
+ gate_noise (float): the noise ratio in top-k gating. Default: 1.0
+ cosine_router_dim (int): Projection dimension in cosine router.
+ cosine_router_init_t (float): Initialization temperature in cosine
+ router.
+ moe_drop (float): Dropout rate in MoE. Default: 0.0
+ """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ num_heads,
+ window_size=7,
+ shift_size=0,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ qk_scale=None,
+ drop=0.,
+ attn_drop=0.,
+ drop_path=0.,
+ act_layer=nn.GELU,
+ norm_layer=nn.LayerNorm,
+ mlp_fc2_bias=True,
+ init_std=0.02,
+ pretrained_window_size=0,
+ is_moe=False,
+ num_local_experts=1,
+ top_value=1,
+ capacity_factor=1.25,
+ cosine_router=False,
+ normalize_gate=False,
+ use_bpr=True,
+ is_gshard_loss=True,
+ gate_noise=1.0,
+ cosine_router_dim=256,
+ cosine_router_init_t=0.5,
+ moe_drop=0.0):
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.num_heads = num_heads
+ self.window_size = window_size
+ self.shift_size = shift_size
+ self.mlp_ratio = mlp_ratio
+ self.is_moe = is_moe
+ self.capacity_factor = capacity_factor
+ self.top_value = top_value
+
+ if min(self.input_resolution) <= self.window_size:
+ # if window size is larger than input resolution, we don't
+ # partition windows
+ self.shift_size = 0
+ self.window_size = min(self.input_resolution)
+        assert 0 <= self.shift_size < self.window_size, (
+            'shift_size must be in the range [0, window_size)')
+
+ self.norm1 = norm_layer(dim)
+ self.attn = WindowAttention(
+ dim,
+ window_size=to_2tuple(self.window_size),
+ num_heads=num_heads,
+ qkv_bias=qkv_bias,
+ qk_scale=qk_scale,
+ attn_drop=attn_drop,
+ proj_drop=drop,
+ pretrained_window_size=to_2tuple(pretrained_window_size))
+
+ self.drop_path = DropPath(
+ drop_path) if drop_path > 0. else nn.Identity()
+ self.norm2 = norm_layer(dim)
+ mlp_hidden_dim = int(dim * mlp_ratio)
+ if self.is_moe:
+ self.mlp = MoEMlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ num_local_experts=num_local_experts,
+ top_value=top_value,
+ capacity_factor=capacity_factor,
+ cosine_router=cosine_router,
+ normalize_gate=normalize_gate,
+ use_bpr=use_bpr,
+ is_gshard_loss=is_gshard_loss,
+ gate_noise=gate_noise,
+ cosine_router_dim=cosine_router_dim,
+ cosine_router_init_t=cosine_router_init_t,
+ moe_drop=moe_drop,
+ mlp_fc2_bias=mlp_fc2_bias,
+ init_std=init_std)
+ else:
+ self.mlp = Mlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ act_layer=act_layer,
+ drop=drop,
+ mlp_fc2_bias=mlp_fc2_bias)
+
+ if self.shift_size > 0:
+ # calculate attention mask for SW-MSA
+ H, W = self.input_resolution
+ img_mask = torch.zeros((1, H, W, 1)) # 1 H W 1
+ h_slices = (slice(0, -self.window_size),
+ slice(-self.window_size,
+ -self.shift_size), slice(-self.shift_size, None))
+ w_slices = (slice(0, -self.window_size),
+ slice(-self.window_size,
+ -self.shift_size), slice(-self.shift_size, None))
+ cnt = 0
+ for h in h_slices:
+ for w in w_slices:
+ img_mask[:, h, w, :] = cnt
+ cnt += 1
+
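+            # img_mask now labels each spatial region with an id; window
+            # positions whose ids differ receive a large negative bias (-100)
+            # so that, after softmax, attention across the cyclic-shift
+            # boundary is effectively masked out.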
+ mask_windows = window_partition(
+ img_mask, self.window_size) # nW, window_size, window_size, 1
+ mask_windows = mask_windows.view(
+ -1, self.window_size * self.window_size)
+ attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
+ attn_mask = attn_mask.masked_fill(attn_mask != 0,
+ float(-100.0)).masked_fill(
+ attn_mask == 0, float(0.0))
+ else:
+ attn_mask = None
+
+ self.register_buffer('attn_mask', attn_mask)
+
+ def forward(self, x):
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+
+ shortcut = x
+ x = self.norm1(x)
+ x = x.view(B, H, W, C)
+
+ # cyclic shift
+ if self.shift_size > 0:
+ shifted_x = torch.roll(
+ x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
+ else:
+ shifted_x = x
+
+ # partition windows
+ x_windows = window_partition(
+ shifted_x, self.window_size) # nW*B, window_size, window_size, C
+ x_windows = x_windows.view(-1, self.window_size * self.window_size,
+ C) # nW*B, window_size*window_size, C
+
+ # W-MSA/SW-MSA
+ attn_windows = self.attn(
+ x_windows, mask=self.attn_mask) # nW*B, window_size*window_size, C
+
+ # merge windows
+ attn_windows = attn_windows.view(-1, self.window_size,
+ self.window_size, C)
+ shifted_x = window_reverse(attn_windows, self.window_size, H,
+ W) # B H' W' C
+
+ # reverse cyclic shift
+ if self.shift_size > 0:
+ x = torch.roll(
+ shifted_x,
+ shifts=(self.shift_size, self.shift_size),
+ dims=(1, 2))
+ else:
+ x = shifted_x
+ x = x.view(B, H * W, C)
+ x = shortcut + self.drop_path(x)
+
+ # FFN
+ shortcut = x
+ x = self.norm2(x)
+ if self.is_moe:
+ x, l_aux = self.mlp(x)
+ x = shortcut + self.drop_path(x)
+ return x, l_aux
+ else:
+ x = shortcut + self.drop_path(self.mlp(x))
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, '
+ f'input_resolution={self.input_resolution}, '
+ f'num_heads={self.num_heads}, '
+ f'window_size={self.window_size}, '
+ f'shift_size={self.shift_size}, '
+ f'mlp_ratio={self.mlp_ratio}')
+
+ def flops(self):
+ flops = 0
+ H, W = self.input_resolution
+ # norm1
+ flops += self.dim * H * W
+ # W-MSA/SW-MSA
+ nW = H * W / self.window_size / self.window_size
+ flops += nW * self.attn.flops(self.window_size * self.window_size)
+ # mlp
+ if self.is_moe:
+ flops += (2 * H * W * self.dim * self.dim * self.mlp_ratio *
+ self.capacity_factor * self.top_value)
+ else:
+ flops += 2 * H * W * self.dim * self.dim * self.mlp_ratio
+ # norm2
+ flops += self.dim * H * W
+ return flops
+
+
+class PatchMerging(nn.Module):
+ r""" Patch Merging Layer.
+
+ Args:
+ input_resolution (tuple[int]): Resolution of input feature.
+ dim (int): Number of input channels.
+ norm_layer (nn.Module, optional): Normalization layer. Default:
+ nn.LayerNorm
+ """
+
+ def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
+ super().__init__()
+ self.input_resolution = input_resolution
+ self.dim = dim
+ self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
+ self.norm = norm_layer(4 * dim)
+
+ def forward(self, x):
+ """
+ x: B, H*W, C
+ """
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+        assert H % 2 == 0 and W % 2 == 0, f'x size ({H}*{W}) is not even.'
+
+ x = x.view(B, H, W, C)
+
+ x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C
+ x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C
+ x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C
+ x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C
+ x = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C
+ x = x.view(B, -1, 4 * C) # B H/2*W/2 4*C
+
+ x = self.norm(x)
+ x = self.reduction(x)
+
+ return x
+
+ def extra_repr(self) -> str:
+ return f'input_resolution={self.input_resolution}, dim={self.dim}'
+
+ def flops(self):
+ H, W = self.input_resolution
+ flops = H * W * self.dim
+ flops += (H // 2) * (W // 2) * 4 * self.dim * 2 * self.dim
+ return flops
+
+
+class BasicLayer(nn.Module):
+ """A basic Swin Transformer layer for one stage.
+
+ Args:
+ dim (int): Number of input channels.
+ input_resolution (tuple[int]): Input resolution.
+ depth (int): Number of blocks.
+ num_heads (int): Number of attention heads.
+ window_size (int): Local window size.
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+ qkv_bias (bool, optional): If True, add a learnable bias to query,
+ key, value. Default: True
+ qk_scale (float | None, optional): Override default qk scale of
+ head_dim ** -0.5 if set.
+ drop (float, optional): Dropout rate. Default: 0.0
+ attn_drop (float, optional): Attention dropout rate. Default: 0.0
+ drop_path (float | tuple[float], optional): Stochastic depth rate.
+ Default: 0.0
+ norm_layer (nn.Module, optional): Normalization layer. Default:
+ nn.LayerNorm
+ downsample (nn.Module | None, optional): Downsample layer at the end
+ of the layer. Default: None
+ mlp_fc2_bias (bool): Whether to add bias in fc2 of Mlp. Default: True
+ init_std: Initialization std. Default: 0.02
+ use_checkpoint (bool): Whether to use checkpointing to save memory.
+ Default: False.
+ pretrained_window_size (int): Local window size in pretraining.
+        moe_block (tuple(int)): The indices of the MoE blocks in this
+            stage. Default: [-1]
+ num_local_experts (int): number of local experts in each device (
+ GPU). Default: 1
+ top_value (int): the value of k in top-k gating. Default: 1
+ capacity_factor (float): the capacity factor in MoE. Default: 1.25
+ cosine_router (bool): Whether to use cosine router Default: False
+ normalize_gate (bool): Whether to normalize the gating score in top-k
+ gating. Default: False
+ use_bpr (bool): Whether to use batch-prioritized-routing. Default: True
+        is_gshard_loss (bool): If True, use the GShard balance loss;
+            if False, use the load loss and importance loss in
+            "arXiv:1701.06538". Default: True
+ gate_noise (float): the noise ratio in top-k gating. Default: 1.0
+ cosine_router_dim (int): Projection dimension in cosine router.
+ cosine_router_init_t (float): Initialization temperature in cosine
+ router.
+ moe_drop (float): Dropout rate in MoE. Default: 0.0
+ """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ depth,
+ num_heads,
+ window_size,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ qk_scale=None,
+ drop=0.,
+ attn_drop=0.,
+ drop_path=0.,
+ norm_layer=nn.LayerNorm,
+ downsample=None,
+ mlp_fc2_bias=True,
+ init_std=0.02,
+ use_checkpoint=False,
+ pretrained_window_size=0,
+ moe_block=[-1],
+ num_local_experts=1,
+ top_value=1,
+ capacity_factor=1.25,
+ cosine_router=False,
+ normalize_gate=False,
+ use_bpr=True,
+ is_gshard_loss=True,
+ cosine_router_dim=256,
+ cosine_router_init_t=0.5,
+ gate_noise=1.0,
+ moe_drop=0.0):
+
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.depth = depth
+ self.use_checkpoint = use_checkpoint
+
+ # build blocks
+ self.blocks = nn.ModuleList([
+ SwinTransformerBlock(
+ dim=dim,
+ input_resolution=input_resolution,
+ num_heads=num_heads,
+ window_size=window_size,
+ shift_size=0 if (i % 2 == 0) else window_size // 2,
+ mlp_ratio=mlp_ratio,
+ qkv_bias=qkv_bias,
+ qk_scale=qk_scale,
+ drop=drop,
+ attn_drop=attn_drop,
+ drop_path=drop_path[i]
+ if isinstance(drop_path, list) else drop_path,
+ norm_layer=norm_layer,
+ mlp_fc2_bias=mlp_fc2_bias,
+ init_std=init_std,
+ pretrained_window_size=pretrained_window_size,
+                is_moe=i in moe_block,
+ num_local_experts=num_local_experts,
+ top_value=top_value,
+ capacity_factor=capacity_factor,
+ cosine_router=cosine_router,
+ normalize_gate=normalize_gate,
+ use_bpr=use_bpr,
+ is_gshard_loss=is_gshard_loss,
+ gate_noise=gate_noise,
+ cosine_router_dim=cosine_router_dim,
+ cosine_router_init_t=cosine_router_init_t,
+ moe_drop=moe_drop) for i in range(depth)
+ ])
+
+ # patch merging layer
+ if downsample is not None:
+ self.downsample = downsample(
+ input_resolution, dim=dim, norm_layer=norm_layer)
+ else:
+ self.downsample = None
+
+ def forward(self, x):
+ l_aux = 0.0
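+        # Accumulate the auxiliary load-balancing losses of any MoE blocks in
+        # this stage; plain blocks return a single tensor and contribute
+        # nothing.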
+ for blk in self.blocks:
+ if self.use_checkpoint:
+ out = checkpoint.checkpoint(blk, x)
+ else:
+ out = blk(x)
+ if isinstance(out, tuple):
+ x = out[0]
+ cur_l_aux = out[1]
+ l_aux = cur_l_aux + l_aux
+ else:
+ x = out
+
+ if self.downsample is not None:
+ x = self.downsample(x)
+ return x, l_aux
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, input_resolution={self.input_resolution}, '
+ f'depth={self.depth}')
+
+ def flops(self):
+ flops = 0
+ for blk in self.blocks:
+ flops += blk.flops()
+ if self.downsample is not None:
+ flops += self.downsample.flops()
+ return flops
+
+
+class PatchEmbed(nn.Module):
+ r""" Image to Patch Embedding
+
+ Args:
+ img_size (int): Image size. Default: 224.
+ patch_size (int): Patch token size. Default: 4.
+ in_chans (int): Number of input image channels. Default: 3.
+ embed_dim (int): Number of linear projection output channels.
+ Default: 96.
+ norm_layer (nn.Module, optional): Normalization layer. Default: None
+ """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ embed_dim=96,
+ norm_layer=None):
+ super().__init__()
+ img_size = to_2tuple(img_size)
+ patch_size = to_2tuple(patch_size)
+ patches_resolution = [
+ img_size[0] // patch_size[0], img_size[1] // patch_size[1]
+ ]
+ self.img_size = img_size
+ self.patch_size = patch_size
+ self.patches_resolution = patches_resolution
+ self.num_patches = patches_resolution[0] * patches_resolution[1]
+
+ self.in_chans = in_chans
+ self.embed_dim = embed_dim
+
+ self.proj = nn.Conv2d(
+ in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
+ if norm_layer is not None:
+ self.norm = norm_layer(embed_dim)
+ else:
+ self.norm = None
+
+ def forward(self, x):
+ B, C, H, W = x.shape
+ # FIXME look at relaxing size constraints
+ assert H == self.img_size[0] and W == self.img_size[1], \
+ (f"Input image size ({H}*{W}) doesn't match model ("
+ f'{self.img_size[0]}*{self.img_size[1]}).')
+ x = self.proj(x).flatten(2).transpose(1, 2) # B Ph*Pw C
+ if self.norm is not None:
+ x = self.norm(x)
+ return x
+
+ def flops(self):
+ Ho, Wo = self.patches_resolution
+ flops = Ho * Wo * self.embed_dim * self.in_chans * (
+ self.patch_size[0] * self.patch_size[1])
+ if self.norm is not None:
+ flops += Ho * Wo * self.embed_dim
+ return flops
+
+
+class SwinTransformerMoE(nn.Module):
+ r""" Swin Transformer
+ A PyTorch impl of : `Swin Transformer: Hierarchical Vision
+ Transformer using Shifted Windows` -
+ https://arxiv.org/pdf/2103.14030
+
+ Args:
+ img_size (int | tuple(int)): Input image size. Default 224
+ patch_size (int | tuple(int)): Patch size. Default: 4
+ in_chans (int): Number of input image channels. Default: 3
+ num_classes (int): Number of classes for classification head.
+ Default: 1000
+ embed_dim (int): Patch embedding dimension. Default: 96
+ depths (tuple(int)): Depth of each Swin Transformer layer.
+ num_heads (tuple(int)): Number of attention heads in different layers.
+ window_size (int): Window size. Default: 7
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4
+ qkv_bias (bool): If True, add a learnable bias to query, key, value.
+ Default: True
+ qk_scale (float): Override default qk scale of head_dim ** -0.5 if
+ set. Default: None
+ drop_rate (float): Dropout rate. Default: 0
+ attn_drop_rate (float): Attention dropout rate. Default: 0
+ drop_path_rate (float): Stochastic depth rate. Default: 0.1
+ norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
+ ape (bool): If True, add absolute position embedding to the patch
+ embedding. Default: False
+ patch_norm (bool): If True, add normalization after patch embedding.
+ Default: True
+ mlp_fc2_bias (bool): Whether to add bias in fc2 of Mlp. Default: True
+ init_std: Initialization std. Default: 0.02
+ use_checkpoint (bool): Whether to use checkpointing to save memory.
+ Default: False
+ pretrained_window_sizes (tuple(int)): Pretrained window sizes of each
+ layer.
+ moe_blocks (tuple(tuple(int))): The index of each MoE block in each
+ layer.
+ num_local_experts (int): number of local experts in each device (
+ GPU). Default: 1
+ top_value (int): the value of k in top-k gating. Default: 1
+ capacity_factor (float): the capacity factor in MoE. Default: 1.25
+ cosine_router (bool): Whether to use cosine router Default: False
+ normalize_gate (bool): Whether to normalize the gating score in top-k
+ gating. Default: False
+ use_bpr (bool): Whether to use batch-prioritized-routing. Default: True
+        is_gshard_loss (bool): If True, use the GShard balance loss;
+            if False, use the load loss and importance loss in
+            "arXiv:1701.06538". Default: True
+ gate_noise (float): the noise ratio in top-k gating. Default: 1.0
+ cosine_router_dim (int): Projection dimension in cosine router.
+ cosine_router_init_t (float): Initialization temperature in cosine
+ router.
+ moe_drop (float): Dropout rate in MoE. Default: 0.0
+        aux_loss_weight (float): Auxiliary MoE loss weight. Default: 0.01
+ """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ num_classes=1000,
+ embed_dim=96,
+ depths=[2, 2, 6, 2],
+ num_heads=[3, 6, 12, 24],
+ window_size=7,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ qk_scale=None,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.1,
+ norm_layer=nn.LayerNorm,
+ ape=False,
+ patch_norm=True,
+ mlp_fc2_bias=True,
+ init_std=0.02,
+ use_checkpoint=False,
+ pretrained_window_sizes=[0, 0, 0, 0],
+ moe_blocks=[[-1], [-1], [-1], [-1]],
+ num_local_experts=1,
+ top_value=1,
+ capacity_factor=1.25,
+ cosine_router=False,
+ normalize_gate=False,
+ use_bpr=True,
+ is_gshard_loss=True,
+ gate_noise=1.0,
+ cosine_router_dim=256,
+ cosine_router_init_t=0.5,
+ moe_drop=0.0,
+ aux_loss_weight=0.01,
+ **kwargs):
+ super().__init__()
+ self._ddp_params_and_buffers_to_ignore = list()
+
+ self.num_classes = num_classes
+ self.num_layers = len(depths)
+ self.embed_dim = embed_dim
+ self.ape = ape
+ self.patch_norm = patch_norm
+ self.num_features = int(embed_dim * 2**(self.num_layers - 1))
+ self.mlp_ratio = mlp_ratio
+ self.init_std = init_std
+ self.aux_loss_weight = aux_loss_weight
+ self.num_local_experts = num_local_experts
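+        # Expert placement follows the Tutel convention (assumed here): a
+        # positive num_local_experts puts that many experts on every GPU,
+        # while a negative value shards a single expert across
+        # |num_local_experts| GPUs.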
+ self.global_experts = num_local_experts * dist.get_world_size() if (
+ num_local_experts > 0) \
+ else dist.get_world_size() // (-num_local_experts)
+ self.sharded_count = (
+ 1.0 / num_local_experts) if num_local_experts > 0 else (
+ -num_local_experts)
+
+ # split image into non-overlapping patches
+ self.patch_embed = PatchEmbed(
+ img_size=img_size,
+ patch_size=patch_size,
+ in_chans=in_chans,
+ embed_dim=embed_dim,
+ norm_layer=norm_layer if self.patch_norm else None)
+ num_patches = self.patch_embed.num_patches
+ patches_resolution = self.patch_embed.patches_resolution
+ self.patches_resolution = patches_resolution
+
+ # absolute position embedding
+ if self.ape:
+ self.absolute_pos_embed = nn.Parameter(
+ torch.zeros(1, num_patches, embed_dim))
+ trunc_normal_(self.absolute_pos_embed, std=self.init_std)
+
+ self.pos_drop = nn.Dropout(p=drop_rate)
+
+ # stochastic depth
+ dpr = [
+ x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))
+ ] # stochastic depth decay rule
+
+ # build layers
+ self.layers = nn.ModuleList()
+ for i_layer in range(self.num_layers):
+ layer = BasicLayer(
+ dim=int(embed_dim * 2**i_layer),
+ input_resolution=(patches_resolution[0] // (2**i_layer),
+ patches_resolution[1] // (2**i_layer)),
+ depth=depths[i_layer],
+ num_heads=num_heads[i_layer],
+ window_size=window_size,
+ mlp_ratio=self.mlp_ratio,
+ qkv_bias=qkv_bias,
+ qk_scale=qk_scale,
+ drop=drop_rate,
+ attn_drop=attn_drop_rate,
+ drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])],
+ norm_layer=norm_layer,
+ downsample=PatchMerging if
+ (i_layer < self.num_layers - 1) else None,
+ mlp_fc2_bias=mlp_fc2_bias,
+ init_std=init_std,
+ use_checkpoint=use_checkpoint,
+ pretrained_window_size=pretrained_window_sizes[i_layer],
+ moe_block=moe_blocks[i_layer],
+ num_local_experts=num_local_experts,
+ top_value=top_value,
+ capacity_factor=capacity_factor,
+ cosine_router=cosine_router,
+ normalize_gate=normalize_gate,
+ use_bpr=use_bpr,
+ is_gshard_loss=is_gshard_loss,
+ gate_noise=gate_noise,
+ cosine_router_dim=cosine_router_dim,
+ cosine_router_init_t=cosine_router_init_t,
+ moe_drop=moe_drop)
+ self.layers.append(layer)
+
+ self.norm = norm_layer(self.num_features)
+ self.avgpool = nn.AdaptiveAvgPool1d(1)
+ self.head = nn.Linear(
+ self.num_features,
+ num_classes) if num_classes > 0 else nn.Identity()
+
+ self.apply(self._init_weights)
+
+ def _init_weights(self, m):
+ if isinstance(m, nn.Linear):
+ trunc_normal_(m.weight, std=self.init_std)
+ if isinstance(m, nn.Linear) and m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+ elif isinstance(m, MoEMlp):
+ m._init_weights()
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ return {'absolute_pos_embed'}
+
+ @torch.jit.ignore
+ def no_weight_decay_keywords(self):
+ return {
+ 'cpb_mlp', 'relative_position_bias_table', 'fc1_bias', 'fc2_bias',
+ 'temperature', 'cosine_projector', 'sim_matrix'
+ }
+
+ def forward_features(self, x):
+ x = self.patch_embed(x)
+ if self.ape:
+ x = x + self.absolute_pos_embed
+ x = self.pos_drop(x)
+ l_aux = 0.0
+ for layer in self.layers:
+ x, cur_l_aux = layer(x)
+ l_aux = cur_l_aux + l_aux
+
+ x = self.norm(x) # B L C
+ x = self.avgpool(x.transpose(1, 2)) # B C 1
+ x = torch.flatten(x, 1)
+ return x, l_aux
+
+ def forward(self, x):
+ x, l_aux = self.forward_features(x)
+ x = self.head(x)
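+        # Return the weighted auxiliary MoE loss together with the logits so
+        # that the training loop can add it to the task loss.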
+ return x, l_aux * self.aux_loss_weight
+
+ def add_param_to_skip_allreduce(self, param_name):
+ self._ddp_params_and_buffers_to_ignore.append(param_name)
+
+ def flops(self):
+ flops = 0
+ flops += self.patch_embed.flops()
+ for i, layer in enumerate(self.layers):
+ flops += layer.flops()
+ flops += self.num_features * self.patches_resolution[
+ 0] * self.patches_resolution[1] // (2**self.num_layers)
+ flops += self.num_features * self.num_classes
+ return flops
diff --git a/projects/pose_anything/models/backbones/swin_transformer_v2.py b/projects/pose_anything/models/backbones/swin_transformer_v2.py
new file mode 100644
index 0000000000..0ec6f66448
--- /dev/null
+++ b/projects/pose_anything/models/backbones/swin_transformer_v2.py
@@ -0,0 +1,826 @@
+# --------------------------------------------------------
+# Swin Transformer V2
+# Copyright (c) 2022 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ze Liu
+# --------------------------------------------------------
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.utils.checkpoint as checkpoint
+from timm.models.layers import DropPath, to_2tuple, trunc_normal_
+
+from mmpose.models.builder import BACKBONES
+
+
+class Mlp(nn.Module):
+
+ def __init__(self,
+ in_features,
+ hidden_features=None,
+ out_features=None,
+ act_layer=nn.GELU,
+ drop=0.):
+ super().__init__()
+ out_features = out_features or in_features
+ hidden_features = hidden_features or in_features
+ self.fc1 = nn.Linear(in_features, hidden_features)
+ self.act = act_layer()
+ self.fc2 = nn.Linear(hidden_features, out_features)
+ self.drop = nn.Dropout(drop)
+
+ def forward(self, x):
+ x = self.fc1(x)
+ x = self.act(x)
+ x = self.drop(x)
+ x = self.fc2(x)
+ x = self.drop(x)
+ return x
+
+
+def window_partition(x, window_size):
+ """
+ Args:
+ x: (B, H, W, C)
+ window_size (int): window size
+
+ Returns:
+ windows: (num_windows*B, window_size, window_size, C)
+ """
+ B, H, W, C = x.shape
+ x = x.view(B, H // window_size, window_size, W // window_size, window_size,
+ C)
+ windows = x.permute(0, 1, 3, 2, 4,
+ 5).contiguous().view(-1, window_size, window_size, C)
+ return windows
+
+
+def window_reverse(windows, window_size, H, W):
+ """
+ Args:
+ windows: (num_windows*B, window_size, window_size, C)
+ window_size (int): Window size
+ H (int): Height of image
+ W (int): Width of image
+
+ Returns:
+ x: (B, H, W, C)
+ """
+ B = int(windows.shape[0] / (H * W / window_size / window_size))
+ x = windows.view(B, H // window_size, W // window_size, window_size,
+ window_size, -1)
+ x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
+ return x
+
+
+class WindowAttention(nn.Module):
+ r""" Window based multi-head self attention (W-MSA) module with relative
+ position bias. It supports both of shifted and non-shifted window.
+
+ Args: dim (int): Number of input channels. window_size (tuple[int]): The
+ height and width of the window. num_heads (int): Number of attention
+ heads. qkv_bias (bool, optional): If True, add a learnable bias to
+ query, key, value. Default: True attn_drop (float, optional): Dropout
+ ratio of attention weight. Default: 0.0 proj_drop (float, optional):
+ Dropout ratio of output. Default: 0.0 pretrained_window_size (tuple[
+ int]): The height and width of the window in pre-training.
+ """
+
+ def __init__(self,
+ dim,
+ window_size,
+ num_heads,
+ qkv_bias=True,
+ attn_drop=0.,
+ proj_drop=0.,
+ pretrained_window_size=[0, 0]):
+
+ super().__init__()
+ self.dim = dim
+ self.window_size = window_size # Wh, Ww
+ self.pretrained_window_size = pretrained_window_size
+ self.num_heads = num_heads
+
+ self.logit_scale = nn.Parameter(
+ torch.log(10 * torch.ones((num_heads, 1, 1))), requires_grad=True)
+
+ # mlp to generate continuous relative position bias
+ self.cpb_mlp = nn.Sequential(
+ nn.Linear(2, 512, bias=True), nn.ReLU(inplace=True),
+ nn.Linear(512, num_heads, bias=False))
+
+ # get relative_coords_table
+ relative_coords_h = torch.arange(
+ -(self.window_size[0] - 1),
+ self.window_size[0],
+ dtype=torch.float32)
+ relative_coords_w = torch.arange(
+ -(self.window_size[1] - 1),
+ self.window_size[1],
+ dtype=torch.float32)
+ relative_coords_table = torch.stack(
+ torch.meshgrid([relative_coords_h, relative_coords_w])).permute(
+ 1, 2, 0).contiguous().unsqueeze(0) # 1, 2*Wh-1, 2*Ww-1, 2
+ if pretrained_window_size[0] > 0:
+ relative_coords_table[:, :, :, 0] /= (
+ pretrained_window_size[0] - 1)
+ relative_coords_table[:, :, :, 1] /= (
+ pretrained_window_size[1] - 1)
+ else:
+ relative_coords_table[:, :, :, 0] /= (self.window_size[0] - 1)
+ relative_coords_table[:, :, :, 1] /= (self.window_size[1] - 1)
+ relative_coords_table *= 8 # normalize to -8, 8
+ relative_coords_table = torch.sign(relative_coords_table) * torch.log2(
+ torch.abs(relative_coords_table) + 1.0) / np.log2(8)
+
+ self.register_buffer('relative_coords_table', relative_coords_table)
+
+ # get pair-wise relative position index for each token inside the
+ # window
+ coords_h = torch.arange(self.window_size[0])
+ coords_w = torch.arange(self.window_size[1])
+ coords = torch.stack(torch.meshgrid([coords_h, coords_w])) # 2, Wh, Ww
+ coords_flatten = torch.flatten(coords, 1) # 2, Wh*Ww
+        relative_coords = (coords_flatten[:, :, None] -
+                           coords_flatten[:, None, :])  # 2, Wh*Ww, Wh*Ww
+ relative_coords = relative_coords.permute(
+ 1, 2, 0).contiguous() # Wh*Ww, Wh*Ww, 2
+ relative_coords[:, :,
+ 0] += self.window_size[0] - 1 # shift to start from 0
+ relative_coords[:, :, 1] += self.window_size[1] - 1
+ relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
+ relative_position_index = relative_coords.sum(-1) # Wh*Ww, Wh*Ww
+ self.register_buffer('relative_position_index',
+ relative_position_index)
+
+ self.qkv = nn.Linear(dim, dim * 3, bias=False)
+ if qkv_bias:
+ self.q_bias = nn.Parameter(torch.zeros(dim))
+ self.v_bias = nn.Parameter(torch.zeros(dim))
+ else:
+ self.q_bias = None
+ self.v_bias = None
+ self.attn_drop = nn.Dropout(attn_drop)
+ self.proj = nn.Linear(dim, dim)
+ self.proj_drop = nn.Dropout(proj_drop)
+ self.softmax = nn.Softmax(dim=-1)
+
+ def forward(self, x, mask=None):
+ """
+        Args:
+            x: input features with shape of (num_windows*B, N, C)
+            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww)
+                or None
+ """
+ B_, N, C = x.shape
+ qkv_bias = None
+ if self.q_bias is not None:
+ qkv_bias = torch.cat(
+ (self.q_bias,
+ torch.zeros_like(self.v_bias,
+ requires_grad=False), self.v_bias))
+ qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
+ qkv = qkv.reshape(B_, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
+ q, k, v = qkv[0], qkv[1], qkv[
+ 2] # make torchscript happy (cannot use tensor as tuple)
+
+ # cosine attention
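+        # (Swin V2) queries and keys are L2-normalized, so q @ k^T is a cosine
+        # similarity; it is rescaled by a learnable per-head logit scale that
+        # is clamped to at most 1 / 0.01 = 100.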
+ attn = (
+ F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1))
+ logit_scale = torch.clamp(
+ self.logit_scale,
+ max=torch.log(torch.tensor(1. / 0.01, device=x.device))).exp()
+ attn = attn * logit_scale
+
+ relative_position_bias_table = self.cpb_mlp(
+ self.relative_coords_table).view(-1, self.num_heads)
+ relative_position_bias = relative_position_bias_table[
+ self.relative_position_index.view(-1)].view(
+ self.window_size[0] * self.window_size[1],
+ self.window_size[0] * self.window_size[1],
+ -1) # Wh*Ww,Wh*Ww,nH
+ relative_position_bias = relative_position_bias.permute(
+ 2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
+ relative_position_bias = 16 * torch.sigmoid(relative_position_bias)
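+        # The sigmoid above bounds the MLP-predicted bias to (0, 16) before it
+        # is added to the attention logits.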
+ attn = attn + relative_position_bias.unsqueeze(0)
+
+ if mask is not None:
+ nW = mask.shape[0]
+ attn = attn.view(B_ // nW, nW, self.num_heads, N,
+ N) + mask.unsqueeze(1).unsqueeze(0)
+ attn = attn.view(-1, self.num_heads, N, N)
+ attn = self.softmax(attn)
+ else:
+ attn = self.softmax(attn)
+
+ attn = self.attn_drop(attn)
+
+ x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
+ x = self.proj(x)
+ x = self.proj_drop(x)
+ return x
+
+ def extra_repr(self) -> str:
+        return (f'dim={self.dim}, window_size={self.window_size}, '
+                f'pretrained_window_size={self.pretrained_window_size}, '
+                f'num_heads={self.num_heads}')
+
+ def flops(self, N):
+ # calculate flops for 1 window with token length of N
+ flops = 0
+ # qkv = self.qkv(x)
+ flops += N * self.dim * 3 * self.dim
+ # attn = (q @ k.transpose(-2, -1))
+ flops += self.num_heads * N * (self.dim // self.num_heads) * N
+ # x = (attn @ v)
+ flops += self.num_heads * N * N * (self.dim // self.num_heads)
+ # x = self.proj(x)
+ flops += N * self.dim * self.dim
+ return flops
+
+
+class SwinTransformerBlock(nn.Module):
+ r""" Swin Transformer Block.
+
+    Args:
+        dim (int): Number of input channels.
+        input_resolution (tuple[int]): Input resolution.
+        num_heads (int): Number of attention heads.
+        window_size (int): Window size.
+        shift_size (int): Shift size for SW-MSA.
+        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+        qkv_bias (bool, optional): If True, add a learnable bias to query,
+            key, value. Default: True
+        drop (float, optional): Dropout rate. Default: 0.0
+        attn_drop (float, optional): Attention dropout rate. Default: 0.0
+        drop_path (float, optional): Stochastic depth rate. Default: 0.0
+        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
+        norm_layer (nn.Module, optional): Normalization layer. Default:
+            nn.LayerNorm
+        pretrained_window_size (int): Window size in pre-training.
+    """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ num_heads,
+ window_size=7,
+ shift_size=0,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ drop=0.,
+ attn_drop=0.,
+ drop_path=0.,
+ act_layer=nn.GELU,
+ norm_layer=nn.LayerNorm,
+ pretrained_window_size=0):
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.num_heads = num_heads
+ self.window_size = window_size
+ self.shift_size = shift_size
+ self.mlp_ratio = mlp_ratio
+ if min(self.input_resolution) <= self.window_size:
+ # if window size is larger than input resolution, we don't
+ # partition windows
+ self.shift_size = 0
+ self.window_size = min(self.input_resolution)
+        assert 0 <= self.shift_size < self.window_size, (
+            'shift_size must be in the range [0, window_size)')
+
+ self.norm1 = norm_layer(dim)
+ self.attn = WindowAttention(
+ dim,
+ window_size=to_2tuple(self.window_size),
+ num_heads=num_heads,
+ qkv_bias=qkv_bias,
+ attn_drop=attn_drop,
+ proj_drop=drop,
+ pretrained_window_size=to_2tuple(pretrained_window_size))
+
+ self.drop_path = DropPath(
+ drop_path) if drop_path > 0. else nn.Identity()
+ self.norm2 = norm_layer(dim)
+ mlp_hidden_dim = int(dim * mlp_ratio)
+ self.mlp = Mlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ act_layer=act_layer,
+ drop=drop)
+
+ if self.shift_size > 0:
+ # calculate attention mask for SW-MSA
+ H, W = self.input_resolution
+ img_mask = torch.zeros((1, H, W, 1)) # 1 H W 1
+ h_slices = (slice(0, -self.window_size),
+ slice(-self.window_size,
+ -self.shift_size), slice(-self.shift_size, None))
+ w_slices = (slice(0, -self.window_size),
+ slice(-self.window_size,
+ -self.shift_size), slice(-self.shift_size, None))
+ cnt = 0
+ for h in h_slices:
+ for w in w_slices:
+ img_mask[:, h, w, :] = cnt
+ cnt += 1
+
+ mask_windows = window_partition(
+ img_mask, self.window_size) # nW, window_size, window_size, 1
+ mask_windows = mask_windows.view(
+ -1, self.window_size * self.window_size)
+ attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
+ attn_mask = attn_mask.masked_fill(attn_mask != 0,
+ float(-100.0)).masked_fill(
+ attn_mask == 0, float(0.0))
+ else:
+ attn_mask = None
+
+ self.register_buffer('attn_mask', attn_mask)
+
+ def forward(self, x):
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+
+ shortcut = x
+ x = x.view(B, H, W, C)
+
+ # cyclic shift
+ if self.shift_size > 0:
+ shifted_x = torch.roll(
+ x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
+ else:
+ shifted_x = x
+
+ # partition windows
+ x_windows = window_partition(
+ shifted_x, self.window_size) # nW*B, window_size, window_size, C
+ x_windows = x_windows.view(-1, self.window_size * self.window_size,
+ C) # nW*B, window_size*window_size, C
+
+ # W-MSA/SW-MSA
+ attn_windows = self.attn(
+ x_windows, mask=self.attn_mask) # nW*B, window_size*window_size, C
+
+ # merge windows
+ attn_windows = attn_windows.view(-1, self.window_size,
+ self.window_size, C)
+ shifted_x = window_reverse(attn_windows, self.window_size, H,
+ W) # B H' W' C
+
+ # reverse cyclic shift
+ if self.shift_size > 0:
+ x = torch.roll(
+ shifted_x,
+ shifts=(self.shift_size, self.shift_size),
+ dims=(1, 2))
+ else:
+ x = shifted_x
+ x = x.view(B, H * W, C)
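+        # (Swin V2) Residual post-normalization: LayerNorm is applied to the
+        # block output right before the residual addition, instead of to the
+        # block input as in Swin V1.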
+ x = shortcut + self.drop_path(self.norm1(x))
+
+ # FFN
+ x = x + self.drop_path(self.norm2(self.mlp(x)))
+
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, input_resolution={self.input_resolution}, '
+ f'num_heads={self.num_heads}, '
+ f'window_size={self.window_size}, '
+ f'shift_size={self.shift_size}, '
+ f'mlp_ratio={self.mlp_ratio}')
+
+ def flops(self):
+ flops = 0
+ H, W = self.input_resolution
+ # norm1
+ flops += self.dim * H * W
+ # W-MSA/SW-MSA
+ nW = H * W / self.window_size / self.window_size
+ flops += nW * self.attn.flops(self.window_size * self.window_size)
+ # mlp
+ flops += 2 * H * W * self.dim * self.dim * self.mlp_ratio
+ # norm2
+ flops += self.dim * H * W
+ return flops
+
+
+class PatchMerging(nn.Module):
+ r""" Patch Merging Layer.
+
+    Args:
+        input_resolution (tuple[int]): Resolution of input feature.
+        dim (int): Number of input channels.
+        norm_layer (nn.Module, optional): Normalization layer. Default:
+            nn.LayerNorm
+    """
+
+ def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
+ super().__init__()
+ self.input_resolution = input_resolution
+ self.dim = dim
+ self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
+ self.norm = norm_layer(2 * dim)
+
+ def forward(self, x):
+ """
+ x: B, H*W, C
+ """
+ H, W = self.input_resolution
+ B, L, C = x.shape
+ assert L == H * W, 'input feature has wrong size'
+        assert H % 2 == 0 and W % 2 == 0, f'x size ({H}*{W}) is not even.'
+
+ x = x.view(B, H, W, C)
+
+ x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C
+ x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C
+ x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C
+ x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C
+ x = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C
+ x = x.view(B, -1, 4 * C) # B H/2*W/2 4*C
+
+ x = self.reduction(x)
+ x = self.norm(x)
+
+ return x
+
+ def extra_repr(self) -> str:
+ return f'input_resolution={self.input_resolution}, dim={self.dim}'
+
+ def flops(self):
+ H, W = self.input_resolution
+ flops = (H // 2) * (W // 2) * 4 * self.dim * 2 * self.dim
+ flops += H * W * self.dim // 2
+ return flops
+
+
+class BasicLayer(nn.Module):
+ """A basic Swin Transformer layer for one stage.
+
+    Args:
+        dim (int): Number of input channels.
+        input_resolution (tuple[int]): Input resolution.
+        depth (int): Number of blocks.
+        num_heads (int): Number of attention heads.
+        window_size (int): Local window size.
+        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
+        qkv_bias (bool, optional): If True, add a learnable bias to query,
+            key, value. Default: True
+        drop (float, optional): Dropout rate. Default: 0.0
+        attn_drop (float, optional): Attention dropout rate. Default: 0.0
+        drop_path (float | tuple[float], optional): Stochastic depth rate.
+            Default: 0.0
+        norm_layer (nn.Module, optional): Normalization layer. Default:
+            nn.LayerNorm
+        downsample (nn.Module | None, optional): Downsample layer at the end
+            of the layer. Default: None
+        use_checkpoint (bool): Whether to use checkpointing to save memory.
+            Default: False.
+        pretrained_window_size (int): Local window size in pre-training.
+    """
+
+ def __init__(self,
+ dim,
+ input_resolution,
+ depth,
+ num_heads,
+ window_size,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ drop=0.,
+ attn_drop=0.,
+ drop_path=0.,
+ norm_layer=nn.LayerNorm,
+ downsample=None,
+ use_checkpoint=False,
+ pretrained_window_size=0):
+
+ super().__init__()
+ self.dim = dim
+ self.input_resolution = input_resolution
+ self.depth = depth
+ self.use_checkpoint = use_checkpoint
+
+ # build blocks
+ self.blocks = nn.ModuleList([
+ SwinTransformerBlock(
+ dim=dim,
+ input_resolution=input_resolution,
+ num_heads=num_heads,
+ window_size=window_size,
+ shift_size=0 if (i % 2 == 0) else window_size // 2,
+ mlp_ratio=mlp_ratio,
+ qkv_bias=qkv_bias,
+ drop=drop,
+ attn_drop=attn_drop,
+ drop_path=drop_path[i]
+ if isinstance(drop_path, list) else drop_path,
+ norm_layer=norm_layer,
+ pretrained_window_size=pretrained_window_size)
+ for i in range(depth)
+ ])
+
+ # patch merging layer
+ if downsample is not None:
+ self.downsample = downsample(
+ input_resolution, dim=dim, norm_layer=norm_layer)
+ else:
+ self.downsample = None
+
+ def forward(self, x):
+ for blk in self.blocks:
+ if self.use_checkpoint:
+ x = checkpoint.checkpoint(blk, x)
+ else:
+ x = blk(x)
+ if self.downsample is not None:
+ x = self.downsample(x)
+ return x
+
+ def extra_repr(self) -> str:
+ return (f'dim={self.dim}, input_resolution={self.input_resolution}, '
+ f'depth={self.depth}')
+
+ def flops(self):
+ flops = 0
+ for blk in self.blocks:
+ flops += blk.flops()
+ if self.downsample is not None:
+ flops += self.downsample.flops()
+ return flops
+
+ def _init_respostnorm(self):
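+        # Zero-init the post-norm weights and biases so each residual branch
+        # initially outputs zero and every block starts as an identity
+        # mapping (the res-post-norm initialization used by Swin V2).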
+ for blk in self.blocks:
+ nn.init.constant_(blk.norm1.bias, 0)
+ nn.init.constant_(blk.norm1.weight, 0)
+ nn.init.constant_(blk.norm2.bias, 0)
+ nn.init.constant_(blk.norm2.weight, 0)
+
+
+class PatchEmbed(nn.Module):
+ r""" Image to Patch Embedding
+
+    Args:
+        img_size (int): Image size. Default: 224.
+        patch_size (int): Patch token size. Default: 4.
+        in_chans (int): Number of input image channels. Default: 3.
+        embed_dim (int): Number of linear projection output channels.
+            Default: 96.
+        norm_layer (nn.Module, optional): Normalization layer. Default: None
+    """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ embed_dim=96,
+ norm_layer=None):
+ super().__init__()
+ img_size = to_2tuple(img_size)
+ patch_size = to_2tuple(patch_size)
+ patches_resolution = [
+ img_size[0] // patch_size[0], img_size[1] // patch_size[1]
+ ]
+ self.img_size = img_size
+ self.patch_size = patch_size
+ self.patches_resolution = patches_resolution
+ self.num_patches = patches_resolution[0] * patches_resolution[1]
+
+ self.in_chans = in_chans
+ self.embed_dim = embed_dim
+
+ self.proj = nn.Conv2d(
+ in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
+ if norm_layer is not None:
+ self.norm = norm_layer(embed_dim)
+ else:
+ self.norm = None
+
+ def forward(self, x):
+ B, C, H, W = x.shape
+ # FIXME look at relaxing size constraints
+ assert H == self.img_size[0] and W == self.img_size[1], \
+ (f"Input image size ({H}*{W}) doesn't match model "
+ f'({self.img_size[0]}*{self.img_size[1]}).')
+ x = self.proj(x).flatten(2).transpose(1, 2) # B Ph*Pw C
+ if self.norm is not None:
+ x = self.norm(x)
+ return x
+
+ def flops(self):
+ Ho, Wo = self.patches_resolution
+ flops = Ho * Wo * self.embed_dim * self.in_chans * (
+ self.patch_size[0] * self.patch_size[1])
+ if self.norm is not None:
+ flops += Ho * Wo * self.embed_dim
+ return flops
+
+
+@BACKBONES.register_module()
+class SwinTransformerV2(nn.Module):
+ r""" Swin Transformer A PyTorch impl of : `Swin Transformer: Hierarchical
+ Vision Transformer using Shifted Windows` -
+ https://arxiv.org/pdf/2103.14030
+
+ Args: img_size (int | tuple(int)): Input image size. Default 224
+ patch_size (int | tuple(int)): Patch size. Default: 4 in_chans (int):
+ Number of input image channels. Default: 3 num_classes (int): Number of
+ classes for classification head. Default: 1000 embed_dim (int): Patch
+ embedding dimension. Default: 96 depths (tuple(int)): Depth of each Swin
+ Transformer layer. num_heads (tuple(int)): Number of attention heads in
+ different layers. window_size (int): Window size. Default: 7 mlp_ratio (
+ float): Ratio of mlp hidden dim to embedding dim. Default: 4 qkv_bias (
+ bool): If True, add a learnable bias to query, key, value. Default: True
+ drop_rate (float): Dropout rate. Default: 0 attn_drop_rate (float):
+ Attention dropout rate. Default: 0 drop_path_rate (float): Stochastic
+ depth rate. Default: 0.1 norm_layer (nn.Module): Normalization layer.
+ Default: nn.LayerNorm. ape (bool): If True, add absolute position
+ embedding to the patch embedding. Default: False patch_norm (bool): If
+ True, add normalization after patch embedding. Default: True
+ use_checkpoint (bool): Whether to use checkpointing to save memory.
+ Default: False pretrained_window_sizes (tuple(int)): Pretrained window
+ sizes of each layer.
+ """
+
+ def __init__(self,
+ img_size=224,
+ patch_size=4,
+ in_chans=3,
+ num_classes=1000,
+ embed_dim=96,
+ depths=[2, 2, 6, 2],
+ num_heads=[3, 6, 12, 24],
+ window_size=7,
+ mlp_ratio=4.,
+ qkv_bias=True,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.1,
+ norm_layer=nn.LayerNorm,
+ ape=False,
+ patch_norm=True,
+ use_checkpoint=False,
+ pretrained_window_sizes=[0, 0, 0, 0],
+ multi_scale=False,
+ upsample='deconv',
+ **kwargs):
+ super().__init__()
+
+ self.num_classes = num_classes
+ self.num_layers = len(depths)
+ self.embed_dim = embed_dim
+ self.ape = ape
+ self.patch_norm = patch_norm
+ self.num_features = int(embed_dim * 2**(self.num_layers - 1))
+ self.mlp_ratio = mlp_ratio
+
+ # split image into non-overlapping patches
+ self.patch_embed = PatchEmbed(
+ img_size=img_size,
+ patch_size=patch_size,
+ in_chans=in_chans,
+ embed_dim=embed_dim,
+ norm_layer=norm_layer if self.patch_norm else None)
+ num_patches = self.patch_embed.num_patches
+ patches_resolution = self.patch_embed.patches_resolution
+ self.patches_resolution = patches_resolution
+
+ # absolute position embedding
+ if self.ape:
+ self.absolute_pos_embed = nn.Parameter(
+ torch.zeros(1, num_patches, embed_dim))
+ trunc_normal_(self.absolute_pos_embed, std=.02)
+
+ self.pos_drop = nn.Dropout(p=drop_rate)
+
+ # stochastic depth
+ dpr = [
+ x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))
+ ] # stochastic depth decay rule
+
+ # build layers
+ self.layers = nn.ModuleList()
+ for i_layer in range(self.num_layers):
+ layer = BasicLayer(
+ dim=int(embed_dim * 2**i_layer),
+ input_resolution=(patches_resolution[0] // (2**i_layer),
+ patches_resolution[1] // (2**i_layer)),
+ depth=depths[i_layer],
+ num_heads=num_heads[i_layer],
+ window_size=window_size,
+ mlp_ratio=self.mlp_ratio,
+ qkv_bias=qkv_bias,
+ drop=drop_rate,
+ attn_drop=attn_drop_rate,
+ drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])],
+ norm_layer=norm_layer,
+ downsample=PatchMerging if
+ (i_layer < self.num_layers - 1) else None,
+ use_checkpoint=use_checkpoint,
+ pretrained_window_size=pretrained_window_sizes[i_layer])
+ self.layers.append(layer)
+
+ self.norm = norm_layer(self.num_features)
+ self.avgpool = nn.AdaptiveAvgPool1d(1)
+ self.multi_scale = multi_scale
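+        # Two decoding options (see forward_features): with multi_scale, the
+        # features of every stage are upsampled to a common 1/8 resolution and
+        # fused by a 1x1 convolution; otherwise only the final 1/32 feature
+        # map is processed by the selected `upsample` module (an unrecognized
+        # value falls back to nn.Identity, i.e. no upsampling).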
+ if self.multi_scale:
+ self.scales = [1, 2, 4, 4]
+ self.upsample = nn.ModuleList()
+ features = [
+ int(embed_dim * 2**i) for i in range(1, self.num_layers)
+ ] + [self.num_features]
+ self.multi_scale_fuse = nn.Conv2d(
+ sum(features), self.num_features, 1)
+ for i in range(self.num_layers):
+ self.upsample.append(nn.Upsample(scale_factor=self.scales[i]))
+ else:
+ if upsample == 'deconv':
+ self.upsample = nn.ConvTranspose2d(
+ self.num_features, self.num_features, 2, stride=2)
+ elif upsample == 'new_deconv':
+ self.upsample = nn.Sequential(
+ nn.Upsample(
+ scale_factor=2, mode='bilinear', align_corners=False),
+ nn.Conv2d(
+ self.num_features,
+ self.num_features,
+ 3,
+ stride=1,
+ padding=1), nn.BatchNorm2d(self.num_features),
+ nn.ReLU(inplace=True))
+ elif upsample == 'new_deconv2':
+ self.upsample = nn.Sequential(
+ nn.Upsample(scale_factor=2),
+ nn.Conv2d(
+ self.num_features,
+ self.num_features,
+ 3,
+ stride=1,
+ padding=1), nn.BatchNorm2d(self.num_features),
+ nn.ReLU(inplace=True))
+ elif upsample == 'bilinear':
+ self.upsample = nn.Upsample(
+ scale_factor=2, mode='bilinear', align_corners=False)
+ else:
+ self.upsample = nn.Identity()
+ self.head = nn.Linear(
+ self.num_features,
+ num_classes) if num_classes > 0 else nn.Identity()
+
+ self.apply(self._init_weights)
+ for bly in self.layers:
+ bly._init_respostnorm()
+
+ def _init_weights(self, m):
+ if isinstance(m, nn.Linear):
+ trunc_normal_(m.weight, std=.02)
+ if isinstance(m, nn.Linear) and m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+
+ @torch.jit.ignore
+ def no_weight_decay(self):
+ return {'absolute_pos_embed'}
+
+ @torch.jit.ignore
+ def no_weight_decay_keywords(self):
+ return {'cpb_mlp', 'logit_scale', 'relative_position_bias_table'}
+
+ def forward_features(self, x):
+ B, C, H, W = x.shape
+ x = self.patch_embed(x)
+ if self.ape:
+ x = x + self.absolute_pos_embed
+ x = self.pos_drop(x)
+
+ if self.multi_scale:
+            # x_2d = x.view(B, H // 4, W // 4, -1).permute(0, 3, 1, 2)
+            # features = [self.upsample[0](x_2d)]  # x_2d: B C H W
+ features = []
+ for i, layer in enumerate(self.layers):
+ x = layer(x)
+ x_2d = x.view(B, H // (8 * self.scales[i]),
+ W // (8 * self.scales[i]),
+ -1).permute(0, 3, 1, 2) # B C H W
+ features.append(self.upsample[i](x_2d))
+ x = torch.cat(features, dim=1)
+ x = self.multi_scale_fuse(x)
+ x = x.view(B, self.num_features, -1).permute(0, 2, 1)
+ x = self.norm(x) # B L C
+ x = x.view(B, H // 8, W // 8,
+ self.num_features).permute(0, 3, 1, 2) # B C H W
+
+ else:
+ for layer in self.layers:
+ x = layer(x)
+ x = self.norm(x) # B L C
+ x = x.view(B, H // 32, W // 32,
+ self.num_features).permute(0, 3, 1, 2) # B C H W
+ x = self.upsample(x)
+
+ return x
+
+ def forward(self, x):
+ x = self.forward_features(x)
+ x = self.head(x)
+ return x
+
+ def flops(self):
+ flops = 0
+ flops += self.patch_embed.flops()
+ for i, layer in enumerate(self.layers):
+ flops += layer.flops()
+ flops += self.num_features * self.patches_resolution[
+ 0] * self.patches_resolution[1] // (2**self.num_layers)
+ flops += self.num_features * self.num_classes
+ return flops
diff --git a/projects/pose_anything/models/backbones/swin_utils.py b/projects/pose_anything/models/backbones/swin_utils.py
new file mode 100644
index 0000000000..f53c928448
--- /dev/null
+++ b/projects/pose_anything/models/backbones/swin_utils.py
@@ -0,0 +1,127 @@
+# --------------------------------------------------------
+# SimMIM
+# Copyright (c) 2021 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ze Liu
+# Modified by Zhenda Xie
+# --------------------------------------------------------
+
+import numpy as np
+import torch
+from scipy import interpolate
+
+
+def load_pretrained(config, model, logger):
+ checkpoint = torch.load(config, map_location='cpu')
+ ckpt_model = checkpoint['model']
+    if any('encoder.' in k for k in ckpt_model.keys()):
+        ckpt_model = {
+            k.replace('encoder.', ''): v
+            for k, v in ckpt_model.items() if k.startswith('encoder.')
+        }
+        print('Detected an [encoder.]-prefixed checkpoint, removing the '
+              'prefix.')
+    else:
+        print('Detected a plain checkpoint, loading it as-is.')
+
+ checkpoint = remap_pretrained_keys_swin(model, ckpt_model, logger)
+ msg = model.load_state_dict(ckpt_model, strict=False)
+ print(msg)
+
+ del checkpoint
+ torch.cuda.empty_cache()
+
+
+def remap_pretrained_keys_swin(model, checkpoint_model, logger):
+ state_dict = model.state_dict()
+
+ # Geometric interpolation when pre-trained patch size mismatch with
+ # fine-tuned patch size
+ all_keys = list(checkpoint_model.keys())
+ for key in all_keys:
+ if 'relative_position_bias_table' in key:
+ relative_position_bias_table_pretrained = checkpoint_model[key]
+ relative_position_bias_table_current = state_dict[key]
+ L1, nH1 = relative_position_bias_table_pretrained.size()
+ L2, nH2 = relative_position_bias_table_current.size()
+ if nH1 != nH2:
+                print(f'Error in loading {key}, skipping...')
+ else:
+ if L1 != L2:
+ print(f'{key}: Interpolate relative_position_bias_table '
+ f'using geo.')
+ src_size = int(L1**0.5)
+ dst_size = int(L2**0.5)
+
+ def geometric_progression(a, r, n):
+ return a * (1.0 - r**n) / (1.0 - r)
+
+ left, right = 1.01, 1.5
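+                    # Bisection: find the ratio q of a geometric progression
+                    # whose src_size // 2 steps span roughly dst_size // 2
+                    # positions, so the pretrained relative-position grid can
+                    # be stretched to the new window size before the cubic
+                    # interpolation below.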
+ while right - left > 1e-6:
+ q = (left + right) / 2.0
+ gp = geometric_progression(1, q, src_size // 2)
+ if gp > dst_size // 2:
+ right = q
+ else:
+ left = q
+
+ # if q > 1.090307:
+ # q = 1.090307
+
+ dis = []
+ cur = 1
+ for i in range(src_size // 2):
+ dis.append(cur)
+ cur += q**(i + 1)
+
+ r_ids = [-_ for _ in reversed(dis)]
+
+ x = r_ids + [0] + dis
+ y = r_ids + [0] + dis
+
+ t = dst_size // 2.0
+ dx = np.arange(-t, t + 0.1, 1.0)
+ dy = np.arange(-t, t + 0.1, 1.0)
+
+ print('Original positions = %s' % str(x))
+ print('Target positions = %s' % str(dx))
+
+ all_rel_pos_bias = []
+
+ for i in range(nH1):
+ z = relative_position_bias_table_pretrained[:, i].view(
+ src_size, src_size).float().numpy()
+ f_cubic = interpolate.interp2d(x, y, z, kind='cubic')
+ all_rel_pos_bias.append(
+ torch.Tensor(f_cubic(dx,
+ dy)).contiguous().view(-1, 1).
+ to(relative_position_bias_table_pretrained.device))
+
+ new_rel_pos_bias = torch.cat(all_rel_pos_bias, dim=-1)
+ checkpoint_model[key] = new_rel_pos_bias
+
+ # delete relative_position_index since we always re-init it
+ relative_position_index_keys = [
+ k for k in checkpoint_model.keys() if 'relative_position_index' in k
+ ]
+ for k in relative_position_index_keys:
+ del checkpoint_model[k]
+
+ # delete relative_coords_table since we always re-init it
+ relative_coords_table_keys = [
+ k for k in checkpoint_model.keys() if 'relative_coords_table' in k
+ ]
+ for k in relative_coords_table_keys:
+ del checkpoint_model[k]
+
+ # re-map keys due to name change
+ rpe_mlp_keys = [k for k in checkpoint_model.keys() if 'rpe_mlp' in k]
+ for k in rpe_mlp_keys:
+ checkpoint_model[k.replace('rpe_mlp',
+ 'cpb_mlp')] = checkpoint_model.pop(k)
+
+ # delete attn_mask since we always re-init it
+ attn_mask_keys = [k for k in checkpoint_model.keys() if 'attn_mask' in k]
+ for k in attn_mask_keys:
+ del checkpoint_model[k]
+
+ return checkpoint_model
diff --git a/projects/pose_anything/models/detectors/__init__.py b/projects/pose_anything/models/detectors/__init__.py
new file mode 100644
index 0000000000..038450a700
--- /dev/null
+++ b/projects/pose_anything/models/detectors/__init__.py
@@ -0,0 +1,3 @@
+from .pam import PoseAnythingModel
+
+__all__ = ['PoseAnythingModel']
diff --git a/projects/pose_anything/models/detectors/pam.py b/projects/pose_anything/models/detectors/pam.py
new file mode 100644
index 0000000000..0ce1b24673
--- /dev/null
+++ b/projects/pose_anything/models/detectors/pam.py
@@ -0,0 +1,395 @@
+import math
+from abc import ABCMeta
+
+import cv2
+import mmcv
+import numpy as np
+import torch
+from mmcv import imshow, imwrite
+from mmengine.model import BaseModel
+
+from mmpose.models import builder
+from mmpose.registry import MODELS
+from ..backbones.swin_utils import load_pretrained
+
+
+@MODELS.register_module()
+class PoseAnythingModel(BaseModel, metaclass=ABCMeta):
+ """Few-shot keypoint detectors.
+
+ Args:
+ keypoint_head (dict): Keypoint head to process feature.
+ encoder_config (dict): Config for encoder. Default: None.
+ pretrained (str): Path to the pretrained models.
+ train_cfg (dict): Config for training. Default: None.
+ test_cfg (dict): Config for testing. Default: None.
+ """
+
+ def __init__(self,
+ keypoint_head,
+ encoder_config,
+ pretrained=False,
+ train_cfg=None,
+ test_cfg=None):
+ super().__init__()
+ self.backbone, self.backbone_type = self.init_backbone(
+ pretrained, encoder_config)
+ self.keypoint_head = builder.build_head(keypoint_head)
+ self.keypoint_head.init_weights()
+ self.train_cfg = train_cfg
+ self.test_cfg = test_cfg
+ self.target_type = test_cfg.get('target_type',
+ 'GaussianHeatMap') # GaussianHeatMap
+
+ def init_backbone(self, pretrained, encoder_config):
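+        # The backbone is chosen by substring matching on ``pretrained``:
+        # a name containing 'swin' builds the Swin encoder (optionally
+        # loading a '.pth' checkpoint), 'dino'/'dinov2' fetch the encoder
+        # from torch.hub, and 'resnet' builds a ResNet-50 initialized from
+        # torchvision weights.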
+ if 'swin' in pretrained:
+ encoder_sample = builder.build_backbone(encoder_config)
+ if '.pth' in pretrained:
+ load_pretrained(pretrained, encoder_sample, logger=None)
+ backbone = 'swin'
+ elif 'dino' in pretrained:
+ if 'dinov2' in pretrained:
+ repo = 'facebookresearch/dinov2'
+ backbone = 'dinov2'
+ else:
+ repo = 'facebookresearch/dino:main'
+ backbone = 'dino'
+ encoder_sample = torch.hub.load(repo, pretrained)
+ elif 'resnet' in pretrained:
+ pretrained = 'torchvision://resnet50'
+ encoder_config = dict(type='ResNet', depth=50, out_indices=(3, ))
+ encoder_sample = builder.build_backbone(encoder_config)
+ encoder_sample.init_weights(pretrained)
+ backbone = 'resnet50'
+ else:
+ raise NotImplementedError(f'backbone {pretrained} not supported')
+ return encoder_sample, backbone
+
+ @property
+ def with_keypoint(self):
+ """Check if has keypoint_head."""
+ return hasattr(self, 'keypoint_head')
+
+ def init_weights(self, pretrained=None):
+ """Weight initialization for model."""
+ self.backbone.init_weights(pretrained)
+ self.encoder_query.init_weights(pretrained)
+ self.keypoint_head.init_weights()
+
+ def forward(self,
+ img_s,
+ img_q,
+ target_s=None,
+ target_weight_s=None,
+ target_q=None,
+ target_weight_q=None,
+ img_metas=None,
+ return_loss=True,
+ **kwargs):
+ """Defines the computation performed at every call."""
+
+ if return_loss:
+ return self.forward_train(img_s, target_s, target_weight_s, img_q,
+ target_q, target_weight_q, img_metas,
+ **kwargs)
+ else:
+ return self.forward_test(img_s, target_s, target_weight_s, img_q,
+ target_q, target_weight_q, img_metas,
+ **kwargs)
+
+ def forward_dummy(self, img_s, target_s, target_weight_s, img_q, target_q,
+ target_weight_q, img_metas, **kwargs):
+ return self.predict(img_s, target_s, target_weight_s, img_q, img_metas)
+
+ def forward_train(self, img_s, target_s, target_weight_s, img_q, target_q,
+ target_weight_q, img_metas, **kwargs):
+ """Defines the computation performed at every call when training."""
+ bs, _, h, w = img_q.shape
+
+ output, initial_proposals, similarity_map, mask_s = self.predict(
+ img_s, target_s, target_weight_s, img_q, img_metas)
+
+ # parse the img meta to get the target keypoints
+ target_keypoints = self.parse_keypoints_from_img_meta(
+ img_metas, output.device, keyword='query')
+ target_sizes = torch.tensor([img_q.shape[-2],
+ img_q.shape[-1]]).unsqueeze(0).repeat(
+ img_q.shape[0], 1, 1)
+
+ # if return loss
+ losses = dict()
+ if self.with_keypoint:
+ keypoint_losses = self.keypoint_head.get_loss(
+ output, initial_proposals, similarity_map, target_keypoints,
+ target_q, target_weight_q * mask_s, target_sizes)
+ losses.update(keypoint_losses)
+ keypoint_accuracy = self.keypoint_head.get_accuracy(
+ output[-1],
+ target_keypoints,
+ target_weight_q * mask_s,
+ target_sizes,
+ height=h)
+ losses.update(keypoint_accuracy)
+
+ return losses
+
+ def forward_test(self,
+ img_s,
+ target_s,
+ target_weight_s,
+ img_q,
+ target_q,
+ target_weight_q,
+ img_metas=None,
+ **kwargs):
+ """Defines the computation performed at every call when testing."""
+ batch_size, _, img_height, img_width = img_q.shape
+
+ output, initial_proposals, similarity_map, _ = self.predict(
+ img_s, target_s, target_weight_s, img_q, img_metas)
+ predicted_pose = output[-1].detach().cpu().numpy(
+ ) # [bs, num_query, 2]
+
+ result = {}
+ if self.with_keypoint:
+ keypoint_result = self.keypoint_head.decode(
+ img_metas, predicted_pose, img_size=[img_width, img_height])
+ result.update(keypoint_result)
+
+ result.update({
+ 'points':
+ torch.cat((initial_proposals, output.squeeze(1)),
+ dim=0).cpu().numpy()
+ })
+ result.update({'sample_image_file': img_metas[0]['sample_image_file']})
+
+ return result
+
+ def predict(self, img_s, target_s, target_weight_s, img_q, img_metas=None):
+
+ batch_size, _, img_height, img_width = img_q.shape
+ assert [
+ i['sample_skeleton'][0] != i['query_skeleton'] for i in img_metas
+ ]
+ skeleton = [i['sample_skeleton'][0] for i in img_metas]
+
+ feature_q, feature_s = self.extract_features(img_s, img_q)
+
+ mask_s = target_weight_s[0]
+ for target_weight in target_weight_s:
+ mask_s = mask_s * target_weight
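+        # ``mask_s`` keeps only the keypoints labelled in every support
+        # image: a joint missing from any support sample is masked out of
+        # the loss and the head input.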
+
+ output, initial_proposals, similarity_map = self.keypoint_head(
+ feature_q, feature_s, target_s, mask_s, skeleton)
+
+ return output, initial_proposals, similarity_map, mask_s
+
+ def extract_features(self, img_s, img_q):
+ if self.backbone_type == 'swin':
+ feature_q = self.backbone.forward_features(img_q) # [bs, C, h, w]
+ feature_s = [self.backbone.forward_features(img) for img in img_s]
+ elif self.backbone_type == 'dino':
+ batch_size, _, img_height, img_width = img_q.shape
+            feature_q = self.backbone.get_intermediate_layers(img_q, n=1)[0]
+            feature_q = feature_q[:, 1:].reshape(
+                batch_size, img_height // 8, img_width // 8,
+                -1).permute(0, 3, 1, 2)  # [bs, c, h, w]
+ feature_s = [
+ self.backbone.get_intermediate_layers(
+ img, n=1)[0][:, 1:].reshape(batch_size, img_height // 8,
+ img_width // 8,
+ -1).permute(0, 3, 1, 2)
+ for img in img_s
+ ]
+ elif self.backbone_type == 'dinov2':
+ batch_size, _, img_height, img_width = img_q.shape
+ feature_q = self.backbone.get_intermediate_layers(
+ img_q, n=1, reshape=True)[0] # [bs, c, h, w]
+ feature_s = [
+ self.backbone.get_intermediate_layers(img, n=1,
+ reshape=True)[0]
+ for img in img_s
+ ]
+ else:
+ feature_s = [self.backbone(img) for img in img_s]
+            # The query image shares the same backbone as the support
+            # images; there is no separate query encoder in this model.
+            feature_q = self.backbone(img_q)
+
+ return feature_q, feature_s
+
+ def parse_keypoints_from_img_meta(self, img_meta, device, keyword='query'):
+ """Parse keypoints from the img_meta.
+
+ Args:
+ img_meta (dict): Image meta info.
+ device (torch.device): Device of the output keypoints.
+ keyword (str): 'query' or 'sample'. Default: 'query'.
+
+ Returns:
+ Tensor: Keypoints coordinates of query images.
+ """
+
+ if keyword == 'query':
+ query_kpt = torch.stack([
+ torch.tensor(info[f'{keyword}_joints_3d']).to(device)
+ for info in img_meta
+ ],
+ dim=0)[:, :, :2] # [bs, num_query, 2]
+ else:
+ query_kpt = []
+ for info in img_meta:
+ if isinstance(info[f'{keyword}_joints_3d'][0], torch.Tensor):
+ samples = torch.stack(info[f'{keyword}_joints_3d'])
+ else:
+ samples = np.array(info[f'{keyword}_joints_3d'])
+ query_kpt.append(torch.tensor(samples).to(device)[:, :, :2])
+ query_kpt = torch.stack(
+                query_kpt, dim=0)  # [bs, num_samples, num_query, 2]
+ return query_kpt
+
+ # UNMODIFIED
+ def show_result(self,
+ img,
+ result,
+ skeleton=None,
+ kpt_score_thr=0.3,
+ bbox_color='green',
+ pose_kpt_color=None,
+ pose_limb_color=None,
+ radius=4,
+ text_color=(255, 0, 0),
+ thickness=1,
+ font_scale=0.5,
+ win_name='',
+ show=False,
+ wait_time=0,
+ out_file=None):
+ """Draw `result` over `img`.
+
+ Args:
+ img (str or Tensor): The image to be displayed.
+ result (list[dict]): The results to draw over `img`
+ (bbox_result, pose_result).
+ kpt_score_thr (float, optional): Minimum score of keypoints
+ to be shown. Default: 0.3.
+ bbox_color (str or tuple or :obj:`Color`): Color of bbox lines.
+ pose_kpt_color (np.array[Nx3]`): Color of N keypoints.
+ If None, do not draw keypoints.
+ pose_limb_color (np.array[Mx3]): Color of M limbs.
+ If None, do not draw limbs.
+ text_color (str or tuple or :obj:`Color`): Color of texts.
+ thickness (int): Thickness of lines.
+ font_scale (float): Font scales of texts.
+ win_name (str): The window name.
+ wait_time (int): Value of waitKey param.
+ Default: 0.
+ out_file (str or None): The filename to write the image.
+ Default: None.
+
+ Returns:
+ Tensor: Visualized img, only if not `show` or `out_file`.
+ """
+
+ img = mmcv.imread(img)
+ img = img.copy()
+ img_h, img_w, _ = img.shape
+
+ bbox_result = []
+ pose_result = []
+ for res in result:
+ bbox_result.append(res['bbox'])
+ pose_result.append(res['keypoints'])
+
+ if len(bbox_result) > 0:
+ bboxes = np.vstack(bbox_result)
+ # draw bounding boxes
+ mmcv.imshow_bboxes(
+ img,
+ bboxes,
+ colors=bbox_color,
+ top_k=-1,
+ thickness=thickness,
+ show=False,
+ win_name=win_name,
+ wait_time=wait_time,
+ out_file=None)
+
+ for person_id, kpts in enumerate(pose_result):
+ # draw each point on image
+ if pose_kpt_color is not None:
+ assert len(pose_kpt_color) == len(kpts), (
+ len(pose_kpt_color), len(kpts))
+ for kid, kpt in enumerate(kpts):
+ x_coord, y_coord, kpt_score = int(kpt[0]), int(
+ kpt[1]), kpt[2]
+ if kpt_score > kpt_score_thr:
+ img_copy = img.copy()
+ r, g, b = pose_kpt_color[kid]
+ cv2.circle(img_copy, (int(x_coord), int(y_coord)),
+ radius, (int(r), int(g), int(b)), -1)
+ transparency = max(0, min(1, kpt_score))
+ cv2.addWeighted(
+ img_copy,
+ transparency,
+ img,
+ 1 - transparency,
+ 0,
+ dst=img)
+
+ # draw limbs
+ if skeleton is not None and pose_limb_color is not None:
+ assert len(pose_limb_color) == len(skeleton)
+ for sk_id, sk in enumerate(skeleton):
+ pos1 = (int(kpts[sk[0] - 1, 0]), int(kpts[sk[0] - 1,
+ 1]))
+ pos2 = (int(kpts[sk[1] - 1, 0]), int(kpts[sk[1] - 1,
+ 1]))
+ if (0 < pos1[0] < img_w and 0 < pos1[1] < img_h
+ and 0 < pos2[0] < img_w and 0 < pos2[1] < img_h
+ and kpts[sk[0] - 1, 2] > kpt_score_thr
+ and kpts[sk[1] - 1, 2] > kpt_score_thr):
+ img_copy = img.copy()
+ X = (pos1[0], pos2[0])
+ Y = (pos1[1], pos2[1])
+ mX = np.mean(X)
+ mY = np.mean(Y)
+ length = ((Y[0] - Y[1])**2 + (X[0] - X[1])**2)**0.5
+ angle = math.degrees(
+ math.atan2(Y[0] - Y[1], X[0] - X[1]))
+ stickwidth = 2
+ polygon = cv2.ellipse2Poly(
+ (int(mX), int(mY)),
+ (int(length / 2), int(stickwidth)), int(angle),
+ 0, 360, 1)
+
+ r, g, b = pose_limb_color[sk_id]
+ cv2.fillConvexPoly(img_copy, polygon,
+ (int(r), int(g), int(b)))
+ transparency = max(
+ 0,
+ min(
+ 1, 0.5 *
+ (kpts[sk[0] - 1, 2] + kpts[sk[1] - 1, 2])))
+ cv2.addWeighted(
+ img_copy,
+ transparency,
+ img,
+ 1 - transparency,
+ 0,
+ dst=img)
+
+ if show:
+ height, width = img.shape[:2]
+ max_ = max(height, width)
+
+ factor = min(1, 800 / max_)
+ enlarge = cv2.resize(
+ img, (0, 0),
+ fx=factor,
+ fy=factor,
+ interpolation=cv2.INTER_CUBIC)
+ imshow(enlarge, win_name, wait_time)
+
+ if out_file is not None:
+ imwrite(img, out_file)
+
+ return img
diff --git a/projects/pose_anything/models/keypoint_heads/__init__.py b/projects/pose_anything/models/keypoint_heads/__init__.py
new file mode 100644
index 0000000000..b42289e92a
--- /dev/null
+++ b/projects/pose_anything/models/keypoint_heads/__init__.py
@@ -0,0 +1,3 @@
+from .head import PoseHead
+
+__all__ = ['PoseHead']
diff --git a/projects/pose_anything/models/keypoint_heads/head.py b/projects/pose_anything/models/keypoint_heads/head.py
new file mode 100644
index 0000000000..f565a63580
--- /dev/null
+++ b/projects/pose_anything/models/keypoint_heads/head.py
@@ -0,0 +1,438 @@
+from copy import deepcopy
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from mmcv.cnn import Conv2d, Linear
+from mmengine.model import xavier_init
+from models.utils import build_positional_encoding, build_transformer
+
+# from mmcv.cnn.bricks.transformer import build_positional_encoding
+from mmpose.evaluation import keypoint_pck_accuracy
+from mmpose.models import HEADS
+from mmpose.models.utils.ops import resize
+
+
+def transform_preds(coords, center, scale, output_size, use_udp=False):
+ """Get final keypoint predictions from heatmaps and apply scaling and
+ translation to map them back to the image.
+
+ Note:
+ num_keypoints: K
+
+ Args:
+ coords (np.ndarray[K, ndims]):
+
+            * If ndims=2, coords are predicted keypoint locations.
+            * If ndims=4, coords are composed of (x, y, scores, tags)
+            * If ndims=5, coords are composed of (x, y, scores, tags,
+ flipped_tags)
+
+ center (np.ndarray[2, ]): Center of the bounding box (x, y).
+ scale (np.ndarray[2, ]): Scale of the bounding box
+ wrt [width, height].
+ output_size (np.ndarray[2, ] | list(2,)): Size of the
+ destination heatmaps.
+ use_udp (bool): Use unbiased data processing
+
+ Returns:
+ np.ndarray: Predicted coordinates in the images.
+ """
+ assert coords.shape[1] in (2, 4, 5)
+ assert len(center) == 2
+ assert len(scale) == 2
+ assert len(output_size) == 2
+
+ # Recover the scale which is normalized by a factor of 200.
+ scale = scale * 200.0
+
+ if use_udp:
+ scale_x = scale[0] / (output_size[0] - 1.0)
+ scale_y = scale[1] / (output_size[1] - 1.0)
+ else:
+ scale_x = scale[0] / output_size[0]
+ scale_y = scale[1] / output_size[1]
+
+ target_coords = coords.copy()
+ target_coords[:, 0] = coords[:, 0] * scale_x + center[0] - scale[0] * 0.5
+ target_coords[:, 1] = coords[:, 1] * scale_y + center[1] - scale[1] * 0.5
+
+ return target_coords
+
+
+def inverse_sigmoid(x, eps=1e-3):
+ x = x.clamp(min=0, max=1)
+ x1 = x.clamp(min=eps)
+ x2 = (1 - x).clamp(min=eps)
+ return torch.log(x1 / x2)
+
+
+class TokenDecodeMLP(nn.Module):
+ """The MLP used to predict coordinates from the support keypoints
+ tokens."""
+
+ def __init__(self,
+ in_channels,
+ hidden_channels,
+ out_channels=2,
+ num_layers=3):
+ super(TokenDecodeMLP, self).__init__()
+ layers = []
+ for i in range(num_layers):
+ if i == 0:
+ layers.append(nn.Linear(in_channels, hidden_channels))
+ layers.append(nn.GELU())
+ else:
+ layers.append(nn.Linear(hidden_channels, hidden_channels))
+ layers.append(nn.GELU())
+ layers.append(nn.Linear(hidden_channels, out_channels))
+ self.mlp = nn.Sequential(*layers)
+
+ def forward(self, x):
+ return self.mlp(x)
+
+
+@HEADS.register_module()
+class PoseHead(nn.Module):
+ """In two stage regression A3, the proposal generator are moved into
+ transformer.
+
+ All valid proposals will be added with an positional embedding to better
+ regress the location
+ """
+
+ def __init__(self,
+ in_channels,
+ transformer=None,
+ positional_encoding=dict(
+ type='SinePositionalEncoding',
+ num_feats=128,
+ normalize=True),
+ encoder_positional_encoding=dict(
+ type='SinePositionalEncoding',
+ num_feats=512,
+ normalize=True),
+ share_kpt_branch=False,
+ num_decoder_layer=3,
+ with_heatmap_loss=False,
+ with_bb_loss=False,
+ bb_temperature=0.2,
+ heatmap_loss_weight=2.0,
+ support_order_dropout=-1,
+ extra=None,
+ train_cfg=None,
+ test_cfg=None):
+ super().__init__()
+
+ self.in_channels = in_channels
+ self.positional_encoding = build_positional_encoding(
+ positional_encoding)
+ self.encoder_positional_encoding = build_positional_encoding(
+ encoder_positional_encoding)
+ self.transformer = build_transformer(transformer)
+ self.embed_dims = self.transformer.d_model
+ self.with_heatmap_loss = with_heatmap_loss
+ self.with_bb_loss = with_bb_loss
+ self.bb_temperature = bb_temperature
+ self.heatmap_loss_weight = heatmap_loss_weight
+ self.support_order_dropout = support_order_dropout
+
+ assert 'num_feats' in positional_encoding
+ num_feats = positional_encoding['num_feats']
+        assert num_feats * 2 == self.embed_dims, (
+            f'embed_dims should be exactly 2 times of num_feats. '
+            f'Found {self.embed_dims} and {num_feats}.')
+ if extra is not None and not isinstance(extra, dict):
+ raise TypeError('extra should be dict or None.')
+ """Initialize layers of the transformer head."""
+ self.input_proj = Conv2d(
+ self.in_channels, self.embed_dims, kernel_size=1)
+ self.query_proj = Linear(self.in_channels, self.embed_dims)
+ # Instantiate the proposal generator and subsequent keypoint branch.
+ kpt_branch = TokenDecodeMLP(
+ in_channels=self.embed_dims, hidden_channels=self.embed_dims)
+ if share_kpt_branch:
+ self.kpt_branch = nn.ModuleList(
+ [kpt_branch for i in range(num_decoder_layer)])
+ else:
+ self.kpt_branch = nn.ModuleList(
+ [deepcopy(kpt_branch) for i in range(num_decoder_layer)])
+
+ self.train_cfg = {} if train_cfg is None else train_cfg
+ self.test_cfg = {} if test_cfg is None else test_cfg
+ self.target_type = self.test_cfg.get('target_type', 'GaussianHeatMap')
+
+ def init_weights(self):
+ for m in self.modules():
+ if hasattr(m, 'weight') and m.weight.dim() > 1:
+ xavier_init(m, distribution='uniform')
+ """Initialize weights of the transformer head."""
+ # The initialization for transformer is important
+ self.transformer.init_weights()
+ # initialization for input_proj & prediction head
+ for mlp in self.kpt_branch:
+ nn.init.constant_(mlp.mlp[-1].weight.data, 0)
+ nn.init.constant_(mlp.mlp[-1].bias.data, 0)
+ nn.init.xavier_uniform_(self.input_proj.weight, gain=1)
+ nn.init.constant_(self.input_proj.bias, 0)
+
+ nn.init.xavier_uniform_(self.query_proj.weight, gain=1)
+ nn.init.constant_(self.query_proj.bias, 0)
+
+ def forward(self, x, feature_s, target_s, mask_s, skeleton):
+ """"Forward function for a single feature level.
+
+ Args:
+ x (Tensor): Input feature from backbone's single stage, shape
+ [bs, c, h, w].
+
+ Returns:
+ all_cls_scores (Tensor): Outputs from the classification head,
+ shape [nb_dec, bs, num_query, cls_out_channels]. Note
+ cls_out_channels should includes background.
+ all_bbox_preds (Tensor): Sigmoid outputs from the regression
+ head with normalized coordinate format (cx, cy, w, h).
+ Shape [nb_dec, bs, num_query, 4].
+ """
+        # construct binary masks used by the transformer.
+        # NOTE: following the official DETR repo, non-zero values represent
+        # ignored positions, while zero values mean valid positions.
+
+ # process query image feature
+ x = self.input_proj(x)
+ bs, dim, h, w = x.shape
+
+ # Disable the support keypoint positional embedding
+ support_order_embedding = x.new_zeros(
+ (bs, self.embed_dims, 1, target_s[0].shape[1])).to(torch.bool)
+
+ # Feature map pos embedding
+ masks = x.new_zeros(
+ (x.shape[0], x.shape[2], x.shape[3])).to(torch.bool)
+ pos_embed = self.positional_encoding(masks)
+
+ # process keypoint token feature
+ query_embed_list = []
+ for i, (feature, target) in enumerate(zip(feature_s, target_s)):
+ # resize the support feature back to the heatmap sizes.
+ resized_feature = resize(
+ input=feature,
+ size=target.shape[-2:],
+ mode='bilinear',
+ align_corners=False)
+ target = target / (
+ target.sum(dim=-1).sum(dim=-1)[:, :, None, None] + 1e-8)
+ support_keypoints = target.flatten(2) @ resized_feature.flatten(
+ 2).permute(0, 2, 1)
+ query_embed_list.append(support_keypoints)
+
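+        # Average the pooled keypoint tokens over all support images, zero
+        # out invisible joints with ``mask_s`` and project the tokens to
+        # the transformer embedding dimension.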
+ support_keypoints = torch.mean(torch.stack(query_embed_list, dim=0), 0)
+ support_keypoints = support_keypoints * mask_s
+ support_keypoints = self.query_proj(support_keypoints)
+ masks_query = (~mask_s.to(torch.bool)).squeeze(
+            -1)  # True indicates this query matches no annotated joint.
+
+ # outs_dec: [nb_dec, bs, num_query, c]
+ # memory: [bs, c, h, w]
+ # x = Query image feature,
+ # support_keypoints = Support keypoint feature
+ outs_dec, initial_proposals, out_points, similarity_map = (
+ self.transformer(x, masks, support_keypoints, pos_embed,
+ support_order_embedding, masks_query,
+ self.positional_encoding, self.kpt_branch,
+ skeleton))
+
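+        # Each decoder layer predicts a coordinate offset in logit space on
+        # top of its reference points; sigmoid maps the refined result back
+        # to normalized [0, 1] coordinates.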
+ output_kpts = []
+ for idx in range(outs_dec.shape[0]):
+ layer_delta_unsig = self.kpt_branch[idx](outs_dec[idx])
+ layer_outputs_unsig = layer_delta_unsig + inverse_sigmoid(
+ out_points[idx])
+ output_kpts.append(layer_outputs_unsig.sigmoid())
+
+ return torch.stack(
+ output_kpts, dim=0), initial_proposals, similarity_map
+
+ def get_loss(self, output, initial_proposals, similarity_map, target,
+ target_heatmap, target_weight, target_sizes):
+ # Calculate top-down keypoint loss.
+ losses = dict()
+ # denormalize the predicted coordinates.
+ num_dec_layer, bs, nq = output.shape[:3]
+ target_sizes = target_sizes.to(output.device) # [bs, 1, 2]
+ target = target / target_sizes
+ target = target[None, :, :, :].repeat(num_dec_layer, 1, 1, 1)
+
+        # count the visible joints per sample to normalize the loss; use 1
+        # for samples without any visible joint to avoid division by zero
+ normalizer = target_weight.squeeze(dim=-1).sum(dim=-1) # [bs, ]
+ normalizer[normalizer == 0] = 1
+
+ # compute the heatmap loss
+ if self.with_heatmap_loss:
+ losses['heatmap_loss'] = self.heatmap_loss(
+ similarity_map, target_heatmap, target_weight,
+ normalizer) * self.heatmap_loss_weight
+
+ # compute l1 loss for initial_proposals
+ proposal_l1_loss = F.l1_loss(
+ initial_proposals, target[0], reduction='none')
+ proposal_l1_loss = proposal_l1_loss.sum(
+ dim=-1, keepdim=False) * target_weight.squeeze(dim=-1)
+ proposal_l1_loss = proposal_l1_loss.sum(
+ dim=-1, keepdim=False) / normalizer # [bs, ]
+ losses['proposal_loss'] = proposal_l1_loss.sum() / bs
+
+ # compute l1 loss for each layer
+ for idx in range(num_dec_layer):
+ layer_output, layer_target = output[idx], target[idx]
+ l1_loss = F.l1_loss(
+ layer_output, layer_target, reduction='none') # [bs, query, 2]
+ l1_loss = l1_loss.sum(
+ dim=-1, keepdim=False) * target_weight.squeeze(
+ dim=-1) # [bs, query]
+ # normalize the loss for each sample with the number of visible
+ # joints
+ l1_loss = l1_loss.sum(dim=-1, keepdim=False) / normalizer # [bs, ]
+ losses['l1_loss' + '_layer' + str(idx)] = l1_loss.sum() / bs
+
+ return losses
+
+ def get_max_coords(self, heatmap, heatmap_size=64):
+ B, C, H, W = heatmap.shape
+ heatmap = heatmap.view(B, C, -1)
+ max_cor = heatmap.argmax(dim=2)
+ row, col = torch.floor(max_cor / heatmap_size), max_cor % heatmap_size
+ support_joints = torch.cat((row.unsqueeze(-1), col.unsqueeze(-1)),
+ dim=-1)
+ return support_joints
+
+ def heatmap_loss(self, similarity_map, target_heatmap, target_weight,
+ normalizer):
+ # similarity_map: [bs, num_query, h, w]
+ # target_heatmap: [bs, num_query, sh, sw]
+ # target_weight: [bs, num_query, 1]
+
+ # preprocess the similarity_map
+ h, w = similarity_map.shape[-2:]
+ # similarity_map = torch.clamp(similarity_map, 0.0, None)
+ similarity_map = similarity_map.sigmoid()
+
+ target_heatmap = F.interpolate(
+ target_heatmap, size=(h, w), mode='bilinear')
+        # normalize the peak of each query heatmap to 1
+        target_heatmap = (target_heatmap /
+                          (target_heatmap.max(dim=-1)[0].max(dim=-1)[0] +
+                           1e-10)[:, :, None, None])
+
+ l2_loss = F.mse_loss(
+ similarity_map, target_heatmap, reduction='none') # bs, nq, h, w
+ l2_loss = l2_loss * target_weight[:, :, :, None] # bs, nq, h, w
+ l2_loss = l2_loss.flatten(2, 3).sum(-1) / (h * w) # bs, nq
+ l2_loss = l2_loss.sum(-1) / normalizer # bs,
+
+ return l2_loss.mean()
+
+ def get_accuracy(self,
+ output,
+ target,
+ target_weight,
+ target_sizes,
+ height=256):
+ """Calculate accuracy for top-down keypoint loss.
+
+ Args:
+ output (torch.Tensor[NxKx2]): estimated keypoints in ABSOLUTE
+ coordinates.
+ target (torch.Tensor[NxKx2]): gt keypoints in ABSOLUTE coordinates.
+ target_weight (torch.Tensor[NxKx1]): Weights across different
+ joint types.
+            target_sizes (torch.Tensor[Nx2]): shapes of the image.
+        """
+        # NOTE: In POMNet, PCK is estimated at 1/8 resolution, which differs
+        # slightly from the setting here.
+
+ accuracy = dict()
+ output = output * float(height)
+ output, target, target_weight, target_sizes = (
+ output.detach().cpu().numpy(), target.detach().cpu().numpy(),
+ target_weight.squeeze(-1).long().detach().cpu().numpy(),
+ target_sizes.squeeze(1).detach().cpu().numpy())
+
+ _, avg_acc, _ = keypoint_pck_accuracy(
+ output,
+ target,
+            target_weight.astype(bool),
+ thr=0.2,
+ normalize=target_sizes)
+ accuracy['acc_pose'] = float(avg_acc)
+
+ return accuracy
+
+ def decode(self, img_metas, output, img_size, **kwargs):
+ """Decode the predicted keypoints from prediction.
+
+ Args:
+ img_metas (list(dict)): Information about data augmentation
+ By default this includes:
+ - "image_file: path to the image file
+ - "center": center of the bbox
+ - "scale": scale of the bbox
+ - "rotation": rotation of the bbox
+ - "bbox_score": score of bbox
+            output (np.ndarray[N, K, 2]): predicted keypoint coordinates.
+ """
+ batch_size = len(img_metas)
+ W, H = img_size
+ output = output * np.array([
+ W, H
+ ])[None, None, :] # [bs, query, 2], coordinates with recovered shapes.
+
+        if 'bbox_id' in img_metas[0] or 'query_bbox_id' in img_metas[0]:
+            bbox_ids = []
+        else:
+            bbox_ids = None
+
+ c = np.zeros((batch_size, 2), dtype=np.float32)
+ s = np.zeros((batch_size, 2), dtype=np.float32)
+ image_paths = []
+ score = np.ones(batch_size)
+ for i in range(batch_size):
+ c[i, :] = img_metas[i]['query_center']
+ s[i, :] = img_metas[i]['query_scale']
+ image_paths.append(img_metas[i]['query_image_file'])
+
+ if 'query_bbox_score' in img_metas[i]:
+ score[i] = np.array(
+ img_metas[i]['query_bbox_score']).reshape(-1)
+ if 'bbox_id' in img_metas[i]:
+ bbox_ids.append(img_metas[i]['bbox_id'])
+ elif 'query_bbox_id' in img_metas[i]:
+ bbox_ids.append(img_metas[i]['query_bbox_id'])
+
+ preds = np.zeros(output.shape)
+ for idx in range(output.shape[0]):
+            preds[idx] = transform_preds(
+                output[idx],
+                c[idx],
+                s[idx], [W, H],
+ use_udp=self.test_cfg.get('use_udp', False))
+
+ all_preds = np.zeros((batch_size, preds.shape[1], 3), dtype=np.float32)
+ all_boxes = np.zeros((batch_size, 6), dtype=np.float32)
+ all_preds[:, :, 0:2] = preds[:, :, 0:2]
+ all_preds[:, :, 2:3] = 1.0
+ all_boxes[:, 0:2] = c[:, 0:2]
+ all_boxes[:, 2:4] = s[:, 0:2]
+ all_boxes[:, 4] = np.prod(s * 200.0, axis=1)
+ all_boxes[:, 5] = score
+
+ result = {}
+
+ result['preds'] = all_preds
+ result['boxes'] = all_boxes
+ result['image_paths'] = image_paths
+ result['bbox_ids'] = bbox_ids
+
+ return result
diff --git a/projects/pose_anything/models/utils/__init__.py b/projects/pose_anything/models/utils/__init__.py
new file mode 100644
index 0000000000..fe9927e37f
--- /dev/null
+++ b/projects/pose_anything/models/utils/__init__.py
@@ -0,0 +1,22 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from .builder import (build_backbone, build_linear_layer,
+ build_positional_encoding, build_transformer)
+from .encoder_decoder import EncoderDecoder
+from .positional_encoding import (LearnedPositionalEncoding,
+ SinePositionalEncoding)
+from .transformer import (DetrTransformerDecoder, DetrTransformerDecoderLayer,
+ DetrTransformerEncoder, DynamicConv)
+
+__all__ = [
+ 'build_transformer',
+ 'build_backbone',
+ 'build_linear_layer',
+ 'build_positional_encoding',
+ 'DetrTransformerDecoderLayer',
+ 'DetrTransformerDecoder',
+ 'DetrTransformerEncoder',
+ 'LearnedPositionalEncoding',
+ 'SinePositionalEncoding',
+ 'EncoderDecoder',
+ 'DynamicConv',
+]
diff --git a/projects/pose_anything/models/utils/builder.py b/projects/pose_anything/models/utils/builder.py
new file mode 100644
index 0000000000..a165f98c70
--- /dev/null
+++ b/projects/pose_anything/models/utils/builder.py
@@ -0,0 +1,61 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch.nn as nn
+from mmengine import build_from_cfg
+
+from mmpose.registry import Registry
+
+TRANSFORMER = Registry('Transformer')
+BACKBONES = Registry('BACKBONES')
+POSITIONAL_ENCODING = Registry('position encoding')
+LINEAR_LAYERS = Registry('linear layers')
+
+
+def build_positional_encoding(cfg, default_args=None):
+ """Build backbone."""
+ return build_from_cfg(cfg, POSITIONAL_ENCODING, default_args)
+
+
+def build_backbone(cfg, default_args=None):
+ """Build backbone."""
+ return build_from_cfg(cfg, BACKBONES, default_args)
+
+
+def build_transformer(cfg, default_args=None):
+ """Builder for Transformer."""
+ return build_from_cfg(cfg, TRANSFORMER, default_args)
+
+
+LINEAR_LAYERS.register_module('Linear', module=nn.Linear)
+
+
+def build_linear_layer(cfg, *args, **kwargs):
+ """Build linear layer.
+ Args:
+ cfg (None or dict): The linear layer config, which should contain:
+ - type (str): Layer type.
+        - layer args: Args needed to instantiate a linear layer.
+ args (argument list): Arguments passed to the `__init__`
+ method of the corresponding linear layer.
+ kwargs (keyword arguments): Keyword arguments passed to the `__init__`
+ method of the corresponding linear layer.
+ Returns:
+ nn.Module: Created linear layer.
+ """
+ if cfg is None:
+ cfg_ = dict(type='Linear')
+ else:
+ if not isinstance(cfg, dict):
+ raise TypeError('cfg must be a dict')
+ if 'type' not in cfg:
+ raise KeyError('the cfg dict must contain the key "type"')
+ cfg_ = cfg.copy()
+
+ layer_type = cfg_.pop('type')
+ if layer_type not in LINEAR_LAYERS:
+ raise KeyError(f'Unrecognized linear type {layer_type}')
+ else:
+ linear_layer = LINEAR_LAYERS.get(layer_type)
+
+ layer = linear_layer(*args, **kwargs, **cfg_)
+
+ return layer
diff --git a/projects/pose_anything/models/utils/encoder_decoder.py b/projects/pose_anything/models/utils/encoder_decoder.py
new file mode 100644
index 0000000000..9e197efb14
--- /dev/null
+++ b/projects/pose_anything/models/utils/encoder_decoder.py
@@ -0,0 +1,637 @@
+import copy
+from typing import Optional
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from mmengine.model import xavier_init
+from models.utils.builder import TRANSFORMER
+from torch import Tensor
+
+
+def inverse_sigmoid(x, eps=1e-3):
+ x = x.clamp(min=0, max=1)
+ x1 = x.clamp(min=eps)
+ x2 = (1 - x).clamp(min=eps)
+ return torch.log(x1 / x2)
+
+
+class MLP(nn.Module):
+ """Very simple multi-layer perceptron (also called FFN)"""
+
+ def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
+ super().__init__()
+ self.num_layers = num_layers
+ h = [hidden_dim] * (num_layers - 1)
+ self.layers = nn.ModuleList(
+ nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))
+
+ def forward(self, x):
+ for i, layer in enumerate(self.layers):
+ x = F.gelu(layer(x)) if i < self.num_layers - 1 else layer(x)
+ return x
+
+
+class ProposalGenerator(nn.Module):
+
+ def __init__(self, hidden_dim, proj_dim, dynamic_proj_dim):
+ super().__init__()
+ self.support_proj = nn.Linear(hidden_dim, proj_dim)
+ self.query_proj = nn.Linear(hidden_dim, proj_dim)
+ self.dynamic_proj = nn.Sequential(
+ nn.Linear(hidden_dim, dynamic_proj_dim), nn.ReLU(),
+ nn.Linear(dynamic_proj_dim, hidden_dim))
+ self.dynamic_act = nn.Tanh()
+
+ def forward(self, query_feat, support_feat, spatial_shape):
+ """
+ Args:
+ support_feat: [query, bs, c]
+ query_feat: [hw, bs, c]
+ spatial_shape: h, w
+ """
+ device = query_feat.device
+ _, bs, c = query_feat.shape
+ h, w = spatial_shape
+ side_normalizer = torch.tensor([w, h]).to(query_feat.device)[
+ None, None, :] # [bs, query, 2], Normalize the coord to [0,1]
+
+ query_feat = query_feat.transpose(0, 1)
+ support_feat = support_feat.transpose(0, 1)
+ nq = support_feat.shape[1]
+
+ fs_proj = self.support_proj(support_feat) # [bs, query, c]
+ fq_proj = self.query_proj(query_feat) # [bs, hw, c]
+ pattern_attention = self.dynamic_act(
+ self.dynamic_proj(fs_proj)) # [bs, query, c]
+
+ fs_feat = (pattern_attention + 1) * fs_proj # [bs, query, c]
+ similarity = torch.bmm(fq_proj,
+ fs_feat.transpose(1, 2)) # [bs, hw, query]
+ similarity = similarity.transpose(1, 2).reshape(bs, nq, h, w)
+ grid_y, grid_x = torch.meshgrid(
+ torch.linspace(
+ 0.5, h - 0.5, h, dtype=torch.float32, device=device), # (h, w)
+ torch.linspace(
+ 0.5, w - 0.5, w, dtype=torch.float32, device=device))
+
+ # compute softmax and sum up
+ coord_grid = torch.stack([grid_x, grid_y],
+ dim=0).unsqueeze(0).unsqueeze(0).repeat(
+ bs, nq, 1, 1, 1) # [bs, query, 2, h, w]
+ coord_grid = coord_grid.permute(0, 1, 3, 4, 2) # [bs, query, h, w, 2]
+ similarity_softmax = similarity.flatten(2, 3).softmax(
+ dim=-1) # [bs, query, hw]
+ similarity_coord_grid = similarity_softmax[:, :, :,
+ None] * coord_grid.flatten(
+ 2, 3)
+ proposal_for_loss = similarity_coord_grid.sum(
+ dim=2, keepdim=False) # [bs, query, 2]
+ proposal_for_loss = proposal_for_loss / side_normalizer
+
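+        # For the returned proposals, restrict the soft-argmax to a 3x3
+        # window around the peak of each similarity map; this is less
+        # sensitive to multi-modal responses than the global expectation
+        # used for ``proposal_for_loss`` above.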
+ max_pos = torch.argmax(
+ similarity.reshape(bs, nq, -1), dim=-1,
+ keepdim=True) # (bs, nq, 1)
+ max_mask = F.one_hot(max_pos, num_classes=w * h) # (bs, nq, 1, w*h)
+ max_mask = max_mask.reshape(bs, nq, w,
+ h).type(torch.float) # (bs, nq, w, h)
+ local_max_mask = F.max_pool2d(
+ input=max_mask, kernel_size=3, stride=1,
+ padding=1).reshape(bs, nq, w * h, 1) # (bs, nq, w*h, 1)
+ '''
+ proposal = (similarity_coord_grid * local_max_mask).sum(
+ dim=2, keepdim=False) / torch.count_nonzero(
+ local_max_mask, dim=2)
+ '''
+ # first, extract the local probability map with the mask
+ local_similarity_softmax = similarity_softmax[:, :, :,
+ None] * local_max_mask
+
+ # then, re-normalize the local probability map
+ local_similarity_softmax = local_similarity_softmax / (
+ local_similarity_softmax.sum(dim=-2, keepdim=True) + 1e-10
+ ) # [bs, nq, w*h, 1]
+
+        # point-wise multiplication of local probability map and coord grid
+ proposals = local_similarity_softmax * coord_grid.flatten(
+ 2, 3) # [bs, nq, w*h, 2]
+
+        # sum the products to obtain the final coord proposals
+ proposals = proposals.sum(dim=2) / side_normalizer # [bs, nq, 2]
+
+ return proposal_for_loss, similarity, proposals
+
+
+@TRANSFORMER.register_module()
+class EncoderDecoder(nn.Module):
+
+ def __init__(self,
+ d_model=256,
+ nhead=8,
+ num_encoder_layers=3,
+ num_decoder_layers=3,
+ graph_decoder=None,
+ dim_feedforward=2048,
+ dropout=0.1,
+ activation='relu',
+ normalize_before=False,
+ similarity_proj_dim=256,
+ dynamic_proj_dim=128,
+ return_intermediate_dec=True,
+ look_twice=False,
+ detach_support_feat=False):
+ super().__init__()
+
+ self.d_model = d_model
+ self.nhead = nhead
+
+ encoder_layer = TransformerEncoderLayer(d_model, nhead,
+ dim_feedforward, dropout,
+ activation, normalize_before)
+ encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
+ self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers,
+ encoder_norm)
+
+ decoder_layer = GraphTransformerDecoderLayer(d_model, nhead,
+ dim_feedforward, dropout,
+ activation,
+ normalize_before,
+ graph_decoder)
+ decoder_norm = nn.LayerNorm(d_model)
+ self.decoder = GraphTransformerDecoder(
+ d_model,
+ decoder_layer,
+ num_decoder_layers,
+ decoder_norm,
+ return_intermediate=return_intermediate_dec,
+ look_twice=look_twice,
+ detach_support_feat=detach_support_feat)
+
+ self.proposal_generator = ProposalGenerator(
+ hidden_dim=d_model,
+ proj_dim=similarity_proj_dim,
+ dynamic_proj_dim=dynamic_proj_dim)
+
+ def init_weights(self):
+ # follow the official DETR to init parameters
+ for m in self.modules():
+ if hasattr(m, 'weight') and m.weight.dim() > 1:
+ xavier_init(m, distribution='uniform')
+
+ def forward(self,
+ src,
+ mask,
+ support_embed,
+ pos_embed,
+ support_order_embed,
+ query_padding_mask,
+ position_embedding,
+ kpt_branch,
+ skeleton,
+ return_attn_map=False):
+
+ bs, c, h, w = src.shape
+
+ src = src.flatten(2).permute(2, 0, 1)
+ pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
+ support_order_embed = support_order_embed.flatten(2).permute(2, 0, 1)
+ pos_embed = torch.cat((pos_embed, support_order_embed))
+ query_embed = support_embed.transpose(0, 1)
+ mask = mask.flatten(1)
+
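+        # The encoder self-attends over the concatenation of query-image
+        # tokens and support keypoint tokens, so the positional embedding
+        # is the image positional encoding concatenated with the support
+        # order embedding.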
+ query_embed, refined_support_embed = self.encoder(
+ src,
+ query_embed,
+ src_key_padding_mask=mask,
+ query_key_padding_mask=query_padding_mask,
+ pos=pos_embed)
+
+ # Generate initial proposals and corresponding positional embedding.
+ initial_proposals_for_loss, similarity_map, initial_proposals = (
+ self.proposal_generator(
+ query_embed, refined_support_embed, spatial_shape=[h, w]))
+ initial_position_embedding = position_embedding.forward_coordinates(
+ initial_proposals)
+
+ outs_dec, out_points, attn_maps = self.decoder(
+ refined_support_embed,
+ query_embed,
+ memory_key_padding_mask=mask,
+ pos=pos_embed,
+ query_pos=initial_position_embedding,
+ tgt_key_padding_mask=query_padding_mask,
+ position_embedding=position_embedding,
+ initial_proposals=initial_proposals,
+ kpt_branch=kpt_branch,
+ skeleton=skeleton,
+ return_attn_map=return_attn_map)
+
+ return outs_dec.transpose(
+ 1, 2), initial_proposals_for_loss, out_points, similarity_map
+
+
+class GraphTransformerDecoder(nn.Module):
+
+ def __init__(self,
+ d_model,
+ decoder_layer,
+ num_layers,
+ norm=None,
+ return_intermediate=False,
+ look_twice=False,
+ detach_support_feat=False):
+ super().__init__()
+ self.layers = _get_clones(decoder_layer, num_layers)
+ self.num_layers = num_layers
+ self.norm = norm
+ self.return_intermediate = return_intermediate
+ self.ref_point_head = MLP(d_model, d_model, d_model, num_layers=2)
+ self.look_twice = look_twice
+ self.detach_support_feat = detach_support_feat
+
+ def forward(self,
+ support_feat,
+ query_feat,
+ tgt_mask=None,
+ memory_mask=None,
+ tgt_key_padding_mask=None,
+ memory_key_padding_mask=None,
+ pos=None,
+ query_pos=None,
+ position_embedding=None,
+ initial_proposals=None,
+ kpt_branch=None,
+ skeleton=None,
+ return_attn_map=False):
+ """
+ position_embedding: Class used to compute positional embedding
+ initial_proposals: [bs, nq, 2], normalized coordinates of initial
+ proposals kpt_branch: MLP used to predict the offsets for each query.
+ """
+
+ refined_support_feat = support_feat
+ intermediate = []
+ attn_maps = []
+ bi = initial_proposals.detach()
+ bi_tag = initial_proposals.detach()
+ query_points = [initial_proposals.detach()]
+
+ tgt_key_padding_mask_remove_all_true = tgt_key_padding_mask.clone().to(
+ tgt_key_padding_mask.device)
+ tgt_key_padding_mask_remove_all_true[
+ tgt_key_padding_mask.logical_not().sum(dim=-1) == 0, 0] = False
+
+ for layer_idx, layer in enumerate(self.layers):
+            if layer_idx == 0:
+                # use the positional embedding from the initial proposals
+ query_pos_embed = query_pos.transpose(0, 1)
+ else:
+ # recalculate the positional embedding
+ query_pos_embed = position_embedding.forward_coordinates(bi)
+ query_pos_embed = query_pos_embed.transpose(0, 1)
+ query_pos_embed = self.ref_point_head(query_pos_embed)
+
+ if self.detach_support_feat:
+ refined_support_feat = refined_support_feat.detach()
+
+ refined_support_feat, attn_map = layer(
+ refined_support_feat,
+ query_feat,
+ tgt_mask=tgt_mask,
+ memory_mask=memory_mask,
+ tgt_key_padding_mask=tgt_key_padding_mask_remove_all_true,
+ memory_key_padding_mask=memory_key_padding_mask,
+ pos=pos,
+ query_pos=query_pos_embed,
+ skeleton=skeleton)
+
+ if self.return_intermediate:
+ intermediate.append(self.norm(refined_support_feat))
+
+ if return_attn_map:
+ attn_maps.append(attn_map)
+
+ # update the query coordinates
+ delta_bi = kpt_branch[layer_idx](
+ refined_support_feat.transpose(0, 1))
+
+ # Prediction loss
+ if self.look_twice:
+ bi_pred = self.update(bi_tag, delta_bi)
+ bi_tag = self.update(bi, delta_bi)
+ else:
+ bi_tag = self.update(bi, delta_bi)
+ bi_pred = bi_tag
+
+ bi = bi_tag.detach()
+ query_points.append(bi_pred)
+
+ if self.norm is not None:
+ refined_support_feat = self.norm(refined_support_feat)
+ if self.return_intermediate:
+ intermediate.pop()
+ intermediate.append(refined_support_feat)
+
+ if self.return_intermediate:
+ return torch.stack(intermediate), query_points, attn_maps
+
+ return refined_support_feat.unsqueeze(0), query_points, attn_maps
+
+ def update(self, query_coordinates, delta_unsig):
+ query_coordinates_unsigmoid = inverse_sigmoid(query_coordinates)
+ new_query_coordinates = query_coordinates_unsigmoid + delta_unsig
+ new_query_coordinates = new_query_coordinates.sigmoid()
+ return new_query_coordinates
+
+
+class GraphTransformerDecoderLayer(nn.Module):
+
+ def __init__(self,
+ d_model,
+ nhead,
+ dim_feedforward=2048,
+ dropout=0.1,
+ activation='relu',
+ normalize_before=False,
+ graph_decoder=None):
+
+ super().__init__()
+ self.graph_decoder = graph_decoder
+ self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
+ self.multihead_attn = nn.MultiheadAttention(
+ d_model * 2, nhead, dropout=dropout, vdim=d_model)
+ self.choker = nn.Linear(in_features=2 * d_model, out_features=d_model)
+ # Implementation of Feedforward model
+ if self.graph_decoder is None:
+ self.ffn1 = nn.Linear(d_model, dim_feedforward)
+ self.ffn2 = nn.Linear(dim_feedforward, d_model)
+ elif self.graph_decoder == 'pre':
+ self.ffn1 = GCNLayer(d_model, dim_feedforward, batch_first=False)
+ self.ffn2 = nn.Linear(dim_feedforward, d_model)
+ elif self.graph_decoder == 'post':
+ self.ffn1 = nn.Linear(d_model, dim_feedforward)
+ self.ffn2 = GCNLayer(dim_feedforward, d_model, batch_first=False)
+ else:
+ self.ffn1 = GCNLayer(d_model, dim_feedforward, batch_first=False)
+ self.ffn2 = GCNLayer(dim_feedforward, d_model, batch_first=False)
+
+ self.dropout = nn.Dropout(dropout)
+ self.norm1 = nn.LayerNorm(d_model)
+ self.norm2 = nn.LayerNorm(d_model)
+ self.norm3 = nn.LayerNorm(d_model)
+ self.dropout1 = nn.Dropout(dropout)
+ self.dropout2 = nn.Dropout(dropout)
+ self.dropout3 = nn.Dropout(dropout)
+
+ self.activation = _get_activation_fn(activation)
+ self.normalize_before = normalize_before
+
+ def with_pos_embed(self, tensor, pos: Optional[Tensor]):
+ return tensor if pos is None else tensor + pos
+
+ def forward(self,
+ refined_support_feat,
+ refined_query_feat,
+ tgt_mask: Optional[Tensor] = None,
+ memory_mask: Optional[Tensor] = None,
+ tgt_key_padding_mask: Optional[Tensor] = None,
+ memory_key_padding_mask: Optional[Tensor] = None,
+ pos: Optional[Tensor] = None,
+ query_pos: Optional[Tensor] = None,
+ skeleton: Optional[list] = None):
+
+ q = k = self.with_pos_embed(
+ refined_support_feat,
+ query_pos + pos[refined_query_feat.shape[0]:])
+ tgt2 = self.self_attn(
+ q,
+ k,
+ value=refined_support_feat,
+ attn_mask=tgt_mask,
+ key_padding_mask=tgt_key_padding_mask)[0]
+
+ refined_support_feat = refined_support_feat + self.dropout1(tgt2)
+ refined_support_feat = self.norm1(refined_support_feat)
+
+ # concatenate the positional embedding with the content feature,
+ # instead of direct addition
+ cross_attn_q = torch.cat(
+ (refined_support_feat,
+ query_pos + pos[refined_query_feat.shape[0]:]),
+ dim=-1)
+ cross_attn_k = torch.cat(
+ (refined_query_feat, pos[:refined_query_feat.shape[0]]), dim=-1)
+
+ tgt2, attn_map = self.multihead_attn(
+ query=cross_attn_q,
+ key=cross_attn_k,
+ value=refined_query_feat,
+ attn_mask=memory_mask,
+ key_padding_mask=memory_key_padding_mask)
+
+ refined_support_feat = refined_support_feat + self.dropout2(
+ self.choker(tgt2))
+ refined_support_feat = self.norm2(refined_support_feat)
+ if self.graph_decoder is not None:
+ num_pts, b, c = refined_support_feat.shape
+ adj = adj_from_skeleton(
+ num_pts=num_pts,
+ skeleton=skeleton,
+ mask=tgt_key_padding_mask,
+ device=refined_support_feat.device)
+ if self.graph_decoder == 'pre':
+ tgt2 = self.ffn2(
+ self.dropout(
+ self.activation(self.ffn1(refined_support_feat, adj))))
+ elif self.graph_decoder == 'post':
+ tgt2 = self.ffn2(
+ self.dropout(
+ self.activation(self.ffn1(refined_support_feat))), adj)
+ else:
+ tgt2 = self.ffn2(
+ self.dropout(
+ self.activation(self.ffn1(refined_support_feat, adj))),
+ adj)
+ else:
+ tgt2 = self.ffn2(
+ self.dropout(self.activation(self.ffn1(refined_support_feat))))
+ refined_support_feat = refined_support_feat + self.dropout3(tgt2)
+ refined_support_feat = self.norm3(refined_support_feat)
+
+ return refined_support_feat, attn_map
+
+
+class TransformerEncoder(nn.Module):
+
+ def __init__(self, encoder_layer, num_layers, norm=None):
+ super().__init__()
+ self.layers = _get_clones(encoder_layer, num_layers)
+ self.num_layers = num_layers
+ self.norm = norm
+
+ def forward(self,
+ src,
+ query,
+ mask: Optional[Tensor] = None,
+ src_key_padding_mask: Optional[Tensor] = None,
+ query_key_padding_mask: Optional[Tensor] = None,
+ pos: Optional[Tensor] = None):
+ # src: [hw, bs, c]
+ # query: [num_query, bs, c]
+ # mask: None by default
+ # src_key_padding_mask: [bs, hw]
+ # query_key_padding_mask: [bs, nq]
+ # pos: [hw, bs, c]
+
+ # organize the input
+ # implement the attention mask to mask out the useless points
+ n, bs, c = src.shape
+ src_cat = torch.cat((src, query), dim=0) # [hw + nq, bs, c]
+ mask_cat = torch.cat((src_key_padding_mask, query_key_padding_mask),
+ dim=1) # [bs, hw+nq]
+ output = src_cat
+
+ for layer in self.layers:
+ output = layer(
+ output,
+ query_length=n,
+ src_mask=mask,
+ src_key_padding_mask=mask_cat,
+ pos=pos)
+
+ if self.norm is not None:
+ output = self.norm(output)
+
+ # resplit the output into src and query
+ refined_query = output[n:, :, :] # [nq, bs, c]
+ output = output[:n, :, :] # [n, bs, c]
+
+ return output, refined_query
+
+
+class TransformerEncoderLayer(nn.Module):
+
+ def __init__(self,
+ d_model,
+ nhead,
+ dim_feedforward=2048,
+ dropout=0.1,
+ activation='relu',
+ normalize_before=False):
+ super().__init__()
+ self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
+ # Implementation of Feedforward model
+ self.linear1 = nn.Linear(d_model, dim_feedforward)
+ self.dropout = nn.Dropout(dropout)
+ self.linear2 = nn.Linear(dim_feedforward, d_model)
+
+ self.norm1 = nn.LayerNorm(d_model)
+ self.norm2 = nn.LayerNorm(d_model)
+ self.dropout1 = nn.Dropout(dropout)
+ self.dropout2 = nn.Dropout(dropout)
+
+ self.activation = _get_activation_fn(activation)
+ self.normalize_before = normalize_before
+
+ def with_pos_embed(self, tensor, pos: Optional[Tensor]):
+ return tensor if pos is None else tensor + pos
+
+ def forward(self,
+ src,
+ query_length,
+ src_mask: Optional[Tensor] = None,
+ src_key_padding_mask: Optional[Tensor] = None,
+ pos: Optional[Tensor] = None):
+ src = self.with_pos_embed(src, pos)
+ q = k = src
+ # NOTE: compared with original implementation, we add positional
+ # embedding into the VALUE.
+ src2 = self.self_attn(
+ q,
+ k,
+ value=src,
+ attn_mask=src_mask,
+ key_padding_mask=src_key_padding_mask)[0]
+ src = src + self.dropout1(src2)
+ src = self.norm1(src)
+ src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
+ src = src + self.dropout2(src2)
+ src = self.norm2(src)
+ return src
+
+
+def adj_from_skeleton(num_pts, skeleton, mask, device='cuda'):
+ adj_mx = torch.empty(0, device=device)
+ batch_size = len(skeleton)
+ for b in range(batch_size):
+ edges = torch.tensor(skeleton[b])
+ adj = torch.zeros(num_pts, num_pts, device=device)
+ adj[edges[:, 0], edges[:, 1]] = 1
+ adj_mx = torch.cat((adj_mx, adj.unsqueeze(0)), dim=0)
+ trans_adj_mx = torch.transpose(adj_mx, 1, 2)
+ cond = (trans_adj_mx > adj_mx).float()
+ adj = adj_mx + trans_adj_mx * cond - adj_mx * cond
+ adj = adj * ~mask[..., None] * ~mask[:, None]
+ adj = torch.nan_to_num(adj / adj.sum(dim=-1, keepdim=True))
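+    # Stack two adjacency kernels for GCNLayer (kernel_size=2): a diagonal
+    # self-connection channel over valid joints and the row-normalized,
+    # symmetrized skeleton adjacency.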
+ adj = torch.stack((torch.diag_embed(~mask), adj), dim=1)
+ return adj
+
+
+class GCNLayer(nn.Module):
+
+ def __init__(self,
+ in_features,
+ out_features,
+ kernel_size=2,
+ use_bias=True,
+ activation=nn.ReLU(inplace=True),
+ batch_first=True):
+ super(GCNLayer, self).__init__()
+ self.conv = nn.Conv1d(
+ in_features,
+ out_features * kernel_size,
+ kernel_size=1,
+ padding=0,
+ stride=1,
+ dilation=1,
+ bias=use_bias)
+ self.kernel_size = kernel_size
+ self.activation = activation
+ self.batch_first = batch_first
+
+ def forward(self, x, adj):
+ assert adj.size(1) == self.kernel_size
+ if not self.batch_first:
+ x = x.permute(1, 2, 0)
+ else:
+ x = x.transpose(1, 2)
+ x = self.conv(x)
+ b, kc, v = x.size()
+ x = x.view(b, self.kernel_size, kc // self.kernel_size, v)
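+        # Split the conv output into ``kernel_size`` channel groups and
+        # aggregate neighbour features with the matching adjacency kernel:
+        # (b, k, c, v) x (b, k, v, w) -> (b, c, w).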
+ x = torch.einsum('bkcv,bkvw->bcw', (x, adj))
+ if self.activation is not None:
+ x = self.activation(x)
+ if not self.batch_first:
+ x = x.permute(2, 0, 1)
+ else:
+ x = x.transpose(1, 2)
+ return x
+
+
+def _get_clones(module, N):
+ return nn.ModuleList([copy.deepcopy(module) for i in range(N)])
+
+
+def _get_activation_fn(activation):
+ """Return an activation function given a string."""
+ if activation == 'relu':
+ return F.relu
+ if activation == 'gelu':
+ return F.gelu
+ if activation == 'glu':
+ return F.glu
+    raise RuntimeError(
+        f'activation should be relu/gelu/glu, not {activation}.')
+
+
+def clones(module, N):
+ return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
diff --git a/projects/pose_anything/models/utils/positional_encoding.py b/projects/pose_anything/models/utils/positional_encoding.py
new file mode 100644
index 0000000000..2fe441a51c
--- /dev/null
+++ b/projects/pose_anything/models/utils/positional_encoding.py
@@ -0,0 +1,195 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import math
+
+import torch
+import torch.nn as nn
+from mmengine.model import BaseModule
+from models.utils.builder import POSITIONAL_ENCODING
+
+
+# TODO: add a SinePositionalEncoding for coordinate input
+@POSITIONAL_ENCODING.register_module()
+class SinePositionalEncoding(BaseModule):
+ """Position encoding with sine and cosine functions.
+
+    See `End-to-End Object Detection with Transformers
+    <https://arxiv.org/pdf/2005.12872>`_ for details.
+
+ Args:
+ num_feats (int): The feature dimension for each position
+ along x-axis or y-axis. Note the final returned dimension
+ for each position is 2 times of this value.
+ temperature (int, optional): The temperature used for scaling
+ the position embedding. Defaults to 10000.
+ normalize (bool, optional): Whether to normalize the position
+ embedding. Defaults to False.
+ scale (float, optional): A scale factor that scales the position
+ embedding. The scale will be used only when `normalize` is True.
+ Defaults to 2*pi.
+ eps (float, optional): A value added to the denominator for
+ numerical stability. Defaults to 1e-6.
+ offset (float): offset add to embed when do the normalization.
+ Defaults to 0.
+ init_cfg (dict or list[dict], optional): Initialization config dict.
+ Default: None
+ """
+
+ def __init__(self,
+ num_feats,
+ temperature=10000,
+ normalize=False,
+ scale=2 * math.pi,
+ eps=1e-6,
+ offset=0.,
+ init_cfg=None):
+ super(SinePositionalEncoding, self).__init__(init_cfg)
+ if normalize:
+ assert isinstance(scale, (float, int)), \
+ (f'when normalize is set, '
+ f'scale should be provided and in float or int type, '
+ f'found {type(scale)}')
+ self.num_feats = num_feats
+ self.temperature = temperature
+ self.normalize = normalize
+ self.scale = scale
+ self.eps = eps
+ self.offset = offset
+
+ def forward(self, mask):
+ """Forward function for `SinePositionalEncoding`.
+
+ Args:
+ mask (Tensor): ByteTensor mask. Non-zero values representing
+ ignored positions, while zero values means valid positions
+ for this image. Shape [bs, h, w].
+
+ Returns:
+ pos (Tensor): Returned position embedding with shape
+ [bs, num_feats*2, h, w].
+ """
+ # For convenience of exporting to ONNX, it's required to convert
+ # `masks` from bool to int.
+ mask = mask.to(torch.int)
+ not_mask = 1 - mask # logical_not
+ y_embed = not_mask.cumsum(
+ 1, dtype=torch.float32
+ ) # [bs, h, w], recording the y coordinate of each pixel
+ x_embed = not_mask.cumsum(2, dtype=torch.float32)
+ if self.normalize: # default True
+ y_embed = (y_embed + self.offset) / \
+ (y_embed[:, -1:, :] + self.eps) * self.scale
+ x_embed = (x_embed + self.offset) / \
+ (x_embed[:, :, -1:] + self.eps) * self.scale
+ dim_t = torch.arange(
+ self.num_feats, dtype=torch.float32, device=mask.device)
+ dim_t = self.temperature**(2 * (dim_t // 2) / self.num_feats)
+ pos_x = x_embed[:, :, :, None] / dim_t # [bs, h, w, num_feats]
+ pos_y = y_embed[:, :, :, None] / dim_t
+ # use `view` instead of `flatten` for dynamically exporting to ONNX
+ B, H, W = mask.size()
+ pos_x = torch.stack(
+ (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()),
+ dim=4).view(B, H, W, -1) # [bs, h, w, num_feats]
+ pos_y = torch.stack(
+ (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()),
+ dim=4).view(B, H, W, -1)
+ pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
+ return pos
+
+ def forward_coordinates(self, coord):
+ """Forward function for normalized coordinates input with the shape of.
+
+ [ bs, kpt, 2] return: pos (Tensor): position embedding with the shape
+ of.
+
+ [bs, kpt, num_feats*2]
+ """
+ x_embed, y_embed = coord[:, :, 0], coord[:, :, 1] # [bs, kpt]
+ x_embed = x_embed * self.scale # [bs, kpt]
+ y_embed = y_embed * self.scale
+
+ dim_t = torch.arange(
+ self.num_feats, dtype=torch.float32, device=coord.device)
+ dim_t = self.temperature**(2 * (dim_t // 2) / self.num_feats)
+
+ pos_x = x_embed[:, :, None] / dim_t # [bs, kpt, num_feats]
+ pos_y = y_embed[:, :, None] / dim_t # [bs, kpt, num_feats]
+ bs, kpt, _ = pos_x.shape
+
+ pos_x = torch.stack((pos_x[:, :, 0::2].sin(), pos_x[:, :, 1::2].cos()),
+ dim=3).view(bs, kpt, -1) # [bs, kpt, num_feats]
+ pos_y = torch.stack((pos_y[:, :, 0::2].sin(), pos_y[:, :, 1::2].cos()),
+ dim=3).view(bs, kpt, -1) # [bs, kpt, num_feats]
+ pos = torch.cat((pos_y, pos_x), dim=2) # [bs, kpt, num_feats * 2]
+
+ return pos
+
+ def __repr__(self):
+ """str: a string that describes the module"""
+ repr_str = self.__class__.__name__
+ repr_str += f'(num_feats={self.num_feats}, '
+ repr_str += f'temperature={self.temperature}, '
+ repr_str += f'normalize={self.normalize}, '
+ repr_str += f'scale={self.scale}, '
+ repr_str += f'eps={self.eps})'
+ return repr_str
+
+
+@POSITIONAL_ENCODING.register_module()
+class LearnedPositionalEncoding(BaseModule):
+ """Position embedding with learnable embedding weights.
+
+ Args:
+ num_feats (int): The feature dimension for each position
+ along x-axis or y-axis. The final returned dimension for
+ each position is 2 times of this value.
+ row_num_embed (int, optional): The dictionary size of row embeddings.
+ Default 50.
+ col_num_embed (int, optional): The dictionary size of col embeddings.
+ Default 50.
+ init_cfg (dict or list[dict], optional): Initialization config dict.
+ """
+
+ def __init__(self,
+ num_feats,
+ row_num_embed=50,
+ col_num_embed=50,
+ init_cfg=dict(type='Uniform', layer='Embedding')):
+ super(LearnedPositionalEncoding, self).__init__(init_cfg)
+ self.row_embed = nn.Embedding(row_num_embed, num_feats)
+ self.col_embed = nn.Embedding(col_num_embed, num_feats)
+ self.num_feats = num_feats
+ self.row_num_embed = row_num_embed
+ self.col_num_embed = col_num_embed
+
+ def forward(self, mask):
+ """Forward function for `LearnedPositionalEncoding`.
+
+ Args:
+ mask (Tensor): ByteTensor mask. Non-zero values representing
+ ignored positions, while zero values means valid positions
+ for this image. Shape [bs, h, w].
+
+ Returns:
+ pos (Tensor): Returned position embedding with shape
+ [bs, num_feats*2, h, w].
+ """
+ h, w = mask.shape[-2:]
+ x = torch.arange(w, device=mask.device)
+ y = torch.arange(h, device=mask.device)
+ x_embed = self.col_embed(x)
+ y_embed = self.row_embed(y)
+ pos = torch.cat(
+ (x_embed.unsqueeze(0).repeat(h, 1, 1), y_embed.unsqueeze(1).repeat(
+ 1, w, 1)),
+ dim=-1).permute(2, 0,
+ 1).unsqueeze(0).repeat(mask.shape[0], 1, 1, 1)
+ return pos
+
+ def __repr__(self):
+ """str: a string that describes the module"""
+ repr_str = self.__class__.__name__
+ repr_str += f'(num_feats={self.num_feats}, '
+ repr_str += f'row_num_embed={self.row_num_embed}, '
+ repr_str += f'col_num_embed={self.col_num_embed})'
+ return repr_str
diff --git a/projects/pose_anything/models/utils/transformer.py b/projects/pose_anything/models/utils/transformer.py
new file mode 100644
index 0000000000..4657b81fde
--- /dev/null
+++ b/projects/pose_anything/models/utils/transformer.py
@@ -0,0 +1,340 @@
+import torch
+import torch.nn as nn
+from mmcv.cnn import build_activation_layer, build_norm_layer
+from mmcv.cnn.bricks.transformer import (BaseTransformerLayer,
+ TransformerLayerSequence,
+ build_transformer_layer_sequence)
+from mmengine.model import BaseModule, xavier_init
+from mmengine.registry import MODELS
+
+
+@MODELS.register_module()
+class Transformer(BaseModule):
+ """Implements the DETR transformer. Following the official DETR
+ implementation, this module copy-paste from torch.nn.Transformer with
+ modifications:
+
+ * positional encodings are passed in MultiheadAttention
+ * extra LN at the end of encoder is removed
+ * decoder returns a stack of activations from all decoding layers
+ See `paper: End-to-End Object Detection with Transformers
+ `_ for details.
+ Args:
+ encoder (`mmcv.ConfigDict` | Dict): Config of
+ TransformerEncoder. Defaults to None.
+ decoder ((`mmcv.ConfigDict` | Dict)): Config of
+ TransformerDecoder. Defaults to None
+ init_cfg (obj:`mmcv.ConfigDict`): The Config for initialization.
+ Defaults to None.
+ """
+
+ def __init__(self, encoder=None, decoder=None, init_cfg=None):
+ super(Transformer, self).__init__(init_cfg=init_cfg)
+ self.encoder = build_transformer_layer_sequence(encoder)
+ self.decoder = build_transformer_layer_sequence(decoder)
+ self.embed_dims = self.encoder.embed_dims
+
+ def init_weights(self):
+ # follow the official DETR to init parameters
+ for m in self.modules():
+ if hasattr(m, 'weight') and m.weight.dim() > 1:
+ xavier_init(m, distribution='uniform')
+ self._is_init = True
+
+ def forward(self, x, mask, query_embed, pos_embed, mask_query):
+ """Forward function for `Transformer`.
+ Args:
+ x (Tensor): Input query with shape [bs, c, h, w] where
+ c = embed_dims.
+ mask (Tensor): The key_padding_mask used for encoder and decoder,
+ with shape [bs, h, w].
+ query_embed (Tensor): The query embedding for decoder, with shape
+ [num_query, c].
+ pos_embed (Tensor): The positional encoding for encoder and
+ decoder, with the same shape as `x`.
+ Returns:
+ tuple[Tensor]: results of decoder containing the following tensor.
+ - out_dec: Output from decoder. If return_intermediate_dec \
+ is True output has shape [num_dec_layers, bs,
+ num_query, embed_dims], else has shape [1, bs, \
+ num_query, embed_dims].
+ - memory: Output results from encoder, with shape \
+ [bs, embed_dims, h, w].
+
+ Notes:
+ x: query image features with shape [bs, c, h, w]
+ mask: mask for x with shape [bs, h, w]
+ pos_embed: positional embedding for x with shape [bs, c, h, w]
+ query_embed: sample keypoint features with shape [bs, num_query, c]
+ mask_query: mask for query_embed with shape [bs, num_query]
+ Outputs:
+ out_dec: [num_layers, bs, num_query, c]
+ memory: [bs, c, h, w]
+
+ """
+ bs, c, h, w = x.shape
+ # use `view` instead of `flatten` for dynamically exporting to ONNX
+ x = x.view(bs, c, -1).permute(2, 0, 1) # [bs, c, h, w] -> [h*w, bs, c]
+        # [bs, h, w] -> [bs, h*w]. Note: this mask should be filled with
+        # False, since all images have the same shape.
+        mask = mask.view(bs, -1)
+        # positional embedding for memory, i.e., the query
+        pos_embed = pos_embed.view(bs, c, -1).permute(2, 0, 1)
+ memory = self.encoder(
+ query=x,
+ key=None,
+ value=None,
+ query_pos=pos_embed,
+ query_key_padding_mask=mask) # output memory: [hw, bs, c]
+
+ query_embed = query_embed.permute(
+ 1, 0, 2) # [bs, num_query, c] -> [num_query, bs, c]
+ # target = torch.zeros_like(query_embed)
+ # out_dec: [num_layers, num_query, bs, c]
+ out_dec = self.decoder(
+ query=query_embed,
+ key=memory,
+ value=memory,
+ key_pos=pos_embed,
+ # query_pos=query_embed,
+ query_key_padding_mask=mask_query,
+ key_padding_mask=mask)
+ out_dec = out_dec.transpose(1, 2) # [decoder_layer, bs, num_query, c]
+ memory = memory.permute(1, 2, 0).reshape(bs, c, h, w)
+ return out_dec, memory
+
+
+@MODELS.register_module()
+class DetrTransformerDecoderLayer(BaseTransformerLayer):
+ """Implements decoder layer in DETR transformer.
+
+ Args:
+        attn_cfgs (list[`mmcv.ConfigDict`] | list[dict] | dict):
+            Configs for self_attention or cross_attention, the order
+            should be consistent with it in `operation_order`. If it is
+            a dict, it will be expanded to the number of attentions in
+            `operation_order`.
+        feedforward_channels (int): The hidden dimension for FFNs.
+        ffn_dropout (float): Probability of an element to be zeroed
+            in FFN. Default: 0.0.
+        operation_order (tuple[str]): The execution order of operations
+            in transformer, such as ('self_attn', 'norm', 'ffn', 'norm').
+            Default: None.
+        act_cfg (dict): The activation config for FFNs. Default: `ReLU`.
+        norm_cfg (dict): Config dict for normalization layer.
+            Default: `LN`.
+        ffn_num_fcs (int): The number of fully-connected layers in FFNs.
+            Default: 2.
+ """
+
+ def __init__(self,
+ attn_cfgs,
+ feedforward_channels,
+ ffn_dropout=0.0,
+ operation_order=None,
+ act_cfg=dict(type='ReLU', inplace=True),
+ norm_cfg=dict(type='LN'),
+ ffn_num_fcs=2,
+ **kwargs):
+ super(DetrTransformerDecoderLayer, self).__init__(
+ attn_cfgs=attn_cfgs,
+ feedforward_channels=feedforward_channels,
+ ffn_dropout=ffn_dropout,
+ operation_order=operation_order,
+ act_cfg=act_cfg,
+ norm_cfg=norm_cfg,
+ ffn_num_fcs=ffn_num_fcs,
+ **kwargs)
+ # assert len(operation_order) == 6
+ # assert set(operation_order) == set(
+ # ['self_attn', 'norm', 'cross_attn', 'ffn'])
+
+
+@MODELS.register_module()
+class DetrTransformerEncoder(TransformerLayerSequence):
+ """TransformerEncoder of DETR.
+
+ Args:
+ post_norm_cfg (dict): Config of last normalization layer. Default:
+ `LN`. Only used when `self.pre_norm` is `True`
+ """
+
+ def __init__(self, *args, post_norm_cfg=dict(type='LN'), **kwargs):
+ super(DetrTransformerEncoder, self).__init__(*args, **kwargs)
+ if post_norm_cfg is not None:
+ self.post_norm = build_norm_layer(
+ post_norm_cfg, self.embed_dims)[1] if self.pre_norm else None
+ else:
+ # assert not self.pre_norm, f'Use prenorm in ' \
+ # f'{self.__class__.__name__},' \
+ # f'Please specify post_norm_cfg'
+ self.post_norm = None
+
+ def forward(self, *args, **kwargs):
+ """Forward function for `TransformerCoder`.
+
+ Returns:
+ Tensor: forwarded results with shape [num_query, bs, embed_dims].
+ """
+ x = super(DetrTransformerEncoder, self).forward(*args, **kwargs)
+ if self.post_norm is not None:
+ x = self.post_norm(x)
+ return x
+
+
+@MODELS.register_module()
+class DetrTransformerDecoder(TransformerLayerSequence):
+ """Implements the decoder in DETR transformer.
+
+ Args:
+ return_intermediate (bool): Whether to return intermediate outputs.
+ post_norm_cfg (dict): Config of last normalization layer. Default:
+ `LN`.
+ """
+
+ def __init__(self,
+ *args,
+ post_norm_cfg=dict(type='LN'),
+ return_intermediate=False,
+ **kwargs):
+
+ super(DetrTransformerDecoder, self).__init__(*args, **kwargs)
+ self.return_intermediate = return_intermediate
+ if post_norm_cfg is not None:
+ self.post_norm = build_norm_layer(post_norm_cfg,
+ self.embed_dims)[1]
+ else:
+ self.post_norm = None
+
+ def forward(self, query, *args, **kwargs):
+ """Forward function for `TransformerDecoder`.
+ Args:
+ query (Tensor): Input query with shape
+ `(num_query, bs, embed_dims)`.
+ Returns:
+ Tensor: Results with shape [1, num_query, bs, embed_dims] when
+ return_intermediate is `False`, otherwise it has shape
+ [num_layers, num_query, bs, embed_dims].
+ """
+ if not self.return_intermediate:
+ x = super().forward(query, *args, **kwargs)
+ if self.post_norm:
+ x = self.post_norm(x)[None]
+ return x
+
+ intermediate = []
+ for layer in self.layers:
+ query = layer(query, *args, **kwargs)
+ if self.return_intermediate:
+ if self.post_norm is not None:
+ intermediate.append(self.post_norm(query))
+ else:
+ intermediate.append(query)
+ return torch.stack(intermediate)
+
+
+@MODELS.register_module()
+class DynamicConv(BaseModule):
+ """Implements Dynamic Convolution.
+
+    This module generates parameters for each sample and
+    uses bmm to implement 1x1 convolution. Code is modified
+    from the official github repo.
+ Args:
+ in_channels (int): The input feature channel.
+ Defaults to 256.
+ feat_channels (int): The inner feature channel.
+ Defaults to 64.
+ out_channels (int, optional): The output feature channel.
+ When not specified, it will be set to `in_channels`
+ by default
+ input_feat_shape (int): The shape of input feature.
+ Defaults to 7.
+        with_proj (bool): Project two-dimensional feature to
+            one-dimensional feature. Defaults to True.
+ act_cfg (dict): The activation config for DynamicConv.
+ norm_cfg (dict): Config dict for normalization layer. Default
+ layer normalization.
+ init_cfg (obj:`mmcv.ConfigDict`): The Config for initialization.
+ Default: None.
+ """
+
+ def __init__(self,
+ in_channels=256,
+ feat_channels=64,
+ out_channels=None,
+ input_feat_shape=7,
+ with_proj=True,
+ act_cfg=dict(type='ReLU', inplace=True),
+ norm_cfg=dict(type='LN'),
+ init_cfg=None):
+ super(DynamicConv, self).__init__(init_cfg)
+ self.in_channels = in_channels
+ self.feat_channels = feat_channels
+ self.out_channels_raw = out_channels
+ self.input_feat_shape = input_feat_shape
+ self.with_proj = with_proj
+ self.act_cfg = act_cfg
+ self.norm_cfg = norm_cfg
+ self.out_channels = out_channels if out_channels else in_channels
+
+ self.num_params_in = self.in_channels * self.feat_channels
+ self.num_params_out = self.out_channels * self.feat_channels
+ self.dynamic_layer = nn.Linear(
+ self.in_channels, self.num_params_in + self.num_params_out)
+
+ self.norm_in = build_norm_layer(norm_cfg, self.feat_channels)[1]
+ self.norm_out = build_norm_layer(norm_cfg, self.out_channels)[1]
+
+ self.activation = build_activation_layer(act_cfg)
+
+ num_output = self.out_channels * input_feat_shape**2
+ if self.with_proj:
+ self.fc_layer = nn.Linear(num_output, self.out_channels)
+ self.fc_norm = build_norm_layer(norm_cfg, self.out_channels)[1]
+
+ def forward(self, param_feature, input_feature):
+ """Forward function for `DynamicConv`.
+
+ Args:
+ param_feature (Tensor): The feature can be used
+ to generate the parameter, has shape
+ (num_all_proposals, in_channels).
+ input_feature (Tensor): Feature that
+ interact with parameters, has shape
+ (num_all_proposals, in_channels, H, W).
+ Returns:
+ Tensor: The output feature has shape
+ (num_all_proposals, out_channels).
+ """
+ input_feature = input_feature.flatten(2).permute(2, 0, 1)
+
+ input_feature = input_feature.permute(1, 0, 2)
+ parameters = self.dynamic_layer(param_feature)
+
+ param_in = parameters[:, :self.num_params_in].view(
+ -1, self.in_channels, self.feat_channels)
+ param_out = parameters[:, -self.num_params_out:].view(
+ -1, self.feat_channels, self.out_channels)
+
+ # input_feature has shape (num_all_proposals, H*W, in_channels)
+ # param_in has shape (num_all_proposals, in_channels, feat_channels)
+ # feature has shape (num_all_proposals, H*W, feat_channels)
+ features = torch.bmm(input_feature, param_in)
+ features = self.norm_in(features)
+ features = self.activation(features)
+
+        # param_out has shape (num_all_proposals, feat_channels, out_channels)
+ features = torch.bmm(features, param_out)
+ features = self.norm_out(features)
+ features = self.activation(features)
+
+ if self.with_proj:
+ features = features.flatten(1)
+ features = self.fc_layer(features)
+ features = self.fc_norm(features)
+ features = self.activation(features)
+
+ return features
diff --git a/projects/pose_anything/tools/visualization.py b/projects/pose_anything/tools/visualization.py
new file mode 100644
index 0000000000..a9a4c6f37d
--- /dev/null
+++ b/projects/pose_anything/tools/visualization.py
@@ -0,0 +1,89 @@
+import os
+import random
+
+import matplotlib.pyplot as plt
+import numpy as np
+
+COLORS = ([255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0],
+          [170, 255, 0], [85, 255, 0], [0, 255, 0], [0, 255, 85],
+          [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255],
+          [0, 0, 255], [85, 0, 255], [170, 0, 255], [255, 0, 255],
+          [255, 0, 170], [255, 0, 85], [255, 0, 0])
+
+
+def plot_results(support_img,
+ query_img,
+ support_kp,
+ support_w,
+ query_kp,
+ query_w,
+ skeleton,
+ initial_proposals,
+ prediction,
+ radius=6,
+ out_dir='./heatmaps'):
+ img_names = [
+ img.split('_')[0] for img in os.listdir(out_dir)
+ if str_is_int(img.split('_')[0])
+ ]
+ if len(img_names) > 0:
+ name_idx = max([int(img_name) for img_name in img_names]) + 1
+ else:
+ name_idx = 0
+
+ h, w, c = support_img.shape
+ prediction = prediction[-1].cpu().numpy() * h
+ support_img = (support_img - np.min(support_img)) / (
+ np.max(support_img) - np.min(support_img))
+ query_img = (query_img - np.min(query_img)) / (
+ np.max(query_img) - np.min(query_img))
+
+ for index, (img, w, keypoint) in enumerate(
+ zip([support_img, query_img], [support_w, query_w],
+ [support_kp, prediction])):
+ f, axes = plt.subplots()
+ plt.imshow(img)
+ for k in range(keypoint.shape[0]):
+ if w[k] > 0:
+ kp = keypoint[k, :2]
+ c = (1, 0, 0, 0.75) if w[k] == 1 else (0, 0, 1, 0.6)
+ patch = plt.Circle(kp, radius, color=c)
+ axes.add_patch(patch)
+ axes.text(kp[0], kp[1], k)
+ plt.draw()
+ for limb_index, limb in enumerate(skeleton):
+ kp = keypoint[:, :2]
+ if limb_index > len(COLORS) - 1:
+ c = [x / 255 for x in random.sample(range(0, 255), 3)]
+ else:
+ c = [x / 255 for x in COLORS[limb_index]]
+ if w[limb[0]] > 0 and w[limb[1]] > 0:
+ patch = plt.Line2D([kp[limb[0], 0], kp[limb[1], 0]],
+ [kp[limb[0], 1], kp[limb[1], 1]],
+ linewidth=6,
+ color=c,
+ alpha=0.6)
+ axes.add_artist(patch)
+ plt.axis('off') # command for hiding the axis.
+ name = 'support' if index == 0 else 'query'
+ plt.savefig(
+ f'./{out_dir}/{str(name_idx)}_{str(name)}.png',
+ bbox_inches='tight',
+ pad_inches=0)
+ if index == 1:
+ plt.show()
+ plt.clf()
+ plt.close('all')
+
+
+def str_is_int(s):
+ try:
+ int(s)
+ return True
+ except ValueError:
+ return False
diff --git a/projects/rtmo/README.md b/projects/rtmo/README.md
new file mode 100644
index 0000000000..18d9cd0ba3
--- /dev/null
+++ b/projects/rtmo/README.md
@@ -0,0 +1,140 @@
+# RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
+
+
+
+RTMO is a one-stage pose estimation model that achieves performance comparable to RTMPose. It has the following key advantages:
+
+- **Faster inference speed when multiple people are present** - RTMO runs faster than RTMPose on images with more than 4 persons. This makes it well-suited for real-time multi-person pose estimation.
+- **No dependency on human detectors** - Since RTMO is a one-stage model, it does not rely on an auxiliary human detector. This simplifies the pipeline and deployment.
+
+👉🏼 TRY RTMO NOW
+
+```bash
+python demo/inferencer_demo.py $IMAGE --pose2d rtmo --vis-out-dir vis_results
+```
+
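+RTMO can also be called from Python via `MMPoseInferencer`. The snippet below is a minimal sketch; the `'rtmo'` alias picks a default RTMO model, mirroring the Gradio demo in `projects/rtmpose/app.py`, and the image path is a placeholder:
+
+```python
+from mmpose.apis import MMPoseInferencer
+
+# 'rtmo' is a model alias recognized by MMPoseInferencer
+inferencer = MMPoseInferencer(pose2d='rtmo')
+
+# the inferencer yields one result per input; save visualizations to a folder
+result = next(inferencer('path/to/image.jpg', vis_out_dir='vis_results'))
+```
+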
+## 📜 Introduction
+
+Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy.
+
+
+
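+The coordinate classification idea can be sketched as follows. This is a schematic, SimCC-style illustration only (the bin layout, target values and the dynamic coordinate classifier in RTMO differ); it simply shows how a 2-D keypoint becomes two 1-D heatmaps and how decoding reduces to an argmax per axis:
+
+```python
+import torch
+
+# a keypoint at (x, y) = (123.4, 56.7) on a 640x640 image, one bin per pixel
+num_bins, sigma = 640, 2.0
+bins = torch.arange(num_bins, dtype=torch.float32)
+x_heatmap = torch.exp(-(bins - 123.4)**2 / (2 * sigma**2))  # 1-D x heatmap
+y_heatmap = torch.exp(-(bins - 56.7)**2 / (2 * sigma**2))   # 1-D y heatmap
+
+# decoding each axis is a simple argmax over its 1-D distribution
+print(x_heatmap.argmax().item(), y_heatmap.argmax().item())  # 123 57
+```
+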
+Refer to [our paper](https://arxiv.org/abs/2312.07526) for more details.
+
+## 🎉 News
+
+- **`2023/12/13`**: The RTMO paper and models are released!
+
+## 🗂️ Model Zoo
+
+### Results on COCO val2017 dataset
+
+| Model | Train Set | Latency (ms) | AP | AP50 | AP75 | AR | AR50 | Download |
+| :----------------------------------------------------------- | :-------: | :----------: | :---: | :-------------: | :-------------: | :---: | :-------------: | :--------------------------------------------------------------: |
+| [RTMO-s](/configs/body_2d_keypoint/rtmo/coco/rtmo-s_8xb32-600e_coco-640x640.py) | COCO | 8.9 | 0.677 | 0.878 | 0.737 | 0.715 | 0.908 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_coco-640x640-8db55a59_20231211.pth) |
+| [RTMO-m](/configs/body_2d_keypoint/rtmo/coco/rtmo-m_16xb16-600e_coco-640x640.py) | COCO | 12.4 | 0.709 | 0.890 | 0.778 | 0.747 | 0.920 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_coco-640x640-6f4e0306_20231211.pth) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/coco/rtmo-l_16xb16-600e_coco-640x640.py) | COCO | 19.1 | 0.724 | 0.899 | 0.788 | 0.762 | 0.927 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_coco-640x640-516a421f_20231211.pth) |
+| [RTMO-t](/configs/body_2d_keypoint/rtmo/body7/rtmo-t_8xb32-600e_body7-416x416.py) | body7 | - | 0.574 | 0.803 | 0.613 | 0.611 | 0.836 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-t_8xb32-600e_body7-416x416-f48f75cb_20231219.pth) \| [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-t_8xb32-600e_body7-416x416-f48f75cb_20231219.zip) |
+| [RTMO-s](/configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py) | body7 | 8.9 | 0.686 | 0.879 | 0.744 | 0.723 | 0.908 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth) \| [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.zip) |
+| [RTMO-m](/configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py) | body7 | 12.4 | 0.726 | 0.899 | 0.790 | 0.763 | 0.926 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-m_16xb16-600e_body7-640x640-39e78cc4_20231211.pth) \| [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-m_16xb16-600e_body7-640x640-39e78cc4_20231211.zip) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/body7/rtmo-l_16xb16-600e_body7-640x640.py) | body7 | 19.1 | 0.748 | 0.911 | 0.813 | 0.786 | 0.939 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-600e_body7-640x640-b37118ce_20231211.pth) \| [onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmo/onnx_sdk/rtmo-l_16xb16-600e_body7-640x640-b37118ce_20231211.zip) |
+
+- The latency is evaluated on a single V100 GPU with ONNXRuntime backend.
+- "body7" refers to a combined dataset composed of [AI Challenger](https://github.com/AIChallenger/AI_Challenger_2017), [COCO](http://cocodataset.org/), [CrowdPose](https://github.com/Jeff-sjtu/CrowdPose), [Halpe](https://github.com/Fang-Haoshu/Halpe-FullBody/), [MPII](http://human-pose.mpi-inf.mpg.de/), [PoseTrack18](https://posetrack.net/users/download.php) and [sub-JHMDB](http://jhmdb.is.tue.mpg.de/dataset).
+
+### Results on CrowdPose test dataset
+
+| Model | Train Set | AP | AP50 | AP75 | AP (E) | AP (M) | AP (H) | Download |
+| :------------------------------------------------------------------ | :-------: | :---: | :-------------: | :-------------: | :----: | :----: | :----: | :---------------------------------------------------------------------: |
+| [RTMO-s](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-s_8xb32-700e_crowdpose-640x640.py) | CrowdPose | 0.673 | 0.882 | 0.729 | 0.737 | 0.682 | 0.591 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-s_8xb32-700e_crowdpose-640x640-79f81c0d_20231211.pth) |
+| [RTMO-m](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-m_16xb16-700e_crowdpose-640x640.py) | CrowdPose | 0.711 | 0.897 | 0.771 | 0.774 | 0.719 | 0.634 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rrtmo-m_16xb16-700e_crowdpose-640x640-0eaf670d_20231211.pth) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_crowdpose-640x640.py) | CrowdPose | 0.732 | 0.907 | 0.793 | 0.792 | 0.741 | 0.653 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_crowdpose-640x640-1008211f_20231211.pth) |
+| [RTMO-l](/configs/body_2d_keypoint/rtmo/crowdpose/rtmo-l_16xb16-700e_body7-crowdpose-640x640.py) | body7 | 0.838 | 0.947 | 0.893 | 0.888 | 0.847 | 0.772 | [ckpt](https://download.openmmlab.com/mmpose/v1/projects/rtmo/rtmo-l_16xb16-700e_body7-crowdpose-640x640-5bafdc11_20231219.pth) |
+
+## 🖥️ Train and Evaluation
+
+### Dataset Preparation
+
+Please follow [this instruction](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html) to prepare the training and testing datasets.
+
+### Train
+
+Under the root directory of mmpose, run the following command to train models:
+
+```sh
+sh tools/dist_train.sh $CONFIG $NUM_GPUS --work-dir $WORK_DIR --amp
+```
+
+- The `--amp` flag enables Automatic Mixed Precision (AMP) training, which reduces GPU memory consumption.
+
+### Evaluation
+
+Under the root directory of mmpose, run the following command to evaluate models:
+
+```sh
+sh tools/dist_test.sh $CONFIG $PATH_TO_CHECKPOINT $NUM_GPUS
+```
+
+See [here](https://mmpose.readthedocs.io/en/latest/user_guides/train_and_test.html) for more training and evaluation details.
+
+## 🛞 Deployment
+
+[MMDeploy](https://github.com/open-mmlab/mmdeploy) provides tools for easy deployment of RTMO models. [\[Install Now\]](https://mmdeploy.readthedocs.io/en/latest/get_started.html#installation)
+
+**⭕ Notice**:
+
+- PyTorch **1.12+** is required to export the ONNX model of RTMO!
+
+- MMDeploy v1.3.1+ is required to deploy RTMO.
+
+### ONNX Model Export
+
+Under mmdeploy root, run:
+
+```sh
+python tools/deploy.py \
+ configs/mmpose/pose-detection_rtmo_onnxruntime_dynamic-640x640.py \
+ $RTMO_CONFIG $RTMO_CHECKPOINT \
+ $MMPOSE_ROOT/tests/data/coco/000000197388.jpg \
+ --work-dir $WORK_DIR --dump-info \
+ [--show] [--device $DEVICE]
+```
+
+### TensorRT Model Export
+
+[Install TensorRT](https://mmdeploy.readthedocs.io/en/latest/05-supported-backends/tensorrt.html#install-tensorrt) and [build custom ops](https://mmdeploy.readthedocs.io/en/latest/05-supported-backends/tensorrt.html#build-custom-ops) first.
+
+Then under mmdeploy root, run:
+
+```sh
+python tools/deploy.py \
+ configs/mmpose/pose-detection_rtmo_tensorrt-fp16_dynamic-640x640.py \
+ $RTMO_CONFIG $RTMO_CHECKPOINT \
+ $MMPOSE_ROOT/tests/data/coco/000000197388.jpg \
+ --work-dir $WORK_DIR --dump-info \
+ --device cuda:0 [--show]
+```
+
+This conversion takes several minutes. A GPU is required for TensorRT model export.
+
+## ⭐ Citation
+
+If this project benefits your work, please kindly consider citing the original paper and MMPose:
+
+```bibtex
+@misc{lu2023rtmo,
+ title={{RTMO}: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation},
+ author={Peng Lu and Tao Jiang and Yining Li and Xiangtai Li and Kai Chen and Wenming Yang},
+ year={2023},
+ eprint={2312.07526},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+
+@misc{mmpose2020,
+ title={OpenMMLab Pose Estimation Toolbox and Benchmark},
+ author={MMPose Contributors},
+ howpublished = {\url{https://github.com/open-mmlab/mmpose}},
+ year={2020}
+}
+```
diff --git a/projects/rtmpose/README.md b/projects/rtmpose/README.md
index a304a69b0d..94bbc2ca2d 100644
--- a/projects/rtmpose/README.md
+++ b/projects/rtmpose/README.md
@@ -44,6 +44,8 @@ ______________________________________________________________________
## 🥳 🚀 What's New [🔝](#-table-of-contents)
+- Dec. 2023:
+ - Update RTMW models. The RTMW-l model achieves 70.1 mAP on COCO-Wholebody val set.
- Sep. 2023:
- Add RTMW models trained on combined datasets. The alpha version of RTMW-x model achieves 70.2 mAP on COCO-Wholebody val set. You can try it [Here](https://openxlab.org.cn/apps/detail/mmpose/RTMPose). The technical report will be released soon.
- Add YOLOX and RTMDet models trained on HumanArt dataset.
@@ -292,9 +294,9 @@ For more details, please refer to [GroupFisher Pruning for RTMPose](./rtmpose/pr
-Cocktail13
+Cocktail14
-- `Cocktail13` denotes model trained on 13 public datasets:
+- `Cocktail14` denotes the model trained on 14 public datasets:
- [AI Challenger](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#aic)
- [CrowdPose](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#crowdpose)
- [MPII](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#mpii)
@@ -308,11 +310,15 @@ For more details, please refer to [GroupFisher Pruning for RTMPose](./rtmpose/pr
- [300W](https://ibug.doc.ic.ac.uk/resources/300-W/)
- [COFW](http://www.vision.caltech.edu/xpburgos/ICCV13/)
- [LaPa](https://github.com/JDAI-CV/lapa-dataset)
+ - [InterHand](https://mks0601.github.io/InterHand2.6M/)
 | Config | Input Size | Whole AP | Whole AR | FLOPS<br>(G) | ORT-Latency<br>(ms)<br>(i7-11700) | TRT-FP16-Latency<br>(ms)<br>(GTX 1660Ti) | Download |
 | :------------------------------ | :--------: | :------: | :------: | :---------------: | :-----------------------------------------: | :------------------------------------------------: | :-------------------------------: |
-| [RTMW-x<br>(alpha version)](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail13-256x192.py) | 256x192 | 67.2 | 75.4 | 13.1 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-256x192-fbef0d61_20230925.pth) |
-| [RTMW-x<br>(alpha version)](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail13-384x288.py) | 384x288 | 70.2 | 77.9 | 29.3 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-384x288-0949e3a9_20230925.pth) |
+| [RTMW-m](./rtmpose/wholebody_2d_keypoint/rtmw-m_8xb1024-270e_cocktail14-256x192.py) | 256x192 | 58.2 | 67.3 | 4.3 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-l-m_simcc-cocktail14_270e-256x192-20231122.pth)<br>[onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmw/onnx_sdk/rtmw-dw-m-s_simcc-cocktail14_270e-256x192_20231122.zip) |
+| [RTMW-l](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py) | 256x192 | 66.0 | 74.6 | 7.9 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-256x192-20231122.pth)<br>[onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmw/onnx_sdk/rtmw-dw-x-l_simcc-cocktail14_270e-256x192_20231122.zip) |
+| [RTMW-x](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py) | 256x192 | 67.2 | 75.2 | 13.1 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-256x192-13a2546d_20231208.pth) |
+| [RTMW-l](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py) | 384x288 | 70.1 | 78.0 | 17.7 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-384x288-20231122.pth)<br>[onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmw/onnx_sdk/rtmw-dw-x-l_simcc-cocktail14_270e-384x288_20231122.zip) |
+| [RTMW-x](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py) | 384x288 | 70.2 | 78.1 | 29.3 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth) |
@@ -1017,51 +1023,51 @@ if __name__ == '__main__':
Here is a basic example of SDK C++ API:
```C++
-#include "mmdeploy/detector.hpp"
-
-#include "opencv2/imgcodecs/imgcodecs.hpp"
-#include "utils/argparse.h"
-#include "utils/visualize.h"
-
-DEFINE_ARG_string(model, "Model path");
-DEFINE_ARG_string(image, "Input image path");
-DEFINE_string(device, "cpu", R"(Device name, e.g. "cpu", "cuda")");
-DEFINE_string(output, "detector_output.jpg", "Output image path");
-
-DEFINE_double(det_thr, .5, "Detection score threshold");
+#include "mmdeploy/pose_detector.hpp"
+#include "opencv2/imgcodecs.hpp"
+#include "opencv2/imgproc.hpp"
+#include "utils/argparse.h" // See: https://github.com/open-mmlab/mmdeploy/blob/main/demo/csrc/cpp/utils/argparse.h
+
+DEFINE_ARG_string(model_path, "Model path");
+DEFINE_ARG_string(image_path, "Input image path");
+DEFINE_string(device_name, "cpu", R"(Device name, e.g. "cpu", "cuda")");
+DEFINE_int32(bbox_x, -1, R"(x position of the bounding box)");
+DEFINE_int32(bbox_y, -1, R"(y position of the bounding box)");
+DEFINE_int32(bbox_w, -1, R"(width of the bounding box)");
+DEFINE_int32(bbox_h, -1, R"(height of the bounding box)");
int main(int argc, char* argv[]) {
if (!utils::ParseArguments(argc, argv)) {
return -1;
}
- cv::Mat img = cv::imread(ARGS_image);
- if (img.empty()) {
- fprintf(stderr, "failed to load image: %s\n", ARGS_image.c_str());
- return -1;
- }
+ cv::Mat img = cv::imread(ARGS_image_path);
+
+ mmdeploy::PoseDetector detector(mmdeploy::Model{ARGS_model_path}, mmdeploy::Device{FLAGS_device_name, 0});
- // construct a detector instance
- mmdeploy::Detector detector(mmdeploy::Model{ARGS_model}, mmdeploy::Device{FLAGS_device});
-
- // apply the detector, the result is an array-like class holding references to
- // `mmdeploy_detection_t`, will be released automatically on destruction
- mmdeploy::Detector::Result dets = detector.Apply(img);
-
- // visualize
- utils::Visualize v;
- auto sess = v.get_session(img);
- int count = 0;
- for (const mmdeploy_detection_t& det : dets) {
- if (det.score > FLAGS_det_thr) { // filter bboxes
- sess.add_det(det.bbox, det.label_id, det.score, det.mask, count++);
- }
+ mmdeploy::PoseDetector::Result result{0, 0, nullptr};
+
+ if (FLAGS_bbox_x == -1 || FLAGS_bbox_y == -1 || FLAGS_bbox_w == -1 || FLAGS_bbox_h == -1) {
+ result = detector.Apply(img);
+ } else {
+ // convert (x, y, w, h) -> (left, top, right, bottom)
+ mmdeploy::cxx::Rect rect;
+ rect.left = FLAGS_bbox_x;
+ rect.top = FLAGS_bbox_y;
+ rect.right = FLAGS_bbox_x + FLAGS_bbox_w;
+ rect.bottom = FLAGS_bbox_y + FLAGS_bbox_h;
+ result = detector.Apply(img, {rect});
}
- if (!FLAGS_output.empty()) {
- cv::imwrite(FLAGS_output, sess.get());
+ // Draw circles at detected keypoints
+ for (size_t i = 0; i < result[0].length; ++i) {
+ cv::Point keypoint(result[0].point[i].x, result[0].point[i].y);
+    cv::circle(img, keypoint, 1, cv::Scalar(0, 255, 0), 2);  // draw a green dot per keypoint
}
+ // Save the output image
+ cv::imwrite("output_pose.png", img);
+
return 0;
}
```
diff --git a/projects/rtmpose/README_CN.md b/projects/rtmpose/README_CN.md
index 859d0e9364..6b33a20e9f 100644
--- a/projects/rtmpose/README_CN.md
+++ b/projects/rtmpose/README_CN.md
@@ -40,6 +40,8 @@ ______________________________________________________________________
## 🥳 最新进展 [🔝](#-table-of-contents)
+- 2023 年 12 月:
+  - 更新 RTMW 模型,RTMW-l 在 COCO-Wholebody 验证集上取得 70.1 mAP。
- 2023 年 9 月:
- 发布混合数据集上训练的 RTMW 模型。Alpha 版本的 RTMW-x 在 COCO-Wholebody 验证集上取得了 70.2 mAP。[在线 Demo](https://openxlab.org.cn/apps/detail/mmpose/RTMPose) 已支持 RTMW。技术报告正在撰写中。
- 增加 HumanArt 上训练的 YOLOX 和 RTMDet 模型。
@@ -283,7 +285,7 @@ RTMPose 是一个长期优化迭代的项目,致力于业务场景下的高性
-Cocktail13
+Cocktail14
- `Cocktail13` 代表模型在 13 个开源数据集上训练得到:
- [AI Challenger](https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html#aic)
@@ -299,11 +301,15 @@ RTMPose 是一个长期优化迭代的项目,致力于业务场景下的高性
- [300W](https://ibug.doc.ic.ac.uk/resources/300-W/)
- [COFW](http://www.vision.caltech.edu/xpburgos/ICCV13/)
- [LaPa](https://github.com/JDAI-CV/lapa-dataset)
+ - [InterHand](https://mks0601.github.io/InterHand2.6M/)
 | Config | Input Size | Whole AP | Whole AR | FLOPS<br>(G) | ORT-Latency<br>(ms)<br>(i7-11700) | TRT-FP16-Latency<br>(ms)<br>(GTX 1660Ti) | Download |
 | :------------------------------ | :--------: | :------: | :------: | :---------------: | :-----------------------------------------: | :------------------------------------------------: | :-------------------------------: |
-| [RTMW-x<br>(alpha version)](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail13-256x192.py) | 256x192 | 67.2 | 75.4 | 13.1 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-256x192-fbef0d61_20230925.pth) |
-| [RTMW-x<br>(alpha version)](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail13-384x288.py) | 384x288 | 70.2 | 77.9 | 29.3 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail13_pt-ucoco_270e-384x288-0949e3a9_20230925.pth) |
+| [RTMW-m](./rtmpose/wholebody_2d_keypoint/rtmw-m_8xb1024-270e_cocktail14-256x192.py) | 256x192 | 58.2 | 67.3 | 4.3 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-l-m_simcc-cocktail14_270e-256x192-20231122.pth)<br>[onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmw/onnx_sdk/rtmw-dw-m-s_simcc-cocktail14_270e-256x192_20231122.zip) |
+| [RTMW-l](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py) | 256x192 | 66.0 | 74.6 | 7.9 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-256x192-20231122.pth)<br>[onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmw/onnx_sdk/rtmw-dw-x-l_simcc-cocktail14_270e-256x192_20231122.zip) |
+| [RTMW-x](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py) | 256x192 | 67.2 | 75.2 | 13.1 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-256x192-13a2546d_20231208.pth) |
+| [RTMW-l](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py) | 384x288 | 70.1 | 78.0 | 17.7 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-384x288-20231122.pth)<br>[onnx](https://download.openmmlab.com/mmpose/v1/projects/rtmw/onnx_sdk/rtmw-dw-x-l_simcc-cocktail14_270e-384x288_20231122.zip) |
+| [RTMW-x](./rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py) | 384x288 | 70.2 | 78.1 | 29.3 | - | - | [pth](https://download.openmmlab.com/mmpose/v1/projects/rtmw/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth) |
@@ -1010,51 +1016,51 @@ if __name__ == '__main__':
#### C++ API
```C++
-#include "mmdeploy/detector.hpp"
-
-#include "opencv2/imgcodecs/imgcodecs.hpp"
-#include "utils/argparse.h"
-#include "utils/visualize.h"
-
-DEFINE_ARG_string(model, "Model path");
-DEFINE_ARG_string(image, "Input image path");
-DEFINE_string(device, "cpu", R"(Device name, e.g. "cpu", "cuda")");
-DEFINE_string(output, "detector_output.jpg", "Output image path");
-
-DEFINE_double(det_thr, .5, "Detection score threshold");
+#include "mmdeploy/pose_detector.hpp"
+#include "opencv2/imgcodecs.hpp"
+#include "opencv2/imgproc.hpp"
+#include "utils/argparse.h" // See: https://github.com/open-mmlab/mmdeploy/blob/main/demo/csrc/cpp/utils/argparse.h
+
+DEFINE_ARG_string(model_path, "Model path");
+DEFINE_ARG_string(image_path, "Input image path");
+DEFINE_string(device_name, "cpu", R"(Device name, e.g. "cpu", "cuda")");
+DEFINE_int32(bbox_x, -1, R"(x position of the bounding box)");
+DEFINE_int32(bbox_y, -1, R"(y position of the bounding box)");
+DEFINE_int32(bbox_w, -1, R"(width of the bounding box)");
+DEFINE_int32(bbox_h, -1, R"(height of the bounding box)");
int main(int argc, char* argv[]) {
if (!utils::ParseArguments(argc, argv)) {
return -1;
}
- cv::Mat img = cv::imread(ARGS_image);
- if (img.empty()) {
- fprintf(stderr, "failed to load image: %s\n", ARGS_image.c_str());
- return -1;
- }
+ cv::Mat img = cv::imread(ARGS_image_path);
+
+ mmdeploy::PoseDetector detector(mmdeploy::Model{ARGS_model_path}, mmdeploy::Device{FLAGS_device_name, 0});
- // construct a detector instance
- mmdeploy::Detector detector(mmdeploy::Model{ARGS_model}, mmdeploy::Device{FLAGS_device});
-
- // apply the detector, the result is an array-like class holding references to
- // `mmdeploy_detection_t`, will be released automatically on destruction
- mmdeploy::Detector::Result dets = detector.Apply(img);
-
- // visualize
- utils::Visualize v;
- auto sess = v.get_session(img);
- int count = 0;
- for (const mmdeploy_detection_t& det : dets) {
- if (det.score > FLAGS_det_thr) { // filter bboxes
- sess.add_det(det.bbox, det.label_id, det.score, det.mask, count++);
- }
+ mmdeploy::PoseDetector::Result result{0, 0, nullptr};
+
+ if (FLAGS_bbox_x == -1 || FLAGS_bbox_y == -1 || FLAGS_bbox_w == -1 || FLAGS_bbox_h == -1) {
+ result = detector.Apply(img);
+ } else {
+ // convert (x, y, w, h) -> (left, top, right, bottom)
+ mmdeploy::cxx::Rect rect;
+ rect.left = FLAGS_bbox_x;
+ rect.top = FLAGS_bbox_y;
+ rect.right = FLAGS_bbox_x + FLAGS_bbox_w;
+ rect.bottom = FLAGS_bbox_y + FLAGS_bbox_h;
+ result = detector.Apply(img, {rect});
}
- if (!FLAGS_output.empty()) {
- cv::imwrite(FLAGS_output, sess.get());
+ // Draw circles at detected keypoints
+ for (size_t i = 0; i < result[0].length; ++i) {
+ cv::Point keypoint(result[0].point[i].x, result[0].point[i].y);
+    cv::circle(img, keypoint, 1, cv::Scalar(0, 255, 0), 2);  // draw a green dot per keypoint
}
+ // Save the output image
+ cv::imwrite("output_pose.png", img);
+
return 0;
}
```
diff --git a/projects/rtmpose/app.py b/projects/rtmpose/app.py
new file mode 100644
index 0000000000..6b5be8f9bb
--- /dev/null
+++ b/projects/rtmpose/app.py
@@ -0,0 +1,200 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+
+import os
+from functools import partial
+
+# prepare environment
+project_path = os.path.join(os.path.dirname(os.path.abspath(__file__)))
+mmpose_path = project_path.split('/projects', 1)[0]
+
+os.system('python -m pip install Openmim')
+os.system('python -m pip install openxlab')
+os.system('python -m pip install gradio==3.38.0')
+
+os.system('python -m mim install "mmcv>=2.0.0"')
+os.system('python -m mim install "mmengine>=0.9.0"')
+os.system('python -m mim install "mmdet>=3.0.0"')
+os.system(f'python -m mim install -e {mmpose_path}')
+
+import gradio as gr # noqa
+from openxlab.model import download # noqa
+
+from mmpose.apis import MMPoseInferencer # noqa
+
+# download checkpoints
+download(model_repo='mmpose/RTMPose', model_name='dwpose-l')
+download(model_repo='mmpose/RTMPose', model_name='RTMW-x')
+download(model_repo='mmpose/RTMPose', model_name='RTMO-l')
+download(model_repo='mmpose/RTMPose', model_name='RTMPose-l-body8')
+download(model_repo='mmpose/RTMPose', model_name='RTMPose-m-face6')
+
+models = [
+ 'rtmpose | body', 'rtmo | body', 'rtmpose | face', 'dwpose | wholebody',
+ 'rtmw | wholebody'
+]
+cached_model = {model: None for model in models}
+
+
+def predict(input,
+ draw_heatmap=False,
+ model_type='body',
+ skeleton_style='mmpose',
+ input_type='image'):
+ """Visualize the demo images.
+
+ Using mmdet to detect the human.
+ """
+
+ if model_type == 'rtmpose | face':
+ if cached_model[model_type] is None:
+ cached_model[model_type] = MMPoseInferencer(pose2d='face')
+ model = cached_model[model_type]
+
+ elif model_type == 'dwpose | wholebody':
+ if cached_model[model_type] is None:
+ cached_model[model_type] = MMPoseInferencer(
+ pose2d=os.path.join(
+ project_path, 'rtmpose/wholebody_2d_keypoint/'
+ 'rtmpose-l_8xb32-270e_coco-wholebody-384x288.py'),
+ pose2d_weights='https://download.openmmlab.com/mmpose/v1/'
+ 'projects/rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-'
+ '384x288-2438fd99_20230728.pth')
+ model = cached_model[model_type]
+
+ elif model_type == 'rtmw | wholebody':
+ if cached_model[model_type] is None:
+ cached_model[model_type] = MMPoseInferencer(
+ pose2d=os.path.join(
+ project_path, 'rtmpose/wholebody_2d_keypoint/'
+ 'rtmw-l_8xb320-270e_cocktail14-384x288.py'),
+ pose2d_weights='https://download.openmmlab.com/mmpose/v1/'
+ 'projects/rtmw/rtmw-dw-x-l_simcc-cocktail14_270e-'
+ '384x288-20231122.pth')
+ model = cached_model[model_type]
+
+ elif model_type == 'rtmpose | body':
+ if cached_model[model_type] is None:
+ cached_model[model_type] = MMPoseInferencer(pose2d='rtmpose-l')
+ model = cached_model[model_type]
+
+ elif model_type == 'rtmo | body':
+ if cached_model[model_type] is None:
+ cached_model[model_type] = MMPoseInferencer(pose2d='rtmo')
+ model = cached_model[model_type]
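+        # heatmap drawing is force-disabled for the one-stage RTMO models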
+ draw_heatmap = False
+
+ else:
+        raise ValueError(f'unsupported model type: {model_type}')
+
+ if input_type == 'image':
+
+ result = next(
+ model(
+ input,
+ return_vis=True,
+ draw_heatmap=draw_heatmap,
+ skeleton_style=skeleton_style))
+ img = result['visualization'][0][..., ::-1]
+ return img
+
+ elif input_type == 'video':
+
+ for _ in model(
+ input,
+ vis_out_dir='test.mp4',
+ draw_heatmap=draw_heatmap,
+ skeleton_style=skeleton_style):
+ pass
+
+ return 'test.mp4'
+
+ return None
+
+
+news_list = [
+ '2023-8-1: We support [DWPose](https://arxiv.org/pdf/2307.15880.pdf).',
+ '2023-9-25: We release an alpha version of RTMW model, the technical '
+ 'report will be released soon.',
+ '2023-12-11: Update RTMW models, the online version is the RTMW-l with '
+ '70.1 mAP on COCO-Wholebody.',
+ '2023-12-13: We release an alpha version of RTMO (One-stage RTMPose) '
+ 'models.',
+]
+
+with gr.Blocks() as demo:
+
+ with gr.Tab('Upload-Image'):
+ input_img = gr.Image(type='numpy')
+ button = gr.Button('Inference', variant='primary')
+ hm = gr.Checkbox(label='draw-heatmap', info='Whether to draw heatmap')
+ model_type = gr.Dropdown(models, label='Model | Keypoint Type')
+
+ gr.Markdown('## News')
+ for news in news_list[::-1]:
+ gr.Markdown(news)
+
+ gr.Markdown('## Output')
+ out_image = gr.Image(type='numpy')
+ gr.Examples(['./tests/data/coco/000000000785.jpg'], input_img)
+ input_type = 'image'
+ button.click(
+ partial(predict, input_type=input_type),
+ [input_img, hm, model_type], out_image)
+
+ with gr.Tab('Webcam-Image'):
+ input_img = gr.Image(source='webcam', type='numpy')
+ button = gr.Button('Inference', variant='primary')
+ hm = gr.Checkbox(label='draw-heatmap', info='Whether to draw heatmap')
+ model_type = gr.Dropdown(models, label='Model | Keypoint Type')
+
+ gr.Markdown('## News')
+ for news in news_list[::-1]:
+ gr.Markdown(news)
+
+ gr.Markdown('## Output')
+ out_image = gr.Image(type='numpy')
+
+ input_type = 'image'
+ button.click(
+ partial(predict, input_type=input_type),
+ [input_img, hm, model_type], out_image)
+
+ with gr.Tab('Upload-Video'):
+ input_video = gr.Video(type='mp4')
+ button = gr.Button('Inference', variant='primary')
+ hm = gr.Checkbox(label='draw-heatmap', info='Whether to draw heatmap')
+ model_type = gr.Dropdown(models, label='Model | Keypoint type')
+
+ gr.Markdown('## News')
+ for news in news_list[::-1]:
+ gr.Markdown(news)
+
+ gr.Markdown('## Output')
+ out_video = gr.Video()
+
+ input_type = 'video'
+ button.click(
+ partial(predict, input_type=input_type),
+ [input_video, hm, model_type], out_video)
+
+ with gr.Tab('Webcam-Video'):
+ input_video = gr.Video(source='webcam', format='mp4')
+ button = gr.Button('Inference', variant='primary')
+ hm = gr.Checkbox(label='draw-heatmap', info='Whether to draw heatmap')
+ model_type = gr.Dropdown(models, label='Model | Keypoint Type')
+
+ gr.Markdown('## News')
+ for news in news_list[::-1]:
+ gr.Markdown(news)
+
+ gr.Markdown('## Output')
+ out_video = gr.Video()
+
+ input_type = 'video'
+ button.click(
+ partial(predict, input_type=input_type),
+ [input_video, hm, model_type], out_video)
+
+gr.close_all()
+demo.queue()
+demo.launch()
diff --git a/projects/rtmpose/examples/onnxruntime/main.py b/projects/rtmpose/examples/onnxruntime/main.py
index df4858c8dd..856ca1d3a1 100644
--- a/projects/rtmpose/examples/onnxruntime/main.py
+++ b/projects/rtmpose/examples/onnxruntime/main.py
@@ -178,11 +178,13 @@ def visualize(img: np.ndarray,
# draw keypoints and skeleton
for kpts, score in zip(keypoints, scores):
+ keypoints_num = len(score)
for kpt, color in zip(kpts, point_color):
cv2.circle(img, tuple(kpt.astype(np.int32)), 1, palette[color], 1,
cv2.LINE_AA)
for (u, v), color in zip(skeleton, link_color):
- if score[u] > thr and score[v] > thr:
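+            # skip skeleton links that reference keypoints the current
+            # model does not predict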
+ if u < keypoints_num and v < keypoints_num \
+ and score[u] > thr and score[v] > thr:
cv2.line(img, tuple(kpts[u].astype(np.int32)),
tuple(kpts[v].astype(np.int32)), palette[color], 2,
cv2.LINE_AA)
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-l_8xb1024-270e_cocktail14-256x192.py b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-l_8xb1024-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..90484bb52c
--- /dev/null
+++ b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-l_8xb1024-270e_cocktail14-256x192.py
@@ -0,0 +1,615 @@
+_base_ = ['mmpose::_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 1024
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=8192)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.,
+ widen_factor=1.,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-256x192-4d6dfc62_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[256, 512, 1024],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1024,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
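+            # down-weight the 68 face keypoints (COCO-WholeBody indices 23-90)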
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
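+# Each (src, dst) pair below maps a keypoint index in the source dataset to
+# its index in the 133-keypoint COCO-WholeBody convention. The tables are
+# consumed by the KeypointConverter transforms in the per-dataset pipelines.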
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-l_8xb320-270e_cocktail14-384x288.py b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-l_8xb320-270e_cocktail14-384x288.py
new file mode 100644
index 0000000000..3c3b3c7a26
--- /dev/null
+++ b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-l_8xb320-270e_cocktail14-384x288.py
@@ -0,0 +1,616 @@
+_base_ = ['mmpose::_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (288, 384)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 320
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+        # use cosine lr for the second half of training (epoch 135 to 270)
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
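+# (base_batch_size = 8 GPUs x 320 samples per GPU = 2560)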
+auto_scale_lr = dict(base_batch_size=2560)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(6., 6.93),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.,
+ widen_factor=1.,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-l_simcc-ucoco_dw-ucoco_270e-256x192-4d6dfc62_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[256, 512, 1024],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1024,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
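+            # indices 23-90 are the 68 face keypoints in COCO-WholeBody;
+            # their loss is scaled by mask_weight below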
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
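+# build one UBody2dDataset per scene, keeping every 10th frame (sample_interval=10)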
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
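+# shared augmentation for the face-only datasets; the extra padding and larger
+# scale range presumably enlarge face boxes toward whole-body-like crops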
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
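+# InterHand2.6M joints 21-41 map to the COCO-WholeBody left hand (91-111); the
+# right-hand pairs are derived by shifting source indices down and target
+# indices up by 21 (-> 112-132)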
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
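+# the four part-specific groups are merged into one CombinedDataset in the
+# train_dataloader below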
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-m_8xb1024-270e_cocktail14-256x192.py b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-m_8xb1024-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..0788a0b6a3
--- /dev/null
+++ b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-m_8xb1024-270e_cocktail14-256x192.py
@@ -0,0 +1,615 @@
+_base_ = ['mmpose::_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 1024
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.05),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=8192)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=0.67,
+ widen_factor=0.75,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/projects/'
+ 'rtmposev1/rtmpose-m_simcc-ucoco_dw-ucoco_270e-256x192-c8b76419_20230728.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[192, 384, 768],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=768,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py
new file mode 100644
index 0000000000..952df6a867
--- /dev/null
+++ b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb320-270e_cocktail14-384x288.py
@@ -0,0 +1,616 @@
+_base_ = ['mmpose::_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (288, 384)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 320
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+        # use cosine lr for the second half of training (epoch 135 to 270)
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=2560)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(6., 6.93),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.33,
+ widen_factor=1.25,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'wholebody_2d_keypoint/rtmpose/ubody/rtmpose-x_simcc-ucoco_pt-aic-coco_270e-384x288-f5b50679_20230822.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[320, 640, 1280],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1280,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py
new file mode 100644
index 0000000000..7a00e1171b
--- /dev/null
+++ b/projects/rtmpose/rtmpose/wholebody_2d_keypoint/rtmw-x_8xb704-270e_cocktail14-256x192.py
@@ -0,0 +1,615 @@
+_base_ = ['mmpose::_base_/default_runtime.py']
+
+# common setting
+num_keypoints = 133
+input_size = (192, 256)
+
+# runtime
+max_epochs = 270
+stage2_num_epochs = 10
+base_lr = 5e-4
+train_batch_size = 704
+val_batch_size = 32
+
+train_cfg = dict(max_epochs=max_epochs, val_interval=10)
+randomness = dict(seed=21)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.1),
+ clip_grad=dict(max_norm=35, norm_type=2),
+ paramwise_cfg=dict(
+ norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
+
+# learning rate
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.0e-5,
+ by_epoch=False,
+ begin=0,
+ end=1000),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=base_lr * 0.05,
+ begin=max_epochs // 2,
+ end=max_epochs,
+ T_max=max_epochs // 2,
+ by_epoch=True,
+ convert_to_iter_based=True),
+]
+
+# automatically scaling LR based on the actual training batch size
+auto_scale_lr = dict(base_batch_size=5632)
+
+# codec settings
+codec = dict(
+ type='SimCCLabel',
+ input_size=input_size,
+ sigma=(4.9, 5.66),
+ simcc_split_ratio=2.0,
+ normalize=False,
+ use_dark=False)
+
+# model settings
+model = dict(
+ type='TopdownPoseEstimator',
+ data_preprocessor=dict(
+ type='PoseDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True),
+ backbone=dict(
+ type='CSPNeXt',
+ arch='P5',
+ expand_ratio=0.5,
+ deepen_factor=1.33,
+ widen_factor=1.25,
+ channel_attention=True,
+ norm_cfg=dict(type='BN'),
+ act_cfg=dict(type='SiLU'),
+ init_cfg=dict(
+ type='Pretrained',
+ prefix='backbone.',
+ checkpoint='https://download.openmmlab.com/mmpose/v1/'
+ 'wholebody_2d_keypoint/rtmpose/ubody/rtmpose-x_simcc-ucoco_pt-aic-coco_270e-256x192-05f5bcb7_20230822.pth' # noqa
+ )),
+ neck=dict(
+ type='CSPNeXtPAFPN',
+ in_channels=[320, 640, 1280],
+ out_channels=None,
+ out_indices=(
+ 1,
+ 2,
+ ),
+ num_csp_blocks=2,
+ expand_ratio=0.5,
+ norm_cfg=dict(type='SyncBN'),
+ act_cfg=dict(type='SiLU', inplace=True)),
+ head=dict(
+ type='RTMWHead',
+ in_channels=1280,
+ out_channels=num_keypoints,
+ input_size=input_size,
+ in_featuremap_size=tuple([s // 32 for s in input_size]),
+ simcc_split_ratio=codec['simcc_split_ratio'],
+ final_layer_kernel_size=7,
+ gau_cfg=dict(
+ hidden_dims=256,
+ s=128,
+ expansion_factor=2,
+ dropout_rate=0.,
+ drop_path=0.,
+ act_fn='SiLU',
+ use_rel_bias=False,
+ pos_enc=False),
+ loss=dict(
+ type='KLDiscretLoss',
+ use_target_weight=True,
+ beta=1.,
+ label_softmax=True,
+ label_beta=10.,
+ mask=list(range(23, 91)),
+ mask_weight=0.5,
+ ),
+ decoder=codec),
+ test_cfg=dict(flip_test=True))
+
+# base dataset settings
+dataset_type = 'CocoWholeBodyDataset'
+data_mode = 'topdown'
+data_root = 'data/'
+
+backend_args = dict(backend='local')
+
+# pipelines
+train_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform', scale_factor=[0.5, 1.5], rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PhotometricDistortion'),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ dict(
+ type='CoarseDropout',
+ max_holes=1,
+ max_height=0.4,
+ max_width=0.4,
+ min_holes=1,
+ min_height=0.2,
+ min_width=0.2,
+ p=0.5),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+val_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(type='PackPoseInputs')
+]
+train_pipeline_stage2 = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(type='RandomFlip', direction='horizontal'),
+ dict(type='RandomHalfBody'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[0.5, 1.5],
+ rotate_factor=90),
+ dict(type='TopdownAffine', input_size=codec['input_size']),
+ dict(
+ type='Albumentation',
+ transforms=[
+ dict(type='Blur', p=0.1),
+ dict(type='MedianBlur', p=0.1),
+ ]),
+ dict(
+ type='GenerateTarget',
+ encoder=codec,
+ use_dataset_keypoint_weights=True),
+ dict(type='PackPoseInputs')
+]
+
+# mapping
+
+aic_coco133 = [(0, 6), (1, 8), (2, 10), (3, 5), (4, 7), (5, 9), (6, 12),
+ (7, 14), (8, 16), (9, 11), (10, 13), (11, 15)]
+
+crowdpose_coco133 = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (6, 11),
+ (7, 12), (8, 13), (9, 14), (10, 15), (11, 16)]
+
+mpii_coco133 = [
+ (0, 16),
+ (1, 14),
+ (2, 12),
+ (3, 11),
+ (4, 13),
+ (5, 15),
+ (10, 10),
+ (11, 8),
+ (12, 6),
+ (13, 5),
+ (14, 7),
+ (15, 9),
+]
+
+jhmdb_coco133 = [
+ (3, 6),
+ (4, 5),
+ (5, 12),
+ (6, 11),
+ (7, 8),
+ (8, 7),
+ (9, 14),
+ (10, 13),
+ (11, 10),
+ (12, 9),
+ (13, 16),
+ (14, 15),
+]
+
+halpe_coco133 = [(i, i)
+ for i in range(17)] + [(20, 17), (21, 20), (22, 18), (23, 21),
+ (24, 19),
+ (25, 22)] + [(i, i - 3)
+ for i in range(26, 136)]
+
+posetrack_coco133 = [
+ (0, 0),
+ (3, 3),
+ (4, 4),
+ (5, 5),
+ (6, 6),
+ (7, 7),
+ (8, 8),
+ (9, 9),
+ (10, 10),
+ (11, 11),
+ (12, 12),
+ (13, 13),
+ (14, 14),
+ (15, 15),
+ (16, 16),
+]
+
+humanart_coco133 = [(i, i) for i in range(17)] + [(17, 99), (18, 120),
+ (19, 17), (20, 20)]
+
+# train datasets
+dataset_coco = dict(
+ type=dataset_type,
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='coco/annotations/coco_wholebody_train_v1.0.json',
+ data_prefix=dict(img='detection/coco/train2017/'),
+ pipeline=[],
+)
+
+dataset_aic = dict(
+ type='AicDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='aic/annotations/aic_train.json',
+ data_prefix=dict(img='pose/ai_challenge/ai_challenger_keypoint'
+ '_train_20170902/keypoint_train_images_20170902/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=aic_coco133)
+ ],
+)
+
+dataset_crowdpose = dict(
+ type='CrowdPoseDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='crowdpose/annotations/mmpose_crowdpose_trainval.json',
+ data_prefix=dict(img='pose/CrowdPose/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=crowdpose_coco133)
+ ],
+)
+
+dataset_mpii = dict(
+ type='MpiiDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='mpii/annotations/mpii_train.json',
+ data_prefix=dict(img='pose/MPI/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mpii_coco133)
+ ],
+)
+
+dataset_jhmdb = dict(
+ type='JhmdbDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='jhmdb/annotations/Sub1_train.json',
+ data_prefix=dict(img='pose/JHMDB/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=jhmdb_coco133)
+ ],
+)
+
+dataset_halpe = dict(
+ type='HalpeDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='halpe/annotations/halpe_train_v1.json',
+ data_prefix=dict(img='pose/Halpe/hico_20160224_det/images/train2015'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=halpe_coco133)
+ ],
+)
+
+dataset_posetrack = dict(
+ type='PoseTrack18Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='posetrack18/annotations/posetrack18_train.json',
+ data_prefix=dict(img='pose/PoseChallenge2018/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=posetrack_coco133)
+ ],
+)
+
+dataset_humanart = dict(
+ type='HumanArt21Dataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='HumanArt/annotations/training_humanart.json',
+ filter_cfg=dict(scenes=['real_human']),
+ data_prefix=dict(img='pose/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=humanart_coco133)
+ ])
+
+ubody_scenes = [
+ 'Magic_show', 'Entertainment', 'ConductMusic', 'Online_class', 'TalkShow',
+ 'Speech', 'Fitness', 'Interview', 'Olympic', 'TVShow', 'Singing',
+ 'SignLanguage', 'Movie', 'LiveVlog', 'VideoConference'
+]
+
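+# Build one UBody sub-dataset per scene; sample_interval=10 subsamples the
+# annotated video frames (roughly one in ten) to reduce near-duplicate samples.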
+ubody_datasets = []
+for scene in ubody_scenes:
+ each = dict(
+ type='UBody2dDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file=f'Ubody/annotations/{scene}/train_annotations.json',
+ data_prefix=dict(img='pose/UBody/images/'),
+ pipeline=[],
+ sample_interval=10)
+ ubody_datasets.append(each)
+
+dataset_ubody = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/ubody2d.py'),
+ datasets=ubody_datasets,
+ pipeline=[],
+ test_mode=False,
+)
+
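+# Extra per-sample transforms for the face-only datasets: boxes are padded and
+# randomly enlarged (1.5x-2.0x, no shift or rotation) before the converted
+# keypoints enter the shared whole-body train_pipeline.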
+face_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale', padding=1.25),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
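+# WFLW annotates 98 facial landmarks; this table subsamples them (e.g. every
+# other jaw contour point) onto the 68-point face subset of COCO-WholeBody
+# (keypoints 23-90).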
+wflw_coco133 = [(i * 2, 23 + i)
+ for i in range(17)] + [(33 + i, 40 + i) for i in range(5)] + [
+ (42 + i, 45 + i) for i in range(5)
+ ] + [(51 + i, 50 + i)
+ for i in range(9)] + [(60, 59), (61, 60), (63, 61),
+ (64, 62), (65, 63), (67, 64),
+ (68, 65), (69, 66), (71, 67),
+ (72, 68), (73, 69),
+ (75, 70)] + [(76 + i, 71 + i)
+ for i in range(20)]
+dataset_wflw = dict(
+ type='WFLWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='wflw/annotations/face_landmarks_wflw_train.json',
+ data_prefix=dict(img='pose/WFLW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=wflw_coco133), *face_pipeline
+ ],
+)
+
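+# 300W already uses the standard 68-point face layout, so it maps one-to-one
+# onto the COCO-WholeBody face keypoints (indices 23-90).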
+mapping_300w_coco133 = [(i, 23 + i) for i in range(68)]
+dataset_300w = dict(
+ type='Face300WDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='300w/annotations/face_landmarks_300w_train.json',
+ data_prefix=dict(img='pose/300w/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=mapping_300w_coco133), *face_pipeline
+ ],
+)
+
+cofw_coco133 = [(0, 40), (2, 44), (4, 42), (1, 49), (3, 45), (6, 47), (8, 59),
+ (10, 62), (9, 68), (11, 65), (18, 54), (19, 58), (20, 53),
+ (21, 56), (22, 71), (23, 77), (24, 74), (25, 85), (26, 89),
+ (27, 80), (28, 31)]
+dataset_cofw = dict(
+ type='COFWDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='cofw/annotations/cofw_train.json',
+ data_prefix=dict(img='pose/COFW/images/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=cofw_coco133), *face_pipeline
+ ],
+)
+
+lapa_coco133 = [(i * 2, 23 + i) for i in range(17)] + [
+ (33 + i, 40 + i) for i in range(5)
+] + [(42 + i, 45 + i) for i in range(5)] + [
+ (51 + i, 50 + i) for i in range(4)
+] + [(58 + i, 54 + i) for i in range(5)] + [(66, 59), (67, 60), (69, 61),
+ (70, 62), (71, 63), (73, 64),
+ (75, 65), (76, 66), (78, 67),
+ (79, 68), (80, 69),
+ (82, 70)] + [(84 + i, 71 + i)
+ for i in range(20)]
+dataset_lapa = dict(
+ type='LapaDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='LaPa/annotations/lapa_trainval.json',
+ data_prefix=dict(img='pose/LaPa/'),
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=lapa_coco133), *face_pipeline
+ ],
+)
+
+dataset_wb = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_coco, dataset_halpe, dataset_ubody],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_body = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_aic,
+ dataset_crowdpose,
+ dataset_mpii,
+ dataset_jhmdb,
+ dataset_posetrack,
+ dataset_humanart,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+dataset_face = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[
+ dataset_wflw,
+ dataset_300w,
+ dataset_cofw,
+ dataset_lapa,
+ ],
+ pipeline=[],
+ test_mode=False,
+)
+
+hand_pipeline = [
+ dict(type='LoadImage', backend_args=backend_args),
+ dict(type='GetBBoxCenterScale'),
+ dict(
+ type='RandomBBoxTransform',
+ shift_factor=0.,
+ scale_factor=[1.5, 2.0],
+ rotate_factor=0),
+]
+
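+# InterHand2.6M indices 21-41 (left hand) map to COCO-WholeBody left-hand slots
+# 91-111; the right-hand table is derived by shifting source indices down by 21
+# and target indices up by 21 (right-hand slots 112-132).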
+interhand_left = [(21, 95), (22, 94), (23, 93), (24, 92), (25, 99), (26, 98),
+ (27, 97), (28, 96), (29, 103), (30, 102), (31, 101),
+ (32, 100), (33, 107), (34, 106), (35, 105), (36, 104),
+ (37, 111), (38, 110), (39, 109), (40, 108), (41, 91)]
+interhand_right = [(i - 21, j + 21) for i, j in interhand_left]
+interhand_coco133 = interhand_right + interhand_left
+
+dataset_interhand2d = dict(
+ type='InterHand2DDoubleDataset',
+ data_root=data_root,
+ data_mode=data_mode,
+ ann_file='interhand26m/annotations/all/InterHand2.6M_train_data.json',
+ camera_param_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_camera.json',
+ joint_file='interhand26m/annotations/all/'
+ 'InterHand2.6M_train_joint_3d.json',
+ data_prefix=dict(img='interhand2.6m/images/train/'),
+ sample_interval=10,
+ pipeline=[
+ dict(
+ type='KeypointConverter',
+ num_keypoints=num_keypoints,
+ mapping=interhand_coco133,
+ ), *hand_pipeline
+ ],
+)
+
+dataset_hand = dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=[dataset_interhand2d],
+ pipeline=[],
+ test_mode=False,
+)
+
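+# All four branches share the COCO-WholeBody metainfo, so they can be merged
+# once more by the top-level CombinedDataset in train_dataloader, which applies
+# a single train_pipeline to every sample.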
+train_datasets = [dataset_wb, dataset_body, dataset_face, dataset_hand]
+
+# data loaders
+train_dataloader = dict(
+ batch_size=train_batch_size,
+ num_workers=4,
+ pin_memory=False,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type='CombinedDataset',
+ metainfo=dict(from_file='configs/_base_/datasets/coco_wholebody.py'),
+ datasets=train_datasets,
+ pipeline=train_pipeline,
+ test_mode=False,
+ ))
+
+val_dataloader = dict(
+ batch_size=val_batch_size,
+ num_workers=4,
+ persistent_workers=True,
+ drop_last=False,
+ sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
+ dataset=dict(
+ type='CocoWholeBodyDataset',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json',
+ data_prefix=dict(img='data/detection/coco/val2017/'),
+ pipeline=val_pipeline,
+ bbox_file='data/coco/person_detection_results/'
+ 'COCO_val2017_detections_AP_H_56_person.json',
+ test_mode=True))
+
+test_dataloader = val_dataloader
+
+# hooks
+default_hooks = dict(
+ checkpoint=dict(
+ save_best='coco-wholebody/AP', rule='greater', max_keep_ckpts=1))
+
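+# EMAHook keeps an exponential moving average of the model weights;
+# PipelineSwitchHook swaps in train_pipeline_stage2 (typically with lighter
+# augmentation) for the final stage2_num_epochs epochs.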
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ ema_type='ExpMomentumEMA',
+ momentum=0.0002,
+ update_buffers=True,
+ priority=49),
+ dict(
+ type='mmdet.PipelineSwitchHook',
+ switch_epoch=max_epochs - stage2_num_epochs,
+ switch_pipeline=train_pipeline_stage2)
+]
+
+# evaluators
+val_evaluator = dict(
+ type='CocoWholeBodyMetric',
+ ann_file='data/coco/annotations/coco_wholebody_val_v1.0.json')
+test_evaluator = val_evaluator
diff --git a/tests/data/exlpose/imgs_0212_hwangridan_vid000020_exp1200_dark_000052__gain_3.40_exposure_417.png b/tests/data/exlpose/imgs_0212_hwangridan_vid000020_exp1200_dark_000052__gain_3.40_exposure_417.png
new file mode 100644
index 0000000000..af88084c86
Binary files /dev/null and b/tests/data/exlpose/imgs_0212_hwangridan_vid000020_exp1200_dark_000052__gain_3.40_exposure_417.png differ
diff --git a/tests/data/exlpose/imgs_0212_hwangridan_vid000020_exp400_dark_000052__gain_3.40_exposure_1250.png b/tests/data/exlpose/imgs_0212_hwangridan_vid000020_exp400_dark_000052__gain_3.40_exposure_1250.png
new file mode 100644
index 0000000000..1edaba86ab
Binary files /dev/null and b/tests/data/exlpose/imgs_0212_hwangridan_vid000020_exp400_dark_000052__gain_3.40_exposure_1250.png differ
diff --git a/tests/data/exlpose/test_exlpose.json b/tests/data/exlpose/test_exlpose.json
new file mode 100644
index 0000000000..c6f13da538
--- /dev/null
+++ b/tests/data/exlpose/test_exlpose.json
@@ -0,0 +1,444 @@
+{
+ "info": {
+ "description": "CGLab of POSTECH",
+ "year": 2022,
+ "date_created": "2022/02"
+ },
+ "categories": [
+ {
+ "supercategory": "person",
+ "id": 1,
+ "name": "person",
+ "keypoints": [
+ "left_shoulder",
+ "right_shoulder",
+ "left_elbow",
+ "right_elbow",
+ "left_wrist",
+ "right_wrist",
+ "left_hip",
+ "right_hip",
+ "left_knee",
+ "right_knee",
+ "left_ankle",
+ "right_ankle",
+ "head",
+ "neck"
+ ],
+ "skeleton": [
+ [
+ 12,
+ 13
+ ],
+ [
+ 13,
+ 0
+ ],
+ [
+ 13,
+ 1
+ ],
+ [
+ 0,
+ 2
+ ],
+ [
+ 2,
+ 4
+ ],
+ [
+ 1,
+ 3
+ ],
+ [
+ 3,
+ 5
+ ],
+ [
+ 13,
+ 7
+ ],
+ [
+ 13,
+ 6
+ ],
+ [
+ 7,
+ 9
+ ],
+ [
+ 9,
+ 11
+ ],
+ [
+ 6,
+ 8
+ ],
+ [
+ 8,
+ 10
+ ]
+ ]
+ }
+ ],
+ "images": [
+ {
+ "file_name": "imgs_0212_hwangridan_vid000020_exp400_dark_000052__gain_3.40_exposure_1250.png",
+ "id": 0,
+ "height": 1198,
+ "width": 1919,
+ "crowdIndex": 0
+ },
+ {
+ "file_name": "imgs_0212_hwangridan_vid000020_exp1200_dark_000052__gain_3.40_exposure_417.png",
+ "id": 1,
+ "height": 1198,
+ "width": 1919,
+ "crowdIndex": 0
+ }
+ ],
+ "annotations": [
+ {
+ "num_keypoints": 14,
+ "iscrowd": 0,
+ "keypoints": [
+ 1399.73,
+ 332.68,
+ 2,
+ 1240.1,
+ 316.9,
+ 2,
+ 1438.03,
+ 460.36,
+ 2,
+ 1195.44,
+ 428.44,
+ 2,
+ 1438.0,
+ 576.8,
+ 2,
+ 1169.9,
+ 517.82,
+ 2,
+ 1329.5,
+ 581.66,
+ 2,
+ 1233.74,
+ 568.89,
+ 2,
+ 1297.58,
+ 741.26,
+ 2,
+ 1240.13,
+ 734.88,
+ 2,
+ 1303.97,
+ 894.48,
+ 2,
+ 1272.05,
+ 894.48,
+ 2,
+ 1318.7,
+ 190.7,
+ 2,
+ 1323.12,
+ 294.38,
+ 2
+ ],
+ "image_id": 0,
+ "bbox": [
+ 1141.75,
+ 169.89,
+ 329.81999999999994,
+ 816.16
+ ],
+ "category_id": 1,
+ "id": 0
+ },
+ {
+ "num_keypoints": 14,
+ "iscrowd": 0,
+ "keypoints": [
+ 1198.02,
+ 352.83,
+ 2,
+ 1077.6,
+ 352.83,
+ 2,
+ 1214.45,
+ 456.83,
+ 2,
+ 1046.3,
+ 434.0,
+ 2,
+ 1196.6,
+ 535.0,
+ 2,
+ 1018.5,
+ 463.0,
+ 2,
+ 1159.8,
+ 567.3,
+ 2,
+ 1086.1,
+ 560.3,
+ 2,
+ 1115.92,
+ 703.14,
+ 2,
+ 1077.6,
+ 697.67,
+ 2,
+ 1141.7,
+ 832.6,
+ 2,
+ 1080.6,
+ 859.9,
+ 2,
+ 1137.81,
+ 232.41,
+ 2,
+ 1137.81,
+ 325.46,
+ 2
+ ],
+ "image_id": 0,
+ "bbox": [
+ 1029.62,
+ 218.73,
+ 177.08000000000015,
+ 699.62
+ ],
+ "category_id": 1,
+ "id": 1
+ },
+ {
+ "num_keypoints": 14,
+ "iscrowd": 0,
+ "keypoints": [
+ 743.06,
+ 351.98,
+ 2,
+ 790.68,
+ 357.27,
+ 2,
+ 730.72,
+ 383.72,
+ 2,
+ 794.21,
+ 394.31,
+ 2,
+ 734.24,
+ 415.47,
+ 2,
+ 787.16,
+ 415.47,
+ 2,
+ 746.59,
+ 419.0,
+ 2,
+ 778.34,
+ 419.0,
+ 2,
+ 748.35,
+ 463.09,
+ 2,
+ 772.4,
+ 465.8,
+ 2,
+ 744.9,
+ 504.3,
+ 2,
+ 761.3,
+ 504.9,
+ 2,
+ 773.7,
+ 312.0,
+ 2,
+ 771.28,
+ 337.87,
+ 2
+ ],
+ "image_id": 1,
+ "bbox": [
+ 732.07,
+ 307.0,
+ 72.13,
+ 224.76
+ ],
+ "category_id": 1,
+ "id": 2
+ },
+ {
+ "num_keypoints": 12,
+ "iscrowd": 0,
+ "keypoints": [
+ 674.3,
+ 336.65,
+ 2,
+ 715.1,
+ 338.5,
+ 2,
+ 656.88,
+ 362.79,
+ 2,
+ 733.1,
+ 368.7,
+ 2,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 680.84,
+ 419.42,
+ 2,
+ 702.62,
+ 421.6,
+ 2,
+ 689.55,
+ 476.06,
+ 2,
+ 702.62,
+ 478.23,
+ 2,
+ 687.37,
+ 530.51,
+ 2,
+ 693.91,
+ 534.87,
+ 2,
+ 709.8,
+ 297.9,
+ 2,
+ 700.44,
+ 321.4,
+ 2
+ ],
+ "image_id": 1,
+ "bbox": [
+ 650.96,
+ 276.75,
+ 117.55999999999995,
+ 277.80999999999995
+ ],
+ "category_id": 1,
+ "id": 3
+ },
+ {
+ "num_keypoints": 14,
+ "iscrowd": 0,
+ "keypoints": [
+ 620.44,
+ 336.27,
+ 2,
+ 641.2,
+ 337.86,
+ 2,
+ 614.85,
+ 350.64,
+ 2,
+ 646.79,
+ 354.63,
+ 2,
+ 612.46,
+ 361.82,
+ 2,
+ 646.79,
+ 365.81,
+ 2,
+ 622.84,
+ 364.21,
+ 2,
+ 638.01,
+ 366.61,
+ 2,
+ 623.64,
+ 380.98,
+ 2,
+ 636.45,
+ 382.97,
+ 2,
+ 626.03,
+ 400.93,
+ 2,
+ 638.2,
+ 399.9,
+ 2,
+ 632.42,
+ 314.71,
+ 2,
+ 631.62,
+ 329.88,
+ 2
+ ],
+ "image_id": 1,
+ "bbox": [
+ 605.38,
+ 311.92,
+ 42.690000000000055,
+ 101.19
+ ],
+ "category_id": 1,
+ "id": 4
+ },
+ {
+ "num_keypoints": 14,
+ "iscrowd": 0,
+ "keypoints": [
+ 806.08,
+ 340.99,
+ 2,
+ 856.65,
+ 342.8,
+ 2,
+ 795.25,
+ 382.53,
+ 2,
+ 867.49,
+ 386.14,
+ 2,
+ 793.44,
+ 411.42,
+ 2,
+ 869.29,
+ 418.64,
+ 2,
+ 815.5,
+ 417.8,
+ 2,
+ 847.62,
+ 416.84,
+ 2,
+ 812.8,
+ 471.1,
+ 2,
+ 840.9,
+ 473.2,
+ 2,
+ 809.7,
+ 512.55,
+ 2,
+ 831.9,
+ 513.5,
+ 2,
+ 833.3,
+ 306.9,
+ 2,
+ 831.37,
+ 324.74,
+ 2
+ ],
+ "image_id": 1,
+ "bbox": [
+ 794.29,
+ 302.16,
+ 82.19000000000005,
+ 230.16000000000003
+ ],
+ "category_id": 1,
+ "id": 5
+ }
+ ]
+}
\ No newline at end of file
diff --git a/tests/data/h3wb/h3wb_train_sub.npz b/tests/data/h3wb/h3wb_train_sub.npz
new file mode 100644
index 0000000000..dfbaaada03
Binary files /dev/null and b/tests/data/h3wb/h3wb_train_sub.npz differ
diff --git a/tests/test_codecs/test_simcc_label.py b/tests/test_codecs/test_simcc_label.py
index b4c242ef4e..25ae243d9f 100644
--- a/tests/test_codecs/test_simcc_label.py
+++ b/tests/test_codecs/test_simcc_label.py
@@ -100,6 +100,23 @@ def test_decode(self):
f'Failed case: "{name}"')
self.assertEqual(scores.shape, (1, 17), f'Failed case: "{name}"')
+ # test decode_visibility
+ cfg = cfg.copy()
+ cfg['decode_visibility'] = True
+ codec = KEYPOINT_CODECS.build(cfg)
+
+ simcc_x = np.random.rand(1, 17, int(
+ 192 * codec.simcc_split_ratio)) * 10
+ simcc_y = np.random.rand(1, 17, int(
+ 256 * codec.simcc_split_ratio)) * 10
+ keypoints, scores = codec.decode(simcc_x, simcc_y)
+
+ self.assertEqual(len(scores), 2)
+ self.assertEqual(scores[0].shape, (1, 17), f'Failed case: "{name}"')
+ self.assertEqual(scores[1].shape, (1, 17), f'Failed case: "{name}"')
+ self.assertGreaterEqual(scores[1].min(), 0.0)
+ self.assertLessEqual(scores[1].max(), 1.0)
+
def test_cicular_verification(self):
keypoints = self.data['keypoints']
keypoints_visible = self.data['keypoints_visible']
diff --git a/tests/test_datasets/test_datasets/test_body_datasets/test_exlpose_dataset.py b/tests/test_datasets/test_datasets/test_body_datasets/test_exlpose_dataset.py
new file mode 100644
index 0000000000..d8ac49f6f0
--- /dev/null
+++ b/tests/test_datasets/test_datasets/test_body_datasets/test_exlpose_dataset.py
@@ -0,0 +1,147 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from unittest import TestCase
+
+import numpy as np
+
+from mmpose.datasets.datasets.body import ExlposeDataset
+
+
+class TestExlposeDataset(TestCase):
+
+ def build_exlpose_dataset(self, **kwargs):
+
+ cfg = dict(
+ ann_file='test_exlpose.json',
+ bbox_file=None,
+ data_mode='topdown',
+ data_root='tests/data/exlpose',
+ pipeline=[],
+ test_mode=False)
+
+ cfg.update(kwargs)
+ return ExlposeDataset(**cfg)
+
+ def check_data_info_keys(self,
+ data_info: dict,
+ data_mode: str = 'topdown'):
+ if data_mode == 'topdown':
+ expected_keys = dict(
+ img_id=int,
+ img_path=str,
+ bbox=np.ndarray,
+ bbox_score=np.ndarray,
+ keypoints=np.ndarray,
+ keypoints_visible=np.ndarray,
+ id=int)
+ elif data_mode == 'bottomup':
+ expected_keys = dict(
+ img_id=int,
+ img_path=str,
+ bbox=np.ndarray,
+ bbox_score=np.ndarray,
+ keypoints=np.ndarray,
+ keypoints_visible=np.ndarray,
+ invalid_segs=list,
+ area=(list, np.ndarray),
+ id=list)
+ else:
+ raise ValueError(f'Invalid data_mode {data_mode}')
+
+ for key, type_ in expected_keys.items():
+ self.assertIn(key, data_info)
+ self.assertIsInstance(data_info[key], type_, key)
+
+ def check_metainfo_keys(self, metainfo: dict):
+ expected_keys = dict(
+ dataset_name=str,
+ num_keypoints=int,
+ keypoint_id2name=dict,
+ keypoint_name2id=dict,
+ upper_body_ids=list,
+ lower_body_ids=list,
+ flip_indices=list,
+ flip_pairs=list,
+ keypoint_colors=np.ndarray,
+ num_skeleton_links=int,
+ skeleton_links=list,
+ skeleton_link_colors=np.ndarray,
+ dataset_keypoint_weights=np.ndarray)
+
+ for key, type_ in expected_keys.items():
+ self.assertIn(key, metainfo)
+ self.assertIsInstance(metainfo[key], type_, key)
+
+ def test_metainfo(self):
+ dataset = self.build_exlpose_dataset()
+ self.check_metainfo_keys(dataset.metainfo)
+ # test dataset_name
+ self.assertEqual(dataset.metainfo['dataset_name'], 'exlpose')
+
+ # test number of keypoints
+ num_keypoints = 14
+ self.assertEqual(dataset.metainfo['num_keypoints'], num_keypoints)
+ self.assertEqual(
+ len(dataset.metainfo['keypoint_colors']), num_keypoints)
+ self.assertEqual(
+ len(dataset.metainfo['dataset_keypoint_weights']), num_keypoints)
+ # note that len(sigmas) may be zero if dataset.metainfo['sigmas'] = []
+ self.assertEqual(len(dataset.metainfo['sigmas']), num_keypoints)
+
+ # test some extra metainfo
+ self.assertEqual(
+ len(dataset.metainfo['skeleton_links']),
+ len(dataset.metainfo['skeleton_link_colors']))
+
+ def test_topdown(self):
+ # test topdown training
+ dataset = self.build_exlpose_dataset(data_mode='topdown')
+ self.assertEqual(dataset.bbox_file, None)
+ self.assertEqual(len(dataset), 6)
+ self.check_data_info_keys(dataset[0], data_mode='topdown')
+
+ # test topdown testing
+ dataset = self.build_exlpose_dataset(
+ data_mode='topdown', test_mode=True)
+ self.assertEqual(dataset.bbox_file, None)
+ self.assertEqual(len(dataset), 6)
+ self.check_data_info_keys(dataset[0], data_mode='topdown')
+
+ def test_bottomup(self):
+ # test bottomup training
+ dataset = self.build_exlpose_dataset(data_mode='bottomup')
+ self.assertEqual(len(dataset), 2)
+ self.check_data_info_keys(dataset[0], data_mode='bottomup')
+
+ # test bottomup testing
+ dataset = self.build_exlpose_dataset(
+ data_mode='bottomup', test_mode=True)
+ self.assertEqual(len(dataset), 2)
+ self.check_data_info_keys(dataset[0], data_mode='bottomup')
+
+ def test_exceptions_and_warnings(self):
+
+ with self.assertRaisesRegex(ValueError, 'got invalid data_mode'):
+ _ = self.build_exlpose_dataset(data_mode='invalid')
+
+ with self.assertRaisesRegex(
+ ValueError,
+ '"bbox_file" is only supported when `test_mode==True`'):
+ _ = self.build_exlpose_dataset(
+ data_mode='topdown',
+ test_mode=False,
+ bbox_file='temp_bbox_file.json')
+
+ with self.assertRaisesRegex(
+ ValueError, '"bbox_file" is only supported in topdown mode'):
+ _ = self.build_exlpose_dataset(
+ data_mode='bottomup',
+ test_mode=True,
+ bbox_file='temp_bbox_file.json')
+
+ with self.assertRaisesRegex(
+ ValueError,
+ '"bbox_score_thr" is only supported in topdown mode'):
+ _ = self.build_exlpose_dataset(
+ data_mode='bottomup',
+ test_mode=True,
+ filter_cfg=dict(bbox_score_thr=0.3))
diff --git a/tests/test_datasets/test_datasets/test_wholebody_datasets/test_h3wb_dataset.py b/tests/test_datasets/test_datasets/test_wholebody_datasets/test_h3wb_dataset.py
new file mode 100644
index 0000000000..ffcb8d78e0
--- /dev/null
+++ b/tests/test_datasets/test_datasets/test_wholebody_datasets/test_h3wb_dataset.py
@@ -0,0 +1,75 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from unittest import TestCase
+
+import numpy as np
+
+from mmpose.datasets.datasets.wholebody3d import H36MWholeBodyDataset
+
+
+class TestH36MWholeBodyDataset(TestCase):
+
+ def build_h3wb_dataset(self, **kwargs):
+
+ cfg = dict(
+ ann_file='h3wb_train_sub.npz',
+ data_mode='topdown',
+ data_root='tests/data/h3wb',
+ pipeline=[])
+
+ cfg.update(kwargs)
+ return H36MWholeBodyDataset(**cfg)
+
+ def check_data_info_keys(self, data_info: dict):
+ expected_keys = dict(
+ img_paths=list,
+ keypoints=np.ndarray,
+ keypoints_3d=np.ndarray,
+ scale=np.ndarray,
+ center=np.ndarray,
+ id=int)
+
+ for key, type_ in expected_keys.items():
+ self.assertIn(key, data_info)
+ self.assertIsInstance(data_info[key], type_, key)
+
+ def test_metainfo(self):
+ dataset = self.build_h3wb_dataset()
+ # test dataset_name
+ self.assertEqual(dataset.metainfo['dataset_name'], 'h3wb')
+
+ # test number of keypoints
+ num_keypoints = 133
+ self.assertEqual(dataset.metainfo['num_keypoints'], num_keypoints)
+ self.assertEqual(
+ len(dataset.metainfo['keypoint_colors']), num_keypoints)
+ self.assertEqual(
+ len(dataset.metainfo['dataset_keypoint_weights']), num_keypoints)
+
+ # test some extra metainfo
+ self.assertEqual(
+ len(dataset.metainfo['skeleton_links']),
+ len(dataset.metainfo['skeleton_link_colors']))
+
+ def test_topdown(self):
+ # test topdown training
+ dataset = self.build_h3wb_dataset(data_mode='topdown')
+ dataset.full_init()
+ self.assertEqual(len(dataset), 3)
+ self.check_data_info_keys(dataset[0])
+
+ # test topdown testing
+ dataset = self.build_h3wb_dataset(data_mode='topdown', test_mode=True)
+ dataset.full_init()
+ self.assertEqual(len(dataset), 1)
+ self.check_data_info_keys(dataset[0])
+
+ # test topdown training with sequence config
+ dataset = self.build_h3wb_dataset(
+ data_mode='topdown',
+ seq_len=1,
+ seq_step=1,
+ causal=False,
+ pad_video_seq=True)
+ dataset.full_init()
+ self.assertEqual(len(dataset), 3)
+ self.check_data_info_keys(dataset[0])
diff --git a/tests/test_datasets/test_transforms/test_formatting.py b/tests/test_datasets/test_transforms/test_formatting.py
index 95fadb55b2..0db4fa250f 100644
--- a/tests/test_datasets/test_transforms/test_formatting.py
+++ b/tests/test_datasets/test_transforms/test_formatting.py
@@ -105,4 +105,5 @@ def test_transform(self):
def test_repr(self):
transform = PackPoseInputs(meta_keys=self.meta_keys)
self.assertEqual(
- repr(transform), f'PackPoseInputs(meta_keys={self.meta_keys})')
+ repr(transform), f'PackPoseInputs(meta_keys={self.meta_keys}, '
+ f'pack_transformed={transform.pack_transformed})')
diff --git a/tests/test_engine/test_hooks/test_mode_switch_hooks.py b/tests/test_engine/test_hooks/test_mode_switch_hooks.py
index fbf10bd3ef..4d149d0933 100644
--- a/tests/test_engine/test_hooks/test_mode_switch_hooks.py
+++ b/tests/test_engine/test_hooks/test_mode_switch_hooks.py
@@ -7,7 +7,7 @@
from mmengine.runner import Runner
from torch.utils.data import Dataset
-from mmpose.engine.hooks import YOLOXPoseModeSwitchHook
+from mmpose.engine.hooks import RTMOModeSwitchHook, YOLOXPoseModeSwitchHook
from mmpose.utils import register_all_modules
@@ -65,3 +65,36 @@ def test(self):
self.assertTrue(runner.model.bbox_head.use_aux_loss)
self.assertEqual(runner.train_loop.dataloader.dataset.pipeline,
pipeline2)
+
+
+class TestRTMOModeSwitchHook(TestCase):
+
+ def test(self):
+
+ runner = Mock()
+ runner.model = Mock()
+ runner.model.head = Mock()
+ runner.model.head.loss = Mock()
+
+ runner.model.head.attr1 = False
+ runner.model.head.loss.attr2 = 1.0
+
+ hook = RTMOModeSwitchHook(epoch_attributes={
+ 0: {
+ 'attr1': True
+ },
+ 10: {
+ 'loss.attr2': 0.5
+ }
+ })
+
+        # attributes should be updated at their scheduled epochs
+ runner.epoch = 0
+ hook.before_train_epoch(runner)
+ self.assertTrue(runner.model.head.attr1)
+ self.assertEqual(runner.model.head.loss.attr2, 1.0)
+
+ runner.epoch = 10
+ hook.before_train_epoch(runner)
+ self.assertTrue(runner.model.head.attr1)
+ self.assertEqual(runner.model.head.loss.attr2, 0.5)
diff --git a/tests/test_engine/test_schedulers/test_lr_scheduler.py b/tests/test_engine/test_schedulers/test_lr_scheduler.py
new file mode 100644
index 0000000000..e790df1d71
--- /dev/null
+++ b/tests/test_engine/test_schedulers/test_lr_scheduler.py
@@ -0,0 +1,88 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from unittest import TestCase
+
+import torch
+import torch.nn.functional as F
+import torch.optim as optim
+from mmengine.optim.scheduler import _ParamScheduler
+from mmengine.testing import assert_allclose
+
+from mmpose.engine.schedulers import ConstantLR
+
+
+class ToyModel(torch.nn.Module):
+
+ def __init__(self):
+ super().__init__()
+ self.conv1 = torch.nn.Conv2d(1, 1, 1)
+ self.conv2 = torch.nn.Conv2d(1, 1, 1)
+
+ def forward(self, x):
+ return self.conv2(F.relu(self.conv1(x)))
+
+
+class TestLRScheduler(TestCase):
+
+ def setUp(self):
+ """Setup the model and optimizer which are used in every test method.
+
+ TestCase calls functions in this order: setUp() -> testMethod() ->
+ tearDown() -> cleanUp()
+ """
+ self.model = ToyModel()
+ lr = 0.05
+ self.layer2_mult = 10
+ self.optimizer = optim.SGD([{
+ 'params': self.model.conv1.parameters()
+ }, {
+ 'params': self.model.conv2.parameters(),
+ 'lr': lr * self.layer2_mult,
+ }],
+ lr=lr,
+ momentum=0.01,
+ weight_decay=5e-4)
+
+ def _test_scheduler_value(self,
+ schedulers,
+ targets,
+ epochs=10,
+ param_name='lr',
+ step_kwargs=None):
+ if isinstance(schedulers, _ParamScheduler):
+ schedulers = [schedulers]
+ if step_kwargs is None:
+ step_kwarg = [{} for _ in range(len(schedulers))]
+ step_kwargs = [step_kwarg for _ in range(epochs)]
+ else: # step_kwargs is not None
+ assert len(step_kwargs) == epochs
+ assert len(step_kwargs[0]) == len(schedulers)
+ for epoch in range(epochs):
+ for param_group, target in zip(self.optimizer.param_groups,
+ targets):
+ assert_allclose(
+ target[epoch],
+ param_group[param_name],
+ msg='{} is wrong in epoch {}: expected {}, got {}'.format(
+ param_name, epoch, target[epoch],
+ param_group[param_name]),
+ atol=1e-5,
+ rtol=0)
+ [
+ scheduler.step(**step_kwargs[epoch][i])
+ for i, scheduler in enumerate(schedulers)
+ ]
+
+ def test_constant_scheduler(self):
+
+ # lr = 0.025 if epoch < 5
+        # lr = 0.05 if 5 <= epoch
+ epochs = 10
+ single_targets = [0.025] * 4 + [0.05] * 6
+ targets = [
+ single_targets, [x * self.layer2_mult for x in single_targets]
+ ]
+ scheduler = ConstantLR(self.optimizer, factor=1.0 / 2, end=5)
+ self._test_scheduler_value(scheduler, targets, epochs)
+
+ # remove factor range restriction
+ _ = ConstantLR(self.optimizer, factor=99, end=100)
diff --git a/tests/test_models/test_losses/test_classification_losses.py b/tests/test_models/test_losses/test_classification_losses.py
index fd7d3fd898..f8f7fe67f5 100644
--- a/tests/test_models/test_losses/test_classification_losses.py
+++ b/tests/test_models/test_losses/test_classification_losses.py
@@ -3,7 +3,7 @@
import torch
-from mmpose.models.losses.classification_loss import InfoNCELoss
+from mmpose.models.losses.classification_loss import InfoNCELoss, VariFocalLoss
class TestInfoNCELoss(TestCase):
@@ -20,3 +20,45 @@ def test_loss(self):
# check if the value of temperature is positive
with self.assertRaises(AssertionError):
loss = InfoNCELoss(temperature=0.)
+
+
+class TestVariFocalLoss(TestCase):
+
+ def test_forward_no_target_weight_mean_reduction(self):
+ # Test the forward method with no target weight and mean reduction
+ output = torch.tensor([[0.3, -0.2], [-0.1, 0.4]], dtype=torch.float32)
+ target = torch.tensor([[1.0, 0.0], [0.0, 1.0]], dtype=torch.float32)
+
+ loss_func = VariFocalLoss(use_target_weight=False, reduction='mean')
+ loss = loss_func(output, target)
+
+        # Expected loss value computed with a reference implementation
+ expected_loss = 0.31683
+ self.assertAlmostEqual(loss.item(), expected_loss, places=5)
+
+ def test_forward_with_target_weight_sum_reduction(self):
+ # Test the forward method with target weight and sum reduction
+ output = torch.tensor([[0.3, -0.2], [-0.1, 0.4]], dtype=torch.float32)
+ target = torch.tensor([[1.0, 0.0], [0.0, 1.0]], dtype=torch.float32)
+ target_weight = torch.tensor([1.0, 0.5], dtype=torch.float32)
+
+ loss_func = VariFocalLoss(use_target_weight=True, reduction='sum')
+ loss = loss_func(output, target, target_weight)
+
+        # Expected loss value computed with a reference implementation
+ expected_loss = 0.956299
+ self.assertAlmostEqual(loss.item(), expected_loss, places=5)
+
+ def test_inf_nan_handling(self):
+ # Test handling of inf and nan values
+ output = torch.tensor([[float('inf'), float('-inf')],
+ [float('nan'), 0.4]],
+ dtype=torch.float32)
+ target = torch.tensor([[1.0, 0.0], [0.0, 1.0]], dtype=torch.float32)
+
+ loss_func = VariFocalLoss(use_target_weight=False, reduction='mean')
+ loss = loss_func(output, target)
+
+ # Check if loss is valid (not nan or inf)
+ self.assertFalse(torch.isnan(loss).item())
+ self.assertFalse(torch.isinf(loss).item())
diff --git a/tests/test_models/test_losses/test_heatmap_losses.py b/tests/test_models/test_losses/test_heatmap_losses.py
index 00da170389..cb8b38877c 100644
--- a/tests/test_models/test_losses/test_heatmap_losses.py
+++ b/tests/test_models/test_losses/test_heatmap_losses.py
@@ -5,7 +5,7 @@
from mmpose.models.losses.heatmap_loss import (AdaptiveWingLoss,
FocalHeatmapLoss,
- KeypointMSELoss)
+ KeypointMSELoss, MLECCLoss)
class TestAdaptiveWingLoss(TestCase):
@@ -117,3 +117,39 @@ def test_loss(self):
loss(fake_pred, fake_label, fake_weight, fake_mask),
torch.tensor(0.),
atol=1e-4))
+
+
+class TestMLECCLoss(TestCase):
+
+ def setUp(self):
+ self.outputs = (torch.rand(10, 2, 5), torch.rand(10, 2, 5))
+ self.targets = (torch.rand(10, 2, 5), torch.rand(10, 2, 5))
+
+ def test_mean_reduction_log_mode(self):
+ loss_func = MLECCLoss(reduction='mean', mode='log')
+ loss = loss_func(self.outputs, self.targets)
+ self.assertIsInstance(loss, torch.Tensor)
+
+ def test_sum_reduction_linear_mode(self):
+ loss_func = MLECCLoss(reduction='sum', mode='linear')
+ loss = loss_func(self.outputs, self.targets)
+ self.assertIsInstance(loss, torch.Tensor)
+
+ def test_none_reduction_square_mode(self):
+ loss_func = MLECCLoss(reduction='none', mode='square')
+ loss = loss_func(self.outputs, self.targets)
+ self.assertIsInstance(loss, torch.Tensor)
+
+ def test_target_weight(self):
+ target_weight = torch.rand(10) # Random weights
+ loss_func = MLECCLoss(use_target_weight=True)
+ loss = loss_func(self.outputs, self.targets, target_weight)
+ self.assertIsInstance(loss, torch.Tensor)
+
+ def test_invalid_reduction(self):
+ with self.assertRaises(AssertionError):
+ MLECCLoss(reduction='invalid_reduction')
+
+ def test_invalid_mode(self):
+ with self.assertRaises(AssertionError):
+ MLECCLoss(mode='invalid_mode')
diff --git a/tests/test_models/test_utils/test_transformers.py b/tests/test_models/test_utils/test_transformers.py
new file mode 100644
index 0000000000..294400e314
--- /dev/null
+++ b/tests/test_models/test_utils/test_transformers.py
@@ -0,0 +1,159 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from unittest import TestCase
+
+import torch
+
+from mmpose.models.utils.transformer import GAUEncoder, SinePositionalEncoding
+
+
+class TestSinePositionalEncoding(TestCase):
+
+ def test_init(self):
+
+ spe = SinePositionalEncoding(out_channels=128)
+ self.assertTrue(hasattr(spe, 'dim_t'))
+ self.assertFalse(spe.dim_t.requires_grad)
+ self.assertEqual(spe.dim_t.size(0), 128 // 2)
+
+ spe = SinePositionalEncoding(out_channels=128, learnable=True)
+ self.assertTrue(spe.dim_t.requires_grad)
+
+ spe = SinePositionalEncoding(out_channels=128, eval_size=10)
+ self.assertTrue(hasattr(spe, 'pos_enc_10'))
+ self.assertEqual(spe.pos_enc_10.size(-1), 128)
+
+ spe = SinePositionalEncoding(
+ out_channels=128, eval_size=(2, 3), spatial_dim=2)
+ self.assertTrue(hasattr(spe, 'pos_enc_(2, 3)'))
+ self.assertSequenceEqual(
+ getattr(spe, 'pos_enc_(2, 3)').shape[-2:], (128, 2))
+
+    def test_generate_pos_encoding(self):
+
+ # spatial_dim = 1
+ spe = SinePositionalEncoding(out_channels=128)
+ pos_enc = spe.generate_pos_encoding(size=10)
+ self.assertSequenceEqual(pos_enc.shape, (10, 128))
+
+ position = torch.arange(8)
+ pos_enc = spe.generate_pos_encoding(position=position)
+ self.assertSequenceEqual(pos_enc.shape, (8, 128))
+
+ with self.assertRaises(AssertionError):
+ pos_enc = spe.generate_pos_encoding(size=10, position=position)
+
+ # spatial_dim = 2
+ spe = SinePositionalEncoding(out_channels=128, spatial_dim=2)
+ pos_enc = spe.generate_pos_encoding(size=10)
+ self.assertSequenceEqual(pos_enc.shape, (100, 128, 2))
+
+ pos_enc = spe.generate_pos_encoding(size=(5, 6))
+ self.assertSequenceEqual(pos_enc.shape, (30, 128, 2))
+
+ position = torch.arange(8).unsqueeze(1).repeat(1, 2)
+ pos_enc = spe.generate_pos_encoding(position=position)
+ self.assertSequenceEqual(pos_enc.shape, (8, 128, 2))
+
+ with self.assertRaises(AssertionError):
+ pos_enc = spe.generate_pos_encoding(size=10, position=position)
+
+ with self.assertRaises(ValueError):
+ pos_enc = spe.generate_pos_encoding(size=position)
+
+ def test_apply_additional_pos_enc(self):
+
+ # spatial_dim = 1
+ spe = SinePositionalEncoding(out_channels=128)
+ pos_enc = spe.generate_pos_encoding(size=10)
+ feature = torch.randn(2, 3, 10, 128)
+ out_feature = spe.apply_additional_pos_enc(feature, pos_enc,
+ spe.spatial_dim)
+ self.assertSequenceEqual(feature.shape, out_feature.shape)
+
+ # spatial_dim = 2
+ spe = SinePositionalEncoding(out_channels=128 // 2, spatial_dim=2)
+ pos_enc = spe.generate_pos_encoding(size=(2, 5))
+ feature = torch.randn(2, 3, 10, 128)
+ out_feature = spe.apply_additional_pos_enc(feature, pos_enc,
+ spe.spatial_dim)
+ self.assertSequenceEqual(feature.shape, out_feature.shape)
+
+ def test_apply_rotary_pos_enc(self):
+
+ # spatial_dim = 1
+ spe = SinePositionalEncoding(out_channels=128)
+ pos_enc = spe.generate_pos_encoding(size=10)
+ feature = torch.randn(2, 3, 10, 128)
+ out_feature = spe.apply_rotary_pos_enc(feature, pos_enc,
+ spe.spatial_dim)
+ self.assertSequenceEqual(feature.shape, out_feature.shape)
+
+ # spatial_dim = 2
+ spe = SinePositionalEncoding(out_channels=128, spatial_dim=2)
+ pos_enc = spe.generate_pos_encoding(size=(2, 5))
+ feature = torch.randn(2, 3, 10, 128)
+ out_feature = spe.apply_rotary_pos_enc(feature, pos_enc,
+ spe.spatial_dim)
+ self.assertSequenceEqual(feature.shape, out_feature.shape)
+
+
+class TestGAUEncoder(TestCase):
+
+ def test_init(self):
+ gau = GAUEncoder(in_token_dims=64, out_token_dims=64)
+ self.assertTrue(gau.shortcut)
+
+ gau = GAUEncoder(in_token_dims=64, out_token_dims=64, dropout_rate=0.5)
+ self.assertTrue(hasattr(gau, 'dropout'))
+
+ def test_forward(self):
+ gau = GAUEncoder(in_token_dims=64, out_token_dims=64)
+
+ # compatibility with various dimension input
+ feat = torch.randn(2, 3, 64)
+ with torch.no_grad():
+ out_feat = gau.forward(feat)
+ self.assertSequenceEqual(feat.shape, out_feat.shape)
+
+ feat = torch.randn(1, 2, 3, 64)
+ with torch.no_grad():
+ out_feat = gau.forward(feat)
+ self.assertSequenceEqual(feat.shape, out_feat.shape)
+
+ feat = torch.randn(1, 2, 3, 4, 64)
+ with torch.no_grad():
+ out_feat = gau.forward(feat)
+ self.assertSequenceEqual(feat.shape, out_feat.shape)
+
+ # positional encoding
+ gau = GAUEncoder(
+ s=32, in_token_dims=64, out_token_dims=64, pos_enc=True)
+ feat = torch.randn(2, 3, 64)
+ spe = SinePositionalEncoding(out_channels=32)
+ pos_enc = spe.generate_pos_encoding(size=3)
+ with torch.no_grad():
+ out_feat = gau.forward(feat, pos_enc=pos_enc)
+ self.assertSequenceEqual(feat.shape, out_feat.shape)
+
+ gau = GAUEncoder(
+ s=32,
+ in_token_dims=64,
+ out_token_dims=64,
+ pos_enc=True,
+ spatial_dim=2)
+ feat = torch.randn(1, 2, 6, 64)
+ spe = SinePositionalEncoding(out_channels=32, spatial_dim=2)
+ pos_enc = spe.generate_pos_encoding(size=(2, 3))
+ with torch.no_grad():
+ out_feat = gau.forward(feat, pos_enc=pos_enc)
+ self.assertSequenceEqual(feat.shape, out_feat.shape)
+
+        # forward with an attention mask
+        gau = GAUEncoder(in_token_dims=64, out_token_dims=64)
+
+        feat = torch.randn(2, 3, 64)
+        mask = torch.rand(2, 3, 3)
+        with torch.no_grad():
+            out_feat = gau.forward(feat, mask=mask)
+        self.assertSequenceEqual(feat.shape, out_feat.shape)
diff --git a/tests/test_structures/test_bbox/test_bbox_transforms.py b/tests/test_structures/test_bbox/test_bbox_transforms.py
index b2eb3da683..d70c39e08c 100644
--- a/tests/test_structures/test_bbox/test_bbox_transforms.py
+++ b/tests/test_structures/test_bbox/test_bbox_transforms.py
@@ -4,7 +4,8 @@
import numpy as np
from mmpose.structures.bbox import (bbox_clip_border, bbox_corner2xyxy,
- bbox_xyxy2corner, get_pers_warp_matrix)
+ bbox_xyxy2corner, get_pers_warp_matrix,
+ get_warp_matrix)
class TestBBoxClipBorder(TestCase):
@@ -124,3 +125,61 @@ def test_get_pers_warp_matrix_scale_rotation_shear(self):
# Use np.allclose to compare floating-point arrays within a tolerance
self.assertTrue(
np.allclose(warp_matrix, expected_matrix, rtol=1e-3, atol=1e-3))
+
+
+class TestGetWarpMatrix(TestCase):
+
+ def test_basic_transformation(self):
+ # Test with basic parameters
+ center = np.array([100, 100])
+ scale = np.array([50, 50])
+ rot = 0
+ output_size = (200, 200)
+ warp_matrix = get_warp_matrix(center, scale, rot, output_size)
+ expected_matrix = np.array([[4, 0, -300], [0, 4, -300]])
+ np.testing.assert_array_almost_equal(warp_matrix, expected_matrix)
+
+ def test_rotation(self):
+ # Test with rotation
+ center = np.array([100, 100])
+ scale = np.array([50, 50])
+ rot = 45 # 45 degree rotation
+ output_size = (200, 200)
+ warp_matrix = get_warp_matrix(center, scale, rot, output_size)
+ expected_matrix = np.array([[2.828427, 2.828427, -465.685303],
+ [-2.828427, 2.828427, 100.]])
+ np.testing.assert_array_almost_equal(warp_matrix, expected_matrix)
+
+ def test_shift(self):
+ # Test with shift
+ center = np.array([100, 100])
+ scale = np.array([50, 50])
+ rot = 0
+ output_size = (200, 200)
+ shift = (0.1, 0.1) # 10% shift
+ warp_matrix = get_warp_matrix(
+ center, scale, rot, output_size, shift=shift)
+ expected_matrix = np.array([[4, 0, -320], [0, 4, -320]])
+ np.testing.assert_array_almost_equal(warp_matrix, expected_matrix)
+
+ def test_inverse(self):
+ # Test inverse transformation
+ center = np.array([100, 100])
+ scale = np.array([50, 50])
+ rot = 0
+ output_size = (200, 200)
+ warp_matrix = get_warp_matrix(
+ center, scale, rot, output_size, inv=True)
+ expected_matrix = np.array([[0.25, 0, 75], [0, 0.25, 75]])
+ np.testing.assert_array_almost_equal(warp_matrix, expected_matrix)
+
+ def test_aspect_ratio(self):
+ # Test with fix_aspect_ratio set to False
+ center = np.array([100, 100])
+ scale = np.array([50, 20])
+ rot = 0
+ output_size = (200, 200)
+ warp_matrix = get_warp_matrix(
+ center, scale, rot, output_size, fix_aspect_ratio=False)
+ expected_matrix = np.array([[4, 0, -300], [0, 10, -900]])
+ np.testing.assert_array_almost_equal(warp_matrix, expected_matrix)
diff --git a/tools/misc/generate_bbox_file.py b/tools/misc/generate_bbox_file.py
new file mode 100644
index 0000000000..bb13dc0866
--- /dev/null
+++ b/tools/misc/generate_bbox_file.py
@@ -0,0 +1,70 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+import json
+
+import numpy as np
+from mmengine import Config
+
+from mmpose.evaluation.functional import nms
+from mmpose.registry import DATASETS
+from mmpose.structures import bbox_xyxy2xywh
+from mmpose.utils import register_all_modules
+
+try:
+ from mmdet.apis import DetInferencer
+ has_mmdet = True
+except ImportError:
+ print('Please install mmdet to use this script!')
+ has_mmdet = False
+
+
+def main():
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('det_config')
+ parser.add_argument('det_weight')
+ parser.add_argument('output', type=str)
+ parser.add_argument(
+ '--pose-config',
+ default='configs/body_2d_keypoint/topdown_heatmap/'
+ 'coco/td-hm_hrnet-w32_8xb64-210e_coco-256x192.py')
+    parser.add_argument('--score-thr', type=float, default=0.1)
+    parser.add_argument('--nms-thr', type=float, default=0.65)
+ args = parser.parse_args()
+
+ register_all_modules()
+
+ config = Config.fromfile(args.pose_config)
+ config.test_dataloader.dataset.data_mode = 'bottomup'
+ config.test_dataloader.dataset.bbox_file = None
+ test_set = DATASETS.build(config.test_dataloader.dataset)
+ print(f'number of images: {len(test_set)}')
+
+ detector = DetInferencer(args.det_config, args.det_weight)
+
+ new_bbox_files = []
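+    # Run the detector on every image of the test split, drop low-scoring
+    # boxes, apply NMS and store the survivors as COCO-style xywh entries.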
+ for i in range(len(test_set)):
+ data = test_set.get_data_info(i)
+ image_id = data['img_id']
+ img_path = data['img_path']
+ result = detector(
+ img_path,
+ return_datasamples=True)['predictions'][0].pred_instances.numpy()
+ bboxes = np.concatenate((result.bboxes, result.scores[:, None]),
+ axis=1)
+ bboxes = bboxes[bboxes[..., -1] > args.score_thr]
+ bboxes = bboxes[nms(bboxes, args.nms_thr)]
+ scores = bboxes[..., -1].tolist()
+ bboxes = bbox_xyxy2xywh(bboxes[..., :4]).tolist()
+
+ for bbox, score in zip(bboxes, scores):
+ new_bbox_files.append(
+ dict(category_id=1, image_id=image_id, score=score, bbox=bbox))
+
+ with open(args.output, 'w') as f:
+ json.dump(new_bbox_files, f, indent='')
+
+
+if __name__ == '__main__':
+ if has_mmdet:
+ main()
diff --git a/tools/misc/pth_transfer.py b/tools/misc/pth_transfer.py
index 7433c6771e..cf59a5bd53 100644
--- a/tools/misc/pth_transfer.py
+++ b/tools/misc/pth_transfer.py
@@ -14,6 +14,8 @@ def change_model(args):
all_name.append((name[8:], v))
elif name.startswith('distill_losses.loss_mgd.down'):
all_name.append(('head.' + name[24:], v))
+ elif name.startswith('teacher.neck'):
+ all_name.append((name[8:], v))
elif name.startswith('student.head'):
all_name.append((name[8:], v))
else:
diff --git a/tools/slurm_test.sh b/tools/slurm_test.sh
index c528dc9d45..019f995c23 100644
--- a/tools/slurm_test.sh
+++ b/tools/slurm_test.sh
@@ -10,7 +10,6 @@ CHECKPOINT=$4
GPUS=${GPUS:-8}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_TASK=${CPUS_PER_TASK:-5}
-PY_ARGS=${@:5}
SRUN_ARGS=${SRUN_ARGS:-""}
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
@@ -22,4 +21,4 @@ srun -p ${PARTITION} \
--cpus-per-task=${CPUS_PER_TASK} \
--kill-on-bad-exit=1 \
${SRUN_ARGS} \
- python -u tools/test.py ${CONFIG} ${CHECKPOINT} --launcher="slurm" ${PY_ARGS}
+ python -u tools/test.py ${CONFIG} ${CHECKPOINT} --launcher="slurm" ${@:5}
diff --git a/tools/slurm_train.sh b/tools/slurm_train.sh
index c3b65490a5..a0df8ce259 100644
--- a/tools/slurm_train.sh
+++ b/tools/slurm_train.sh
@@ -11,7 +11,6 @@ GPUS=${GPUS:-8}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_TASK=${CPUS_PER_TASK:-5}
SRUN_ARGS=${SRUN_ARGS:-""}
-PY_ARGS=${@:5}
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
srun -p ${PARTITION} \
@@ -22,4 +21,4 @@ srun -p ${PARTITION} \
--cpus-per-task=${CPUS_PER_TASK} \
--kill-on-bad-exit=1 \
${SRUN_ARGS} \
- python -u tools/train.py ${CONFIG} --work-dir=${WORK_DIR} --launcher="slurm" ${PY_ARGS}
+ python -u tools/train.py ${CONFIG} --work-dir=${WORK_DIR} --launcher="slurm" ${@:5}