diff --git a/index.html b/index.html index f2e6b4d..7a97469 100644 --- a/index.html +++ b/index.html @@ -76,7 +76,7 @@

- There are 48 combinatorial ways of assigning coordinate frame axes (assign right/left, up/down, and forward/backward to x, y, z, which is $6 \times 4 \times 2$), and it seems as if our disciplines give their best in trying to use all of them. Unfortunately, this means there are $48\times47=2556$ ways of converting between coordinate frames, and each of them is dangerously close to a bug. As if that were not enough, words like extrinsics or pose matrix are used in different meanings, adding to the confusion that inherently surrounds transforms and rotations. + There are 48 combinatorial ways of assigning coordinate frame axes (assign right/left, up/down, and forward/backward to x, y, z, which is $6 \times 4 \times 2$), and it seems as if our disciplines give their best in trying to use all of them. Unfortunately, this means there are $48\times47=2556$ ways of converting between coordinate frames, and each of them is dangerously close to a bug. As if that were not enough, words like extrinsics or pose matrix are used with different meanings, adding to the confusion that inherently surrounds transforms and rotations. Read more diff --git a/posts/cam-transform/index.html b/posts/cam-transform/index.html index 8aa21e7..740756f 100644 --- a/posts/cam-transform/index.html +++ b/posts/cam-transform/index.html @@ -86,14 +86,14 @@

Camera Conventions, Transforms, and Conversions

There are 48 combinatorial ways of assigning coordinate frame axes (assign right/left, up/down, and forward/backward to x, y, z, which is $6 \times 4 \times 2$), and it seems as if our disciplines give their best in trying to use all of them. Unfortunately, this means there are $48\times47=2556$ ways of converting between coordinate frames, and each of them is dangerously close to a bug. -As if that were not enough, words like extrinsics or pose matrix are used in different meanings, adding to the confusion that inherently surrounds transforms and rotations. +As if that were not enough, words like extrinsics or pose matrix are used with different meanings, adding to the confusion that inherently surrounds transforms and rotations. My coordinate frame convention conversion tool (https://mkari.de/coord-converter/) simplifies this process radically. In this blog post, I discuss the underlying background of transforms, the ‘manual’ process of performing point and especially rotation conversions, and the tasks that typically follow or are associated with them.

Table of Contents


  • Table of Contents
  • -
  • Step 1: Understanding coordinate frame conventions
  • +
  • Step 1: Understanding coordinate frame conventions
  • Step 2: Understanding camera transforms
    • 2a: Starting with a world-aligned camera
    • @@ -120,7 +120,7 @@

      Table of Contents

    • Properties of rotations @@ -139,7 +139,7 @@

      Table of Contents

      • Step 1: Visualizing world and camera axes
      • Step 2: Specifying a convention transform
      • -
      • Step 3: Point conversion
      • +
      • Step 3: Point conversions
      • Step 4: Rotation conversion
        -

        This post serves my personal primer whenever I start a new cross-convention 3D projects (e.g., capturing poses with sensors in one convention and processing them in a 3D engine with a different convention). -As a refresher, it’s also always good to take a look at the lectures on transforms by Prof. Kenneth Joy (https://www.youtube.com/playlist?list=PL_w_qWAQZtAZhtzPI5pkAtcUVgmzdAP8g).

        -

        Step 1: Understanding coordinate frame conventions

        -

        Of the above mentioned 48 possible conventions, there are some conventions that are used quite frequently. -In computer graphics, our 2D plane of interest is the image, thus the plane axes are denominated with x and y, while z refers to depth (hence z-fighting, z-buffering, etc.) and thus is assigned the forward or backward direction. -In navigation (and thus aviation), our 2D plane of interest is the earth surface, as we often think about navigating 2D maps, and thus x and y refer to forward/backward and left/right, whereas z refers to the up/down direction. +

        This post serves as my personal primer whenever I start a new cross-convention 3D project (e.g., capturing poses with sensors in one convention and processing them in a 3D engine with a different convention). +As a refresher, it’s also always good to take a look at the lecture series on computer graphics by Prof. Kenneth Joy.

        +

Step 1: Understanding coordinate frame conventions

Of the above-mentioned 48 possible conventions, there are some conventions that are used quite frequently.

        +

In computer graphics, our 2D plane of interest is the image, thus the plane axes are denoted by x and y, while z refers to depth (hence z-fighting, z-buffering, etc.) and thus is assigned the forward or backward direction. +In navigation (and thus aviation), our 2D plane of interest is the earth’s surface, as we often think about navigating 2D maps, and thus x and y refer to forward/backward and left/right, whereas z refers to the up/down direction. I have encountered the following conventions over the past years and maintain the table below as a quick cheat sheet.

        @@ -235,7 +234,7 @@

        Step 1: Understanding

        Table: Coordinate system conventions and camera conventions in different frameworks

        To specify our coordinate frame convention, we need to indicate three pieces of information as we have three axes. -One way of indicating the convention is specifying x as right or left, y as up or down, and z back or forward. +One way of indicating the convention is specifying x as right or left, y as up or down, and z as back or forward. This is my preferred way of indicating a convention. An alternative way of indicating the convention is specifying only two axes explicitly and providing the handedness of the coordinate frame so we can derive the missing axis by using the corresponding hand’s rule.

        The handedness of a coordinate system dictates two principles: @@ -278,11 +277,11 @@

        Step 1: Understanding

        Step 2: Understanding camera transforms

        Before getting to convert between different coordinate conventions, let’s begin at the beginning, i.e., using a single coordinate convention.

        2a: Starting with a world-aligned camera

        -

        The most basic thing in a coordinate frame is easy to understand: describe the position of a point $p = (x, y, z)$, which means go x units along the x axis, etc. -However, it gets a bit more difficult when we use the coordinate frame to describe transformations of either of point clouds or the camera. +

        The most basic thing in a coordinate frame is easy to understand: describe the position of a point $p = (x, y, z)$, which means go x units along the x-axis, etc. +However, it gets a bit more difficult when we use the coordinate frame to describe transformations of either point clouds or the camera. Most importantly, let’s consider rigid transformations, i.e., point motion or camera motion, and here let’s start with camera motion.

        -

        For example, ARKit uses a right-handed coordinate system with x right, y up, z backward, i.e., it is right-handed. -Consider an initial state where the smartphone is in landscape mode with the phone top to the left, and the world coordinate frame is reset, so that world coordinate frame and camera coordinate frame are perfectly aligned. +

For example, ARKit uses a coordinate system with x right, y up, and z backward, i.e., it is right-handed. +Consider an initial state where the smartphone is in landscape mode with the phone top to the left, and the world coordinate frame is reset, so that the world coordinate frame and camera coordinate frame are perfectly aligned. Then, the camera looks down the negative z-axis. The camera’s up direction points along the positive y-axis. Thus (using the right-hand rule), the x-axis points to the right. @@ -300,9 +299,9 @@

        2a: Starting with a world-align $$

        -

        As ARKit uses gravity to determine the world’s up direction, any initial rotation in the world configuration about the z-axis does not influence the world coordinate frame definition. +

        As ARKit uses gravity to determine the world’s upward direction, any initial rotation in the world configuration about the z-axis does not influence the world coordinate frame definition. That is, if the phone were in standard portrait mode at reset time, it would be rotated about the z-axis from the start. -The right-hand grip rule tells us, that it would be rotated in positive direction by 270 degrees when started up in standard portrait mode.

+The right-hand grip rule tells us that it would be rotated in the positive direction by 270 degrees when started up in standard portrait mode.

        2b: Understanding camera translation & specifying cam-to-world translation

        Imagine physically mounting a plastic coordinate frame to the iPhone in this world-aligned state. Now, as we translate the camera through space and rotate it, the plastic coordinate frame also moves in lock-step. @@ -324,7 +323,7 @@

        To summarize, there are two ways of interpreting translation (and as we will see rotation) matrices:

          -
        1. “physically moving” a camera from one coordinate frames to the next while remaining in the same world coordinate system.
        2. +
        3. “physically moving” a camera from one coordinate frame to the next while remaining in the same world coordinate system.
        4. converting a point to a different notation of the exact same point where each notation is dependent upon the chosen coordinate frame.

        Both notions are numerically equivalent, i.e., the same numbers result. -They simply offer different approaches of “thinking” the transform operation. -Interesting, I am under the impression that the majority of books, people and frameworks operate with the second interpretation. +They simply offer different approaches to “thinking” the transform operation. +Interestingly, I am under the impression that the majority of books, people and frameworks operate with the second interpretation whenever possible. In contrast, I nearly always think of operations in the first manner.

        Here is my mnemonic technique to remember what cam-to-world means:

          @@ -352,14 +351,14 @@

          2c: Understanding camera rotation & specifying the cam-to-world rotation from an orthonormal basis

          -

          Now, instead, let’s imagine we stay in the world orign, and spin around the heel for +90 degrees around y (this means to the counter-clockwise, so that we face to the left afterwards). -The iPhone’s y-axis still points upwards, i.e.,positive y remains positive y. +

Now, instead, let’s imagine we stay at the world origin and spin on our heels by +90 degrees around y (this means counter-clockwise so that we face to the left afterward). +The iPhone’s y-axis still points upwards, i.e., positive y remains positive y. Its x-axis, however, now points to what we call negative z in the world coordinate frame (as we can also see from the plastic coordinate frame mounted to it). -And the iPhone’s z-axis now points to what we called positive x.

+And the iPhone’s z-axis now points to what we call positive x.

How do we describe this numerically? -Of course, $(1, 0, 0)_{world}$ in world coordinates becomes $(1, 0, 0)_{cam}$ in camera coodinates but that is trivial and does not provide us with information about the spatial relationship between the two. +Of course, $(1, 0, 0)_{world}$ in world coordinates becomes $(1, 0, 0)_{cam}$ in camera coordinates but that is trivial and does not provide us with information about the spatial relationship between the two. Instead, we need to know how the camera coordinate axes are situated _with respect to the world axes_. Describing the camera motion from the perspective of the world coordinate system, the world's *positive x* $(1, 0, 0)_{\text{world}}$ becomes $(0, 0, -1)_{\text{world}}$ for the camera's x-axis. @@ -414,7 +413,7 @@

          -

          In summary, we can easily construct our desired rotation matrix by stacking the unit vectors pointing along the three desired new axis as columns, thus describing where each axis goes. -The matrix constructed like this will comply to the to two necessary and sufficient (i.e., “if and only if”) properties satified by any rotation matrix:

          +

          In summary, we can easily construct our desired rotation matrix by stacking the unit vectors pointing along the three desired new axes as columns, thus describing where each axis goes. +The matrix constructed like this will comply with the necessary and sufficient (i.e., “if and only if”) properties satisfied by any rotation matrix:

          • orthogonality ($R^{-1}=R^T$)
          • $det(R) = 1$

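To make this concrete, here is a minimal NumPy sketch (variable names are mine) that stacks the target directions of the camera axes from the running example as columns and then checks both properties:

import numpy as np

# where each camera axis ends up in world coordinates after the +90-degree spin about y
x_axis = np.array([0., 0., -1.])  # camera x now points along world -z
y_axis = np.array([0., 1., 0.])   # camera y still points along world +y
z_axis = np.array([1., 0., 0.])   # camera z now points along world +x

R_cam_to_world = np.column_stack([x_axis, y_axis, z_axis])

assert np.allclose(R_cam_to_world.T @ R_cam_to_world, np.eye(3))  # orthogonality
assert np.isclose(np.linalg.det(R_cam_to_world), 1.0)             # det(R) = +1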
          In the same way that the above-constructed cam-to-world translation vector can be interpreted

            -
          1. as a vector that translate the origin-aligned camera from (0, 0, 0) in world coordinates out into the world and equivalently
          2. +
          3. as a vector that translates the origin-aligned camera from (0, 0, 0) in world coordinates out into the world and equivalently
          4. as a vector that translates any point from the camera coordinate system into world coordinates, the so-constructed cam-to-world rotation matrix can be interpreted analogously.
          @@ -444,10 +443,10 @@

          either as moving the camera from an initial world-aligned state to a new state that is described in world coordinates,
        • or as converting camera coordinates into world coordinates.
        • -

          After defining the camera’s translation vector and the camera’s 3x3 rotation matrix with columns like this, we often want to combine them in a single 4x4 homogeneous pose matrix $P$ (synoymously cam-to-world matrix $M_{\text{cam-to-world}}$) that can be easily premulitplied to incoming points.

          +

After defining the camera’s translation vector and the camera’s 3x3 rotation matrix with columns like this, we often want to combine them in a single 4x4 homogeneous pose matrix $P$ (synonymously cam-to-world matrix $M_{\text{cam-to-world}}$) that can be easily premultiplied to incoming points.

          We have at least two ways of obtaining $M_{\text{cam-to-world}}$.

          Note that I use the term pre-multiplying in the following, even though there is no point to the right. This linguistic convention serves as a reminder that corresponding geometric operations are applied from right to left. You can always imagine a point to the right to make sense of this.

          Alternative 1: Pre-multiplying the cam-to-world rotation and translation in the correct order

          -

          Imagine a camera in the origin and a point 5 meter in front of it. -Imagine the new camera is rotated to the left and also moved 1 meter to the left (hovering above your sholder). -To transform the point ahead to also be ahead to the camera, we first need to rotate, and then translate it.

          +

Imagine a camera at the origin and a point 5 meters in front of it. +Imagine the new camera is rotated to the left and also moved 1 meter to the left (hovering above your shoulder). +To transform the point ahead to also be ahead of the camera, we first need to rotate, and then translate it.

          To do so,

          • first, compose the homogeneous translation matrix, i.e., take an identity matrix of 4x4, then plug in the translation $t$ into the last column’s top three values (yielding $T_{\text{cam-to-world}}$)
          • @@ -476,7 +475,7 @@

            Alternative 2: Plugging together the pose matrix directly

            Alternatively, we can also plug the pose matrix together from the camera position and cam-to-world rotation directly. -Simply initialize an identity matrix (M = np.identity(4)), +Simply initialize an identity matrix (M = np.identity(4)), plug in the 3x3 cam-to-world rotation into the first three columns and top three rows (M[:3,:3] = r), and then the 3x1 translation vector (M[:3,3] = t) into the last column.

            This yields:

            @@ -490,26 +489,26 @@

            Alternative 2: $$

            -

            Note that we can easily read-out the camera position from the pose matrix (hence, the name).

            +

            Note that we can easily read out the camera position from the pose matrix (hence, the name).

These two alternatives produce the same result, which we call the pose matrix or cam-to-world matrix, i.e.,

            $$P=M_{\text{cam-to-world}}=[R_{\text{cam-to-world}}|t]=T\cdot R.$$
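As a sanity check, here is a short NumPy sketch of both alternatives (the concrete r and t are my assumptions based on the running example: the +90-degree rotation about y and a camera 1 meter to the left):

import numpy as np

r = np.array([[0., 0., 1.],
              [0., 1., 0.],
              [-1., 0., 0.]])   # cam-to-world rotation
t = np.array([-1., 0., 0.])     # camera position in world coordinates

# alternative 1: pre-multiply the homogeneous translation and rotation, T @ R
T = np.identity(4); T[:3, 3] = t
R = np.identity(4); R[:3, :3] = r
P_1 = T @ R

# alternative 2: plug rotation and translation into an identity matrix directly
P_2 = np.identity(4)
P_2[:3, :3] = r
P_2[:3, 3] = t

assert np.allclose(P_1, P_2)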

            2e: Obtaining the camera extrinsics matrix (synonymously: the world-to-cam matrix for the computer graphics & projection pipeline)

            The extrinsics matrix $E$ converts from world coordinates to camera coordinates. This is why it is extensively used in computer graphics where we need to get 3D points in world space into camera space before projecting them into image space.

            It can also be understood as moving the camera together with all its surrounding points in world coordinates to the origin while always remaining in the world coordinate system.

            -

            Again, there is multiple ways of obtaining it.

            +

            Again, there are multiple ways of obtaining it.

            Alternative 1: Inverting the cam-to-world matrix (synonymously: inverting the pose matrix)

            For this approach, we leverage that

            $E=M_{\text{world-to-cam}} = M_{\text{cam-to-world}}^{-1}$ .

            To this end, we invert the pose matrix using something like Python’s np.linalg.inv(m) or Unity’s Matrix4x4 inv = m.inverse;.

            Alternative 2: Pre-multiplying the inverses of the cam-to-world rotation and cam-to-world translation in the correct order

            -

            The above alterantive 1 inverts the 4x4 matrix numerically. +

            The above alternative 1 inverts the 4x4 matrix numerically. If we want to avoid this, we can instead reduce the problem to simpler inversions of the underlying rotation and translation matrices.

            -

            To this end, we first start with plugging the basics from above together:

            +

            To this end, we first start by plugging the basics from above together:

            $E=M_{\text{world-to-cam}} = (M_{\text{cam-to-world}})^{-1} = (T_{\text{cam-to-world}} \cdot R_{\text{cam-to-world}})^{-1}$

            Then, we can exploit the fact that the inverse of a product of two invertible matrices is the same as the product of the inverted matrices in reverse order, which finally gives us:

            $E=M_{\text{world-to-cam}} = R_{\text{cam-to-world}}^{-1} \cdot T_{\text{cam-to-world}}^{-1}$

            -

            Inverting the rotation matrix and the translation matrix is much very simple as

            +

            Inverting the rotation matrix and the translation matrix is very simple as

            • inverting a rotation matrix is taking its transpose,
            • inverting a translation matrix is flipping all signs in the last-column top-three elements.
            • @@ -528,7 +527,7 @@

              Alternat

              The top left 3x3 slice contains $R_{\text{world-to-cam}}$ which is the inverted $R_{\text{cam-to-world}}$ matrix. Remember that $R^T=R^{-1}$, so a valid rotation matrix is easy to invert. -The last 3x1 column is obtained as $t^*=-R_{\text{world-to-cam}}t$, where $t$ is the camera’s position in world. +The last 3x1 column is obtained as $t^*=-R_{\text{world-to-cam}}t$, where $t$ is the camera’s position in world coordinates. The bottom 1x4 row contains the homogeneous appendix $(0., 0., 0., 1.)$.

              Beware, too many times, I have tried to plug in the sign-swapped translation vector directly here, forgetting to premultiply $R_{\text{world-to-cam}}$.
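Here is a small NumPy sketch of this construction, including a check against the numerical inverse (r and t again from the running example):

import numpy as np

r = np.array([[0., 0., 1.],
              [0., 1., 0.],
              [-1., 0., 0.]])   # cam-to-world rotation
t = np.array([-1., 0., 0.])     # camera position in world coordinates

M_cam_to_world = np.identity(4)
M_cam_to_world[:3, :3] = r
M_cam_to_world[:3, 3] = t

E = np.identity(4)
E[:3, :3] = r.T        # inverting the rotation is taking its transpose
E[:3, 3] = -r.T @ t    # not just -t: pre-multiply R_world-to-cam first

assert np.allclose(E, np.linalg.inv(M_cam_to_world))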

              2f: Summary & cam transform convention

              @@ -555,8 +554,8 @@

              2f: Summary & cam transform co

              Table: different matrices involved in computer vision and computer graphics

              -

              The intrisics matrix contains the measured properties of the camera and can used to project 3D points onto the image plane. -The projection matrix can additionally take care for far-near clipping and viewport clipping. +

The intrinsics matrix contains the measured properties of the camera and can be used to project 3D points onto the image plane. +The projection matrix can additionally take care of far-near clipping and viewport clipping. Thus, while the intrinsics matrix contains only information relating to the camera properties, the projection matrix also contains arbitrarily chosen rendering parameters.
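For illustration, here is a minimal pinhole-projection sketch with made-up intrinsics (focal lengths fx, fy and principal point cx, cy in pixels), assuming a convention where the camera looks down positive z as, e.g., in OpenCV:

import numpy as np

K = np.array([[1000.,    0., 640.],   # fx,  0, cx
              [   0., 1000., 360.],   #  0, fy, cy
              [   0.,    0.,   1.]])

p_cam = np.array([0.2, -0.1, 2.0])    # a point in camera coordinates, 2 m in front
u, v, w = K @ p_cam
print(u / w, v / w)                   # pixel coordinates: 740.0 310.0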

In this post, we take particular interest in the pose matrix and the extrinsics matrix, which can be obtained as follows:

              @@ -568,8 +567,8 @@

              2f: Summary & cam transform co

              -

              It always important to think about which “cam transform convention” the framework one is working with follows. -Sometimes, the camera position and rotation is indicated via a pose matrix, sometimes as an extrinsics matrix.

              +

              It is always important to think about which “cam transform convention” the framework one is working with follows. +Sometimes, the camera position and rotation are indicated via a pose matrix, at other times they are represented as an extrinsics matrix.

For example, in ARKit, frame.camera.transform.columns refers to a camera pose matrix, not an extrinsics matrix.

              @@ -594,10 +593,10 @@

              2f: Summary & cam transform co

              To summarize, one good approach is to first obtain the pose matrix (cam-to-world matrix) and then the extrinsics matrix (world-to-camera matrix), as follows:

                -
              1. derive the rotation that maps each camera basis vectors to the directions provided by the world coordinate frame,
              2. +
              3. derive the rotation that maps each camera basis vector to a direction provided by the world coordinate frame,
              4. derive the translation that moves the world origin onto the camera coordinate,
              5. then rotate and afterwards translate ($T’ R’ \times p$) to obtain the cam-to-world matrix,
              6. -
              7. finally invert the whole thing to obtain world-to-cam matrix.
              8. +
              9. finally, invert the whole thing to obtain the world-to-cam matrix.

              Table: cam transform convention

              Step 3: Understanding properties and computing rules with transforms and rotation matrices

              @@ -605,9 +604,9 @@

              Properties of rotations

              Definition

              A matrix is a rotation matrix if and only if

                -
              1. $R$ is an orthormal matrix, i.e., +
              2. $R$ is an orthonormal matrix, i.e.,
                  -
                • $R$ is orthogonal (i.e., the matrix’ inverse is equal to its tranpose $R^{-1}=R^T$), and
                • +
                • $R$ is orthogonal (i.e., the matrix inverse is equal to its transpose $R^{-1}=R^T$), and
                • each row vector of $R$ has length 1,
              3. @@ -625,17 +624,17 @@

                Definition

# properties of a rotation (tolerance-based checks, since inv(R) is computed in floating point)
assert np.isclose(np.linalg.det(R), 1)
assert np.allclose(R.T, np.linalg.inv(R))
-

                Further properties

                -

                Rotation matrices can be, but are not necessarily symmetric, i.e., $R = R^{T}$ does generally not hold.

                +

                Non-Symmetry

                +

Rotation matrices can be but are not necessarily symmetric, i.e., $R = R^{T}$ does not hold in general.

                Interesting implication of orthonormality

                -

                Orthonormality is equivalent to the property that the row vectors form an orthonomal basis. -This is equivalent to the property that the columns vector form an orthonormal basis (https://en.wikipedia.org/wiki/Orthogonal_matrix#Matrix_properties). +

Orthonormality is equivalent to the property that the row vectors form an orthonormal basis. +This is equivalent to the property that the column vectors form an orthonormal basis. As an interesting side effect of the orthonormality property, the rotation matrix contains redundancy as we only need two basis vectors, i.e., two rows or two columns, to compute the third vector as the cross product of the two given vectors.
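A quick NumPy check of this redundancy on the running example’s rotation matrix:

import numpy as np

R = np.array([[0., 0., 1.],
              [0., 1., 0.],
              [-1., 0., 0.]])

# for a proper rotation (det = +1), the third column/row is the cross product of the other two
assert np.allclose(np.cross(R[:, 0], R[:, 1]), R[:, 2])
assert np.allclose(np.cross(R[0, :], R[1, :]), R[2, :])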

                Interesting implication of $det(R) = +1$

                -

                An orthonormal matrix could have determinant -1 or +1 (https://en.wikipedia.org/wiki/Orthogonal_matrix), be a rotation matrix is orthonormal matrix with $det(R)=+1$ (https://en.wikipedia.org/wiki/Orthogonal_group, https://en.wikipedia.org/wiki/3D_rotation_group#Orthogonal_and_rotation_matrices).

                +

An orthonormal matrix can have determinant -1 or +1, but a rotation matrix is an orthonormal matrix with $det(R)=+1$ (https://en.wikipedia.org/wiki/Orthogonal_group, https://en.wikipedia.org/wiki/3D_rotation_group#Orthogonal_and_rotation_matrices).

                An axis sign flip cannot be represented by a proper rotation matrix. Flipping a sign in a rotation matrix with 3 positive or negative ones will flip the determinant. -So, to convert from a right-hand coordinate system to a left-hand coordinate system, we have to put first align two the three axes, and then flip the remaining one.

                +So, to convert from a right-hand coordinate system to a left-hand coordinate system, we have to first align two axes and then flip the remaining one.

                Properties of translations

                ### Translation ###
                 t = np.array([1, 2, 3])
                @@ -664,8 +663,8 @@ 

# this is not a way of obtaining the pose matrix
assert not np.all(R @ T == M)

                $R \times T = (T^{-1} \times R^{-1})^{-1}$

                -

                Too many times, I doubted myself because I was violating the $TRS\times p$ rule: first scale, then rotate, finally translate. -However, as described above when computing the extrinsics matrix from the pose matrix in alterantive, we are in need to first translate and only then rotate some of the matrices we are looking at.

                +

Too many times, I doubted myself because I was violating the $TRS\times p$ rule: first scale, then rotate, and finally translate. +However, as described above when computing the extrinsics matrix from the pose matrix via alternative 2, we need to first translate and only then rotate some of the matrices we are looking at.

                This follows from the following rule: Given two invertible matrices, the inverse of their product is the product of their inverses in reverse order (https://en.wikipedia.org/wiki/Invertible_matrix#Other_properties):

                $ @@ -708,20 +707,20 @@

                Step 1: Visualizing world and Are we looking at a global or local coordinate axis? For example, the camera coordinate system’s y-axis might drop to the world coordinate system’s x-axis during a rotation or convention transform, so be sure to be consistent about this.

                Step 2: Specifying a convention transform

                -

                A change in convention can be representation by a 3x3 or 4x4 convention transform. +

A change in convention can be represented by a 3x3 or 4x4 convention transform. If the handedness does not change between conventions, the convention transform is a proper rotation matrix. If handedness changes, the convention transform’s determinant becomes negative, making it what is sometimes called an improper rotation matrix.

However, this does not change the process at all. Instead, we simply need to ask ourselves how to define the convention transform.

                Imagine you get coordinates in NED (x forward, y right, z down) and want to map them to Unity (x right, y up, z forward).

                -

                My recipe is as follow:

                +

My recipe is as follows (worked through in the sketch after the list):

                  -
                1. Draw the camera perspectives on paper with the source coordinate convention (here: NED) on the left and target (here: Unity) on the right.
                2. -
                3. Iterate over the the left (i.e., source) axes and ask yourself: Which target unit do I get for my source unit. That’s basically what conversion is, right? For each axes, note the answer in a colum, filling up from left to right.
                4. -
                5. Once done, you have 3 colums, making up an orthonormal matrix with determinant 1 (i.e., a rotation matrix), or an orthonormal matrix with determinant -1 (because handedness flipped).
                6. +
                7. Draw the camera perspectives on paper with the source coordinate convention (here: NED) on the left and the target (here: Unity) on the right.
                8. +
                9. Iterate over the left (i.e., source) axes and ask yourself: Which target unit do I get for my source unit? That’s basically what conversion is, right? For each axis, note the answer in a column, filling up from left to right.
                10. +
                11. Once done, you have 3 columns, making up an orthonormal matrix with determinant 1 (i.e., a rotation matrix), or an orthonormal matrix with determinant -1 (because handedness flipped).
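Applied to the NED-to-Unity example, the recipe yields the following convention transform (a sketch; variable names are mine):

import numpy as np

# column by column: where does each NED source axis land in Unity?
ned_x = np.array([0., 0., 1.])    # NED x (forward) -> Unity forward = +z
ned_y = np.array([1., 0., 0.])    # NED y (right)   -> Unity right   = +x
ned_z = np.array([0., -1., 0.])   # NED z (down)    -> Unity down    = -y

C_ned_to_unity = np.column_stack([ned_x, ned_y, ned_z])

# handedness flips between NED (right-handed) and Unity (left-handed), so det = -1
assert np.isclose(np.linalg.det(C_ned_to_unity), -1.0)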
                -

                Step 3: Point conversion

                -

                In order to convert incoming points from the source coordinate frame, pre-multiply the convention transform matrix to your incoming source points.

                +

                Step 3: Point conversions

                +

                To convert incoming points from the source coordinate frame, pre-multiply the convention transform matrix to your incoming source points.
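For example, continuing the NED-to-Unity sketch from above:

import numpy as np

C_ned_to_unity = np.array([[0., 1., 0.],
                           [0., 0., -1.],
                           [1., 0., 0.]])

p_ned = np.array([5., 2., -1.])   # 5 m forward, 2 m right, 1 m up (NED z points down)
p_unity = C_ned_to_unity @ p_ned
print(p_unity)                    # [2. 1. 5.]: 2 m right, 1 m up, 5 m forward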

                Step 4: Rotation conversion

                Converting rotations is a bit more intricate again. The process depends on the rotation representation used by the source system: @@ -729,15 +728,15 @@

                Step 4: Rotation conversion

                Converting incoming rotation matrices

Imagine we obtain tracking data from ARKit (x right, y up, z backward) and want to visualize it in a 3D rendering engine that uses an NED convention (x forward, y right, z down). I chose this conversion example because all axes are different, making it easier to spot sign errors.

                -

                The example is visualized end-to-end in the next figure and verbalized afterwards in 3 steps.

                +

                The example is visualized end-to-end in the next figure and verbalized afterward in 3 steps.

                Visualization of the idea and process of a rotation matrix conversion

Consider the initial state where the phone is aligned with the ARKit world coordinate system. -Imagine a physically mounted plastic frustrum extending forward as well as two physically mounted coordinate frames, one in the ARKit convention and one in the NED convention, all glued to the phone. +Imagine a physically mounted plastic frustum extending forward as well as two physically mounted coordinate frames, one in the ARKit convention and one in the NED convention, all glued to the phone. In this initial state in ARKit, the phone's rotation matrix as tracked by ARKit is equal to the identity. -Considering the ARKit coordinate frame, the tip of the z-axis lies at $(0,0,1)_\text{ARKit}$ in ARKit convention, facing backward and the jabbing the user into the eye. -Considering the NED coordinate frame, the tip of the x-axis extend forward $(1,0,0)_\text{ARKit}$. +Considering the ARKit coordinate frame, the tip of the z-axis lies at $(0,0,1)_\text{ARKit}$ in ARKit convention, facing backward and jabbing the user into the eye. +Considering the NED coordinate frame, the tip of the x-axis extends forward to $(1,0,0)_\text{NED}$ in NED convention.

                Sub-Step A: Obtaining the convention transform from source to target

                @@ -784,9 +783,9 @@

                Sub

                Sub-Step B: Understanding a rotation in the source coordinate system

                -

                Remember, in a world-aligned initial pose, the ARKit iPhone neutrally rests in landscape mode, screen facing the user, selfie camera on the left side of phone, and USB-C/Lightning port to the right.

                -

                Rotating the phone 90 degrees counter-clockwise, so that the phone is in upside-down portrait mode afterwards, is a rotation around the z-axis in positive direction. -Purely talking ARKit, the rotation from state 0 to state 1 is described by rotation matrix:

                +

                Remember, in a world-aligned initial pose, the ARKit iPhone neutrally rests in landscape mode, the screen facing the user, the selfie camera on the left side of the phone, and the USB-C/Lightning port to the right.

                +

Rotating the phone 90 degrees counter-clockwise, so that the phone is in upside-down portrait mode afterward, is a rotation around the z-axis in the positive direction. +Purely talking ARKit, the rotation from state 0 to state 1 is described by the following rotation matrix:

                $$ @@ -799,8 +798,8 @@

                Sub $$

                -

                Considering the NED frame, rotating the phone 90 degrees counter-clockwise, is a rotation around the x-axis by $-90$ degrees or $+270$ degrees. -Purely talking NED, the rotation from state 0 to state 1 is described by rotation matrix:

                +

                Considering the NED frame, rotating the phone 90 degrees counter-clockwise is a rotation around the x-axis by $-90$ degrees or $+270$ degrees. +Purely talking NED, the rotation from state 0 to state 1 is described by the rotation matrix:

                $$ @@ -814,8 +813,8 @@

                Sub

                However, this last rotation $R_{\text{NED}_0\text{-to-NED}_1}$, we do not have. -Instead, all we have in our ARKit-in-a-NED-3D-visualization-system is the hard-coded convention transform, and incoming ARKit rotation. -The question becomes: How do we get convert the ARKit rotation matrix into a NED rotation matrix?

                +Instead, all we have in our system is the hard-coded convention transform and incoming ARKit rotations. +The question becomes: How do we convert the ARKit rotation matrix into a NED rotation matrix?

                Sub-Step C: Computing $R_{\text{NED}_0\text{-to-NED}_1}$

                We can’t directly apply the incoming rotation matrix to the NED point, and we also cannot just pre-multiply the convention transform as we can with points.

Instead, the idea is to chain rotations as follows: converting from the target frame to the source frame first, applying the known rotation in the source system, and transforming back to the target system:

                @@ -885,12 +884,12 @@

                Sub-Step C: Computing $R_ assert np.all(R_ned_0_to_1 == R_arkit_to_ned @ R_arkit_0_to_1 @ R_arkit_to_ned.T)

                So, in summary, to convert an incoming rotation matrix, we need to pre-multiply with the convention transform, and post-multiply with the inverted convention transform.
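Spelled out in NumPy for the ARKit-to-NED example (matching the assert above; the matrices follow from Sub-Steps A and B):

import numpy as np

# Sub-Step A: convention transform ARKit (x right, y up, z backward) -> NED (x forward, y right, z down)
R_arkit_to_ned = np.array([[0., 0., -1.],
                           [1., 0., 0.],
                           [0., -1., 0.]])

# Sub-Step B: incoming ARKit rotation, +90 degrees about the ARKit z-axis
R_arkit_0_to_1 = np.array([[0., -1., 0.],
                           [1., 0., 0.],
                           [0., 0., 1.]])

# Sub-Step C: NED -> ARKit (transpose), rotate in ARKit, convert back to NED
R_ned_0_to_1 = R_arkit_to_ned @ R_arkit_0_to_1 @ R_arkit_to_ned.T

# equals the -90-degree (= +270-degree) rotation about the NED x-axis derived above
assert np.allclose(R_ned_0_to_1, np.array([[1., 0., 0.],
                                           [0., 0., 1.],
                                           [0., -1., 0.]]))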

                Converting incoming Euler angles

                -

                Euler angles, and their special cases of Tait-Byran angles are Davenport angles, are an annoyance in conversion due to two reasons:

                +

Euler angles, and their special cases of Tait-Bryan angles and Davenport angles, are an annoyance in conversion for two reasons:

                First, they introduce yet another convention. In order to interpret three given Euler angles $\alpha, \beta, \gamma$ around $x, y$ and $z$, we need to know if they have been applied intrinsically or extrinsically, and in which order. -For example, Unity uses Euler angle convention of extrinsic zxy while DJI (and aviation quite often) uses Euler convention of intrinsic yaw-pitch-roll, i.e., intrinsic zyx.

                +For example, Unity uses Euler angle convention of extrinsic zxy while DJI (and aviation quite often) uses the Euler convention of intrinsic yaw-pitch-roll, i.e., intrinsic zyx.

                However, even worse than introducing the need for yet another convention, Euler angles are discontinuous, i.e., a small change such as a single degree in one axis can make all rotation angles jump abruptly. -For DJI aircraft, the motion is physically so limited that we mostly don’t notice these discontinuities, but in camera motions more generally, Euler angles can often lead to unexpected results (https://danceswithcode.net/engineeringnotes/rotations_in_3d/rotations_in_3d_part1.html)..

+For DJI aircraft, the motion is physically so limited that we mostly don’t notice these discontinuities, but in camera motions more generally, Euler angles can often lead to unexpected results.

                Therefore, whenever receiving Euler angles, be sure to convert them to quaternions as soon as possible. To do so, one can intuitively use the axis-angle initialization for quaternions.

To find the correct rotation, first use the corresponding hand’s grip rule for the source axes x, y, and z, and simply look up what this rotation is called in the target system. @@ -923,9 +922,9 @@

                Converting incoming Euler angles

$$\text{quat}_\text{NED} = \text{quat}(\text{yaw}, 0, 1, 0) \times \text{quat}(\text{pitch}, 0, 0, -1) \times \text{quat}(\text{roll}, -1, 0, 0).$$
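As a first step, converting incoming Euler angles to a quaternion in the source convention can be done, e.g., with SciPy (a sketch with made-up angle values; uppercase axis letters mean intrinsic in SciPy):

from scipy.spatial.transform import Rotation

yaw, pitch, roll = 30., 10., -5.   # DJI-style intrinsic yaw-pitch-roll, i.e., intrinsic zyx
r = Rotation.from_euler("ZYX", [yaw, pitch, roll], degrees=True)

quat_xyzw = r.as_quat()            # SciPy returns quaternions as (x, y, z, w)
print(quat_xyzw)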

                Conclusion

                -

                As this post demonstrates, I have spent my fair share on transforms of all sorts and conventions. -I come to the conclusion that thinking through what actually goes on rather than randomly swapping signs and orders has proven more sustainable to me. -This blog post and my associated coordinate frame conversion tool (https://mkari.de/coord-converter/) hope to help doing so.

                +

As this post demonstrates, I have spent my fair share of time on transformations of all sorts and conventions. +I have come to the conclusion that thinking through what goes on rather than randomly swapping signs and orders has proven more sustainable to me. +This blog post and my associated coordinate frame conversion tool (https://mkari.de/coord-converter/) are meant to help with doing so.


                Typeset with Markdown Math in VSCode and with KaTeX in HTML.
