Add dynamic matrix#21

Open
Eleobert wants to merge 1 commit intogenbattle:masterfrom
Eleobert:dynamic-2d-array

@Eleobert Eleobert commented Feb 6, 2021

Instead of passing std::vector<std::array<T, N>>, I added a new class, as_matrix, that allows passing data whose size is defined at runtime. The class is only an interface, and all operations are read-only.

For example, if the user has the data stored in std::vector<std::pair<float, float>>, we can easily pass it without having to copy to a new container:

auto matrix = dkm::as_matrix(reinterpret_cast<float*>(data.data()), data.size(), 2, false);
auto res = dkm::kmeans_lloyd(matrix, k);

Or if the data is an armadillo matrix (Eigen, etc) it is even easier:

auto matrix = dkm::as_matrix(data.memptr(), data.n_rows, data.n_cols);
auto res = dkm::kmeans_lloyd(matrix, k);

I think this approach is much better than the current one. For now, I only added the class and changed dkm.hpp (since I don't know if this will get merged and this is the only relevant part for me). One performance issue we have is as_matrix::row returning by value, but this can be easily fixed.
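
To make the idea concrete, here is a minimal sketch of a non-owning, read-only view in the spirit of as_matrix. It is not the code from this PR: the name matrix_view and its members are assumptions for illustration, and only the constructor argument order (pointer, rows, columns, col_major) mirrors the calls above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for as_matrix: it never owns or modifies the data.
template <typename T>
class matrix_view {
public:
    matrix_view(const T* data, std::size_t n_rows, std::size_t n_cols, bool col_major = true)
        : data_(data), n_rows_(n_rows), n_cols_(n_cols), col_major_(col_major) {}

    std::size_t rows() const { return n_rows_; }
    std::size_t cols() const { return n_cols_; }

    // Element (i, j); the storage order decides which index moves fastest.
    const T& operator()(std::size_t i, std::size_t j) const {
        assert(i < n_rows_ && j < n_cols_);
        return col_major_ ? data_[j * n_rows_ + i] : data_[i * n_cols_ + j];
    }

private:
    const T* data_;
    std::size_t n_rows_, n_cols_;
    bool col_major_;
};
```

The same six floats read as a 3 x 2 matrix give different elements depending on the flag, which is why the std::vector<std::pair<float, float>> example passes false (one point per row) while the armadillo example relies on the column-major default.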

@genbattle
Owner

Hi, thanks for your contribution!

This looks similar to an idea I had for a custom matrix data structure. I'll have a more detailed look at it over this weekend.

@genbattle genbattle self-assigned this Feb 26, 2021
auto res = std::vector<T>(n_cols);
for(size_t j = 0; j < n_cols; j++)
{
res[j] = (*this)(i, j);
Owner

One of the reasons this library is so performant is that it avoids allocations and copies wherever possible. This is a full copy of each and every row in the input data, multiple times. It will have a significant effect on performance.

The way to do this performantly and safely would be to return a std::span which points to the internal data. Given this library currently only requires C++11, the solution is probably returning a pointer to the row or a custom struct which emulates the behavior of std::span.
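
A minimal sketch of the second option, assuming row-major storage. The names row_span and row_of are hypothetical and not part of the library; the struct imitates only the small part of std::span that is needed to read a row without copying it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical C++11 stand-in for std::span<const T>: a pointer plus a length.
template <typename T>
struct row_span {
    const T* ptr;
    std::size_t len;

    std::size_t size() const { return len; }
    const T& operator[](std::size_t j) const {
        assert(j < len);
        return ptr[j];
    }
    // begin/end make the struct usable in range-based for loops.
    const T* begin() const { return ptr; }
    const T* end() const { return ptr + len; }
};

// Row i of a row-major n_rows x n_cols buffer, with no allocation and no copy.
template <typename T>
row_span<T> row_of(const T* data, std::size_t n_rows, std::size_t n_cols, std::size_t i) {
    assert(i < n_rows);
    (void)n_rows;  // only used by the assertion above
    return row_span<T>{data + i * n_cols, n_cols};
}
```

The returned object is two words wide and is only valid while the underlying buffer is alive, which is the same lifetime contract std::span has.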

Author

I had been implementing a new version that takes this into account. The idea is that whenever the matrix is column major, we first copy it (because we cannot change the incoming data) and transpose the copy.

The copy would work like this:

auto [owner, data] = copy(matrix);

From here we can safely return the span. See that the matrix data structure never owns the data, so we need to return an owner (arguably unique_ptr) from the copy function.

I plan to implement this when my project reaches the optimization stage. Here is an implementation I was already working on: https://pastebin.com/DYQs2EBb
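
For illustration, a sketch of what such a copy function could look like in C++11. This is not the pastebin code: owned_rows and copy_to_row_major are hypothetical names, and the sketch assumes the input is a column-major buffer described by a pointer and two dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical result type: `owner` keeps the copy alive, `data` is what a view would read.
template <typename T>
struct owned_rows {
    std::unique_ptr<T[]> owner;
    const T* data;
};

// Copies a column-major n_rows x n_cols buffer into a newly allocated row-major buffer.
template <typename T>
owned_rows<T> copy_to_row_major(const T* src, std::size_t n_rows, std::size_t n_cols) {
    assert(src != nullptr || n_rows * n_cols == 0);
    std::unique_ptr<T[]> buf(new T[n_rows * n_cols]);
    for (std::size_t i = 0; i < n_rows; ++i)
        for (std::size_t j = 0; j < n_cols; ++j)
            buf[i * n_cols + j] = src[j * n_rows + i];
    const T* view = buf.get();  // stays valid after the move below
    return owned_rows<T>{std::move(buf), view};
}
```

Because owned_rows has only public members, a C++17 caller could unpack it with structured bindings in the same way as the copy(matrix) line above; a C++11 caller would keep the struct and read its two fields.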

Owner

Why can transposition not just produce a new as_matrix with the correct indexer if we need to switch to column major instead of row major?

There's no reason to have any solution here that includes any form of copying.

Author

@Eleobert Eleobert Mar 10, 2021

The transposition is performed only once right after the function is called. I didn't measure the performance yet but I think that the overhead is insignificant.

The idea with transposition is that with column major data the next element of a given row is at a distance of n_rows. It is more performant and easier to work with if we keep the elements at a distance of 1.

Owner

My argument was mainly that in this case the representation of the data doesn't need to change at all; we just need a different view over the data, which seems to be the whole point of as_matrix (an abstracted view over the data).
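
One way to picture "a different view over the same data" is a view that stores two strides, so that transposing only exchanges them. The sketch below is an illustration under that assumption; strided_view, row_major_view and col_major_view are hypothetical names, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical strided view: element (i, j) lives at data[i * row_stride + j * col_stride].
template <typename T>
struct strided_view {
    const T* data;
    std::size_t n_rows, n_cols;
    std::size_t row_stride, col_stride;

    const T& get(std::size_t i, std::size_t j) const {
        assert(i < n_rows && j < n_cols);
        return data[i * row_stride + j * col_stride];
    }

    // Same memory, rows and columns exchanged; nothing is copied.
    strided_view transposed() const {
        return strided_view{data, n_cols, n_rows, col_stride, row_stride};
    }
};

template <typename T>
strided_view<T> row_major_view(const T* data, std::size_t n_rows, std::size_t n_cols) {
    return strided_view<T>{data, n_rows, n_cols, n_cols, 1};
}

template <typename T>
strided_view<T> col_major_view(const T* data, std::size_t n_rows, std::size_t n_cols) {
    return strided_view<T>{data, n_rows, n_cols, 1, n_rows};
}
```

Transposing a column-major 3 x 2 view this way yields exactly the row-major 2 x 3 view of the same buffer. The cost is one extra multiplication per element access compared with a contiguous row.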

Author

I understand your point. My only concern is that with row major data we can get a row by simply returning std::span(ptr, ncols), but with column major data that is not possible without adding more complexity. To avoid copying, the alternative I can think of would be a sort of row_vec whose indexing would work as follows:

auto row_vec::operator()(size_t i)
{
    return data[i * n_rows + this->row_number];
}

In my experience, transposing the data (changing it from column major to row major) is simpler and faster, especially when the input data is large enough. But I am not completely sure that this is actually true. What is your opinion on this?
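
A fuller sketch of the row_vec idea, assuming column-major input. The member names and the helper functions col_major_row and distance_squared are hypothetical; the distance function is included only to show that an algorithm can consume such a row without knowing the layout.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical row view: consecutive elements of the row are `stride` apart in memory.
template <typename T>
struct row_vec {
    const T* first;
    std::size_t len;
    std::size_t stride;

    std::size_t size() const { return len; }
    const T& operator()(std::size_t j) const {
        assert(j < len);
        return first[j * stride];
    }
};

// Row i of a column-major n_rows x n_cols buffer: start at data[i], step by n_rows.
template <typename T>
row_vec<T> col_major_row(const T* data, std::size_t n_rows, std::size_t n_cols, std::size_t i) {
    assert(i < n_rows);
    return row_vec<T>{data + i, n_cols, n_rows};
}

// Squared Euclidean distance between two rows of equal length.
template <typename T>
T distance_squared(const row_vec<T>& a, const row_vec<T>& b) {
    assert(a.size() == b.size());
    T sum = T();
    for (std::size_t j = 0; j < a.size(); ++j) {
        const T d = a(j) - b(j);
        sum += d * d;
    }
    return sum;
}
```

With a stride of 1 the same struct describes a row of row-major data, so the extra complexity is confined to one multiplication per access. Whether that is slower in practice than a one-time transposed copy is the open question raised above and would need measuring.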

// This class is only an interface! Not designed to be used outside library internals.

template<typename T>
class as_matrix
Owner

I think it would be fine to just name this matrix.

Author

@Eleobert Eleobert Mar 1, 2021

The name comes from the fact that we are dealing with the data "as a matrix". But as_matrix is in fact not even a matrix (in the usual sense); it is only an interface between the algorithm and the data. I think the name as_matrix expresses the idea better.

public:
const size_t n_rows, n_cols;

as_matrix(const T *data, size_t n_rows, size_t n_cols, bool col_major = true)
Owner

It would be great to see a constructor for conversion from the existing vector<array<T, N>> type as well to ease migration for existing users.
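
A sketch of how such a conversion could avoid copying, written as a free function rather than a constructor to keep it short. The names flat_view and view_of are hypothetical, and the sketch assumes N > 0 and that std::array<T, N> carries no padding, which the static_assert checks at compile time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major view; it stands in for as_matrix in this sketch.
template <typename T>
struct flat_view {
    const T* data;
    std::size_t n_rows, n_cols;

    const T& get(std::size_t i, std::size_t j) const {
        assert(i < n_rows && j < n_cols);
        return data[i * n_cols + j];
    }
};

// View over the existing container type: each std::array<T, N> is one row.
template <typename T, std::size_t N>
flat_view<T> view_of(const std::vector<std::array<T, N>>& points) {
    // Assumption: std::array adds no padding, so rows sit back to back in memory.
    static_assert(sizeof(std::array<T, N>) == N * sizeof(T), "std::array<T, N> has padding");
    return flat_view<T>{points.empty() ? nullptr : points.front().data(), points.size(), N};
}
```

Treating the vector's storage as one flat array of T is common practice but goes slightly beyond what the standard strictly guarantees about pointer arithmetic across array elements, so this is a pragmatic sketch rather than a fully portable design.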

Author

@Eleobert Eleobert Mar 1, 2021

We can instead overload the main function and mark the old version as [[deprecated]]. What do you think?

Owner

@genbattle genbattle Mar 3, 2021

This approach is also acceptable, as long as [[deprecated]] will be ignored by older compilers.
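
Since [[deprecated]] was standardized in C++14 and the library targets C++11, one possible way to satisfy that condition is to hide the attribute behind a macro. The sketch below is an assumption about how this could be done; DKM_SKETCH_DEPRECATED and sum_values are hypothetical names, and sum_values merely stands in for the real entry point.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical macro: expands to the attribute only when the compiler reports C++14 or later.
#if defined(__cplusplus) && __cplusplus >= 201402L
#define DKM_SKETCH_DEPRECATED(msg) [[deprecated(msg)]]
#else
#define DKM_SKETCH_DEPRECATED(msg)
#endif

// New-style entry point (hypothetical): works on a raw row-major buffer.
inline float sum_values(const float* data, std::size_t n_rows, std::size_t n_cols) {
    assert(data != nullptr || n_rows * n_cols == 0);
    float sum = 0.0f;
    for (std::size_t k = 0; k < n_rows * n_cols; ++k) sum += data[k];
    return sum;
}

// Old-style overload kept for existing callers; it only forwards to the new one.
template <std::size_t N>
DKM_SKETCH_DEPRECATED("pass a pointer and dimensions instead")
float sum_values(const std::vector<std::array<float, N>>& points) {
    return sum_values(points.empty() ? nullptr : points.front().data(), points.size(), N);
}
```

Existing callers keep compiling and, on C++14 or later, get a warning that points them to the new overload. Note that MSVC reports __cplusplus as 199711L unless /Zc:__cplusplus is passed, so under this check the macro would expand to nothing there by default.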

{}

auto row(size_t i) const -> std::vector<T>;
auto operator()(size_t i, size_t j) const -> const T&
Owner

I would prefer this was just a named function like get.

Author

Eleobert commented Oct 8, 2021

Just to close this, I think mdspan is a better alternative.


g40 commented May 5, 2023

Hi, did this simply stall? Looks like a very useful addition.

@genbattle
Owner

> Hi, did this simply stall? Looks like a very useful addition.

Yes, this change stalled, and I haven't had time to implement/update it myself.
