Changed datasets/ad_hoc.py to work for n>3 #896

Open · wants to merge 7 commits into base: main
Conversation

RishiNandha

Summary

This is related to #330.

I have rewritten the ad_hoc dataset to work for n>3. I have also made some of the operations more efficient.

I ran "make lint" and "make style" as suggested in the contributing guidelines, but I got an error I was not able to resolve. This is my first contribution, so if there are improvements to be made in the PR, please let me know.

I'm also working on bringing more dataset generators to the repository, so I hope ad_hoc.py can still be kept around as the simpler one in the family now that it's been rewritten.

Details and comments

  1. Tensor products such as H^{⊗n} were written as n-step loops; I've made them log(n)-step loops via repeated squaring, along with other small optimizations such as precomputing the Z_i Z_j terms. A sketch of these optimizations follows this list.

  2. The old implementation used an n-dimensional grid to generate the random phase vectors. While this gave uniformity for small numbers of qubits, it was too much computational overhead for larger ones. I've swapped it out for two different options of stratified sampling.

  3. The old implementation did the labelling with parity() for n=2 and majority() for n=3. I've used an n-qubit Z observable directly, which aligns better with the reference paper cited in the documentation and also removes the constraint of n being 2 or 3.

  4. Vectorized the labelling loop, which has given some performance gain.
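
A minimal, self-contained sketch of the kind of optimizations mentioned in points 1, 3 and 4 (repeated-squaring Kronecker products, an n-qubit Z diagonal, and vectorized labelling). The helper names and shapes here are illustrative only, not the PR's actual code:

import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def kron_power(op: np.ndarray, n: int) -> np.ndarray:
    """Build op^{⊗n} in O(log n) Kronecker steps via repeated squaring."""
    result = np.array([[1.0]])
    power = op
    while n > 0:
        if n & 1:
            result = np.kron(result, power)
        power = np.kron(power, power)
        n >>= 1
    return result

def z_diag(n: int) -> np.ndarray:
    """Diagonal of the n-qubit Z^{⊗n} observable: +1/-1 by parity of each basis state."""
    bits = np.arange(2**n)
    ones = np.zeros(2**n, dtype=int)
    for k in range(n):
        ones += (bits >> k) & 1
    return (-1) ** ones

def labels_from_states(states: np.ndarray, n: int, gap: float) -> np.ndarray:
    """Vectorized labelling: states has shape (n_samples, 2**n); the label is the
    sign of <psi|Z^{⊗n}|psi>, with 0 marking samples inside the gap (rejected)."""
    exp_vals = np.einsum("ij,j,ij->i", states.conj(), z_diag(n), states).real
    return np.where(exp_vals > gap, 1, np.where(exp_vals < -gap, -1, 0))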


CLAassistant commented Mar 1, 2025

CLA assistant check
All committers have signed the CLA.

Comment on lines 30 to 36
train_size: int,
test_size: int,
n: int,
gap: int,
divisions: int = 0,
plot_data: bool = False,
one_hot: bool = True,
one_hot: bool = False,
Member

In terms of the method signature, any changes should be done in a backwards-compatible way, i.e. if you already have code that uses this it should still work. To that end the change in name of training_size is problematic, as is the change in the one_hot default - they should really stay as they are. We do/can change things so they would break existing code, but such changes are normally done via a deprecation. Having said that, I have no idea if the data set produced by this new method, for values that were supported before, ends up the same or not with this new algorithm. As for the additional parameter, divisions: to maintain compatibility it should go at the end.

I guess another way to do this would be to keep the original method there as-is and add a completely new method - not quite sure of the name, maybe ad_hoc_dataset, ad_hoc_data_sobol or something - that can be used instead. The current method could then be deprecated and removed later, leaving only the new one. The default values and order of the parameters in a new method would be a free choice, not constrained by compatibility.
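
To make that deprecation route concrete, here is a hedged sketch of how the existing entry point could keep its signature and simply forward to a new generator while warning users. The new function's name (ad_hoc_dataset) and both signatures are simplified, hypothetical placeholders, not an agreed API:

import warnings

def ad_hoc_dataset(*, train_size, test_size, n, gap, divisions=0, plot_data=False, one_hot=True):
    """Hypothetical new generator discussed in this thread (placeholder only)."""
    raise NotImplementedError

def ad_hoc_data(training_size, test_size, n, gap, plot_data=False, one_hot=True):
    """Original entry point, kept as-is; warns and forwards to the new generator."""
    warnings.warn(
        "ad_hoc_data is deprecated; use ad_hoc_dataset instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return ad_hoc_dataset(
        train_size=training_size,
        test_size=test_size,
        n=n,
        gap=gap,
        plot_data=plot_data,
        one_hot=one_hot,
    )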

Member

As you can see from CI, there are failures due to these changes being incompatible with existing code - in this case the unit tests we have here, for example, which should still work/pass unchanged.

Author

Okay, I'll fix the parameters to be backward compatible. Quick fix.

As for whether it will produce the same values as before: it will match in shape and format, but for the same seed the values won't be identical, because the sampling method is different.

So shall I fix the default parameters and order of arguments to support backward compatibility and edit the unit tests to assert the new expected values?

Author

About the deprecation of ad_hoc.py and the renaming of this new generator: I think we can do whatever you and the other reviewers would suggest.

If we make this a separate dataset, without the constraint of working backwards with existing ad_hoc code, I can also further explore including hyperparameters such as:

  1. Linear vs Circular vs Full Entanglements
  2. Labelling based on Expected Value vs Labelling based on Measurement

Reasons being:

  1. When we say "separable by ZZFeatureMap", the most readily available example of ZZFeatureMap is linearly entangled, while ad_hoc by default supports only full entanglement. So giving that option might make the inner workings of this generator more transparent (a sketch of how such an option could look follows at the end of this comment).

  2. I'm thinking labelling based on measurement would make it harder for a classical ML method to work well, while QML or VQA methods won't have a problem. If that is the case, we can demonstrate that quantum methods have an advantage by performing better on this dataset. But I'm not sure; I'm very new to this field of study. Either way, including this as one of the hyperparameters can leave the choice up to the user.
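
As a rough illustration of the entanglement option in point 1, a hypothetical entanglement argument could simply control which qubit pairs receive the Z_i Z_j phase terms. This is a sketch under that assumption, not code from this PR:

from itertools import combinations

def entangling_pairs(n: int, entanglement: str = "full") -> list[tuple[int, int]]:
    """Qubit pairs that get a Z_i Z_j phase, for a given entanglement pattern."""
    if entanglement == "linear":
        return [(i, i + 1) for i in range(n - 1)]
    if entanglement == "circular":
        pairs = [(i, i + 1) for i in range(n - 1)]
        if n > 2:
            pairs.append((n - 1, 0))  # close the ring only when it adds a new pair
        return pairs
    if entanglement == "full":
        return list(combinations(range(n), 2))
    raise ValueError(f"Unknown entanglement pattern: {entanglement}")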

Author

I made a commit with the parameter order and default values fixed. As you speculated, the unit tests still fail because the arrays generated by the new sampling technique are numerically different from the old ones. So I wanted to ask what you would suggest.

Shall we proceed with modifying the unit tests? Or shall we proceed with making this an independent dataset generator?

Member

@edoaltamura Do you guys have a take on this as the prime maintainers now? I chipped in based on what I saw, and for me a new method with this functionality would be "better", since it ends up different from the existing one for the overlapping compatible inputs. Existing users' code would not suddenly end up with a different outcome due to it giving back different data than before - as we see here with the unit test failures. If we wanted, the existing method could be deprecated and removed later, giving users a chance to switch over to the new method and sort out whatever they need on their end in using it and the outcome. Also, the author has suggested they can look at further hyperparameters more flexibly this way without being constrained.

Collaborator

Thanks for your input @woodsp-ibm! I'll add my comments for @RishiNandha in a separate review in a moment.

Collaborator

TLDR; I agree with @woodsp-ibm: I think both methods should be kept, but we could structure it nicely to make it obvious when to use NumPy random choice and when Sobol. See the other review.

edoaltamura and others added 2 commits March 4, 2025 00:41
Co-authored-by: Steve Wood <40241007+woodsp-ibm@users.noreply.github.com>
@edoaltamura (Collaborator) left a comment

Finally, you can add a release note (reno), as described in the contributing guide, describing the extension of the dataset to higher n. This description will then be included in the v0.8.3 release notes.

Comment on lines +359 to +360
# Uncomment to test
# print(ad_hoc_data(10,10,5,0.1,3))
Collaborator

Suggested change
# Uncomment to test
# print(ad_hoc_data(10,10,5,0.1,3))

labels = np.empty(n_samples, dtype=int)
cur = 0

while n_samples > 0:
Collaborator

Is there a particular reason why you're using a while loop instead of a for loop, defining samples outside the loop? Is it because Sobol and LHC can give different count values every time, where count is not always a submultiple of n_samples?

RishiNandha (Author) commented Mar 5, 2025

I've read the other comments and I'll incorporate them soon. About the while loop: since we only want to accept samples with labels 1 or -1 (that is, |exp_val| > gap), each pass may accept fewer samples than the total number generated in that pass. So the while loop keeps generating as many datapoints as are still needed and terminates once we have n_samples accepted datapoints (see the sketch below).
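
For readers following along, a bare-bones sketch of that accept/reject loop; the expectation-value helper is a toy placeholder here, not the PR's implementation:

import numpy as np

def _toy_expectation_values(x_vecs: np.ndarray) -> np.ndarray:
    """Toy stand-in for the <Z...Z> expectation computed from the sampled x vectors."""
    return np.cos(x_vecs).prod(axis=1)

def sample_until_filled(n: int, n_samples: int, gap: float, rng: np.random.Generator):
    """Keep drawing batches until n_samples points clear the gap and get a ±1 label."""
    xs, labels = [], []
    remaining = n_samples
    while remaining > 0:
        # Draw only as many candidates as are still missing.
        x_vecs = rng.uniform(0, 2 * np.pi, size=(remaining, n))
        exp_vals = _toy_expectation_values(x_vecs)
        keep = np.abs(exp_vals) > gap  # reject points that fall inside the gap
        xs.append(x_vecs[keep])
        labels.append(np.sign(exp_vals[keep]).astype(int))
        remaining -= int(keep.sum())
    return np.concatenate(xs), np.concatenate(labels)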


while n_samples > 0:
# Stratified Sampling for x vector
if divisions>0: x_vecs = _modified_LHC(n, n_samples, divisions)
Collaborator

In my view one way to incorporate the back-compatible

basis = algorithm_globals.random.random(
        (2**n, 2**n)
    ) + 1j * algorithm_globals.random.random((2**n, 2**n))

would be to introduce another function, say _uniform_havlicek, that implements it and is triggered in an elif branch like this. To make the current unit tests pass, you'd then select this sampling type.

If you were to go ahead with this, then I'd suggest introducing a function argument called e.g. sampling_method, that accepts the Literal values "sobol", "hypercube", or "havlicek"; if the user selects "hypercube", then they must pass divisions too.

Then, you could raise an error in _uniform_havlicek if n > 3, specifying that this (original) method only works for n = 2 and 3, or you could try to extend it to arbitrary n in a way that reduces to the original form when n = 2, 3. A sketch of this dispatch follows below.
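
A rough sketch of that dispatch, assuming the helpers _modified_LHC (from this PR) and _uniform_havlicek (suggested above) exist elsewhere in the module; the stubs below are placeholders only to keep the snippet self-contained:

from typing import Literal, Optional
import numpy as np

def _modified_LHC(n: int, n_samples: int, divisions: int) -> np.ndarray:
    """Placeholder for the stratified Latin-hypercube sampler added in this PR."""
    raise NotImplementedError

def _uniform_havlicek(n: int, n_samples: int) -> np.ndarray:
    """Placeholder for the original, back-compatible sampling."""
    raise NotImplementedError

def _sample_x_vecs(
    n: int,
    n_samples: int,
    sampling_method: Literal["sobol", "hypercube", "havlicek"] = "sobol",
    divisions: int = 0,
    rng: Optional[np.random.Generator] = None,
) -> np.ndarray:
    rng = rng if rng is not None else np.random.default_rng()
    if sampling_method == "sobol":
        from scipy.stats import qmc
        return 2 * np.pi * qmc.Sobol(d=n, seed=rng).random(n_samples)
    if sampling_method == "hypercube":
        if divisions <= 0:
            raise ValueError("'hypercube' sampling requires divisions > 0.")
        return _modified_LHC(n, n_samples, divisions)
    if sampling_method == "havlicek":
        if n > 3:
            raise ValueError("The original (Havlicek) sampling only supports n = 2 or 3.")
        return _uniform_havlicek(n, n_samples)
    raise ValueError(f"Unknown sampling_method: {sampling_method}")

The error in the "hypercube" branch mirrors the requirement mentioned above that divisions must accompany that choice.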



def _phi_ij(x_vecs: np.ndarray, i: int, j: int) -> np.ndarray:
"""Compute the φ_ij term for given dimensions.
Collaborator

Make sure you whitelist symbols like \phi and \Delta (somewhere else) in the dictionary used by Pylint: https://github.com/qiskit-community/qiskit-machine-learning/blob/main/.pylintdict.

This will avoid spell-check errors being raised when running the GH Actions workflows. You can find more info at https://github.com/qiskit-community/qiskit-machine-learning/blob/main/CONTRIBUTING.md.
The same goes for the code formatting with the commands pip install black; make black - this will reformat the code to match the repo standards.
