Changed datasets/ad_hoc.py to work for n>3 #896
base: main
Conversation
train_size: int,
test_size: int,
n: int,
gap: int,
divisions: int = 0,
plot_data: bool = False,
one_hot: bool = True,
one_hot: bool = False,
In terms of the method signature, any changes should be done in a backwards-compatible way, i.e. code that already uses this should still work. To that end, the renaming of training_size is problematic, as is the change in the one_hot default - they should really stay as they are. We do/can change things in ways that break existing code, but such changes are normally done via a deprecation. Having said that, I have no idea whether the data set produced by this new algorithm, for values that were supported before, ends up the same or not. As for the additional parameter, divisions: to maintain compatibility it should go at the end.
I guess another way to do this would be to keep the original method there as-is and add a completely new method - not quite sure of the name, maybe ad_hoc_dataset, ad_hoc_data_sobol or something - that can be used instead. The current method could then be deprecated and removed later, leaving only the new one. The default values and order of the parameters in a new method would be a free choice, not constrained by compatibility.
As you can see from CI, there are failures due to these changes being incompatible with existing code - in this case the unit tests we have here, for example, which should still work/pass unchanged.
Okay, I'll fix the parameters to be backward compatible. Quick fix.
About whether it will produce the same values as before: it will match in shape and format, but for the same seed it won't produce the same values, because the sampling method is different.
So shall I fix the default parameters and order of arguments to support backward compatibility, and edit the unit tests to assert the new expected values?
About the deprecation of ad_hoc.py and the renaming of this new generator: I think we can do whatever you and the other reviewers suggest.
If we make this a separate dataset and don't have to stay backward compatible with the existing ad_hoc code, I can also explore adding hyperparameters such as:
- Linear vs. circular vs. full entanglement
- Labelling based on expected value vs. labelling based on measurement
Reasons being:
- When we say "separable by ZZFeatureMap", the most readily available example of a ZZFeatureMap is linearly entangled, while ad_hoc by default supports only full entanglement. Offering that option might make the inner workings of this generator more transparent.
- I suspect measurement-based labelling would be harder for a classical ML method to handle, while QML or VQA methods wouldn't have a problem with it. If that is the case, we could demonstrate that quantum methods have an advantage in performing better on this dataset. But I'm not sure; I'm very new to this field of study. Including this as a hyperparameter would leave the choice up to the user.
I made a commit with the parameter order and default values fixed. As you speculated, the unit tests still fail because the arrays generated by the new sampling technique are numerically different from the old ones. So I wanted to ask what you would suggest.
Shall we proceed with modifying the unit tests? Or shall we proceed with making this an independent dataset generator?
@edoaltamura Do you have a take on this as the primary maintainers now? I chipped in based on what I saw, and to me a new method with this functionality would be "better", since for the overlapping inputs that were supported before it ends up different from the existing one. Existing users' code would then not suddenly have a different outcome because the function returns different data than before - as we see here with the unit test failures. If we wanted, the existing method could be deprecated and removed later, giving users a chance to switch over to the new method and sort out whatever they need on their end with its usage and outcome. Also, the author has suggested they can look at further hyperparameters more flexibly this way, without being constrained by compatibility.
Thanks for your input @woodsp-ibm! I'll add my comments for @RishiNandha in a separate review in a moment.
TL;DR: I agree with @woodsp-ibm: I think both methods should be kept. But we could structure it nicely to make it obvious when to use NumPy random choice and when to use Sobol. See the other review.
Co-authored-by: Steve Wood <40241007+woodsp-ibm@users.noreply.github.com>
Finally, you can add a release note (reno), as described in the contributing guide, covering the extension of the dataset to higher n. This description will then be included in the v0.8.3 release notes.
# Uncomment to test
# print(ad_hoc_data(10,10,5,0.1,3))
labels = np.empty(n_samples, dtype=int)
cur = 0

while n_samples > 0:
Is there a particular reason why you're using a while loop instead of a for loop, defining samples outside the loop? Is it because Sobol and LHC can give a different count value every time, where count is not always a submultiple of n_samples?
I've read the other comments; I'll incorporate them soon. About the while loop: since we only want to accept samples with labels 1 or -1 (that is, |exp_val| > gap), each pass may accept fewer samples than the total number generated in that pass. So the while loop keeps generating as many data points as are still needed and terminates once we have n_samples accepted data points.
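A minimal sketch of that accept/reject loop, with `exp_val_fn` standing in for whatever computes the per-sample expectation values (the helper name, signature, and sampling range here are illustrative assumptions, not the PR's actual code):

```python
import numpy as np

def rejection_sample(n_samples, n, gap, exp_val_fn, rng):
    """Accumulate points whose |expectation value| exceeds `gap`, labelled by its sign."""
    xs = np.empty((n_samples, n))
    labels = np.empty(n_samples, dtype=int)
    cur = 0
    remaining = n_samples
    while remaining > 0:
        batch = rng.uniform(0, 2 * np.pi, size=(remaining, n))  # candidate feature vectors
        exp_val = exp_val_fn(batch)            # per-sample expectation values
        keep = np.abs(exp_val) > gap           # reject points falling inside the gap
        count = int(keep.sum())                # acceptance count varies from pass to pass
        xs[cur:cur + count] = batch[keep]
        labels[cur:cur + count] = np.sign(exp_val[keep]).astype(int)
        cur += count
        remaining -= count
    return xs, labels

# Toy usage with a stand-in expectation function:
rng = np.random.default_rng(42)
x, y = rejection_sample(20, 3, 0.3, lambda b: np.cos(b).prod(axis=1), rng)
```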
while n_samples > 0:
    # Stratified Sampling for x vector
    if divisions>0: x_vecs = _modified_LHC(n, n_samples, divisions)
In my view, one way to incorporate the back-compatible

basis = algorithm_globals.random.random(
    (2**n, 2**n)
) + 1j * algorithm_globals.random.random((2**n, 2**n))

would be to introduce another function, say _uniform_havlicek, that implements it and is triggered in an elif branch like this one. To make the current unit tests pass, you would then select this sampling type.
If you were to go ahead with this, I'd suggest introducing a function argument called e.g. sampling_method that accepts the Literal values "sobol", "hypercube", or "havlicek"; if the user selects "hypercube", then they must pass divisions too.
Then, you could raise an error in _uniform_havlicek if n > 3, specifying that this (original) method only works for 2 and 3, or you could try to extend it to arbitrary n in a way that reduces to the original form when n = 2, 3.
def _phi_ij(x_vecs: np.ndarray, i: int, j: int) -> np.ndarray:
    """Compute the φ_ij term for given dimensions.
Make sure you whitelist symbols like \phi and \Delta (used somewhere else in the code) in the dictionary used by Pylint: https://github.com/qiskit-community/qiskit-machine-learning/blob/main/.pylintdict.
This will avoid spell-check errors being raised when running the GH Actions workflows. You can find more info at https://github.com/qiskit-community/qiskit-machine-learning/blob/main/CONTRIBUTING.md.
The same goes for the code formatting with the commands pip install black; make black - this will reformat the code to match the repo standards.
Summary
This is related to #330
I have rewritten the ad_hoc dataset to work for n>3. I have also made some of the operations more efficient.
I ran "make lint" and "make style" as suggested in the contributing guidelines, but I got an error I was not able to resolve. This is my first contribution, if there are improvements to be made in the PR, please let me know.
I'm also working on bringing more dataset generators to the repository, so I hope ad_hoc.py can still be kept around as the simpler one in the family not that it's been rewritten
Details and comments
The tensor products such as H^{⊗n} were written as n-step loops; I've made them log(n)-step loops, along with other similar small optimizations such as precomputing Zi*Zj.
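For illustration, here is a generic exponentiation-by-squaring sketch of building an n-fold Kronecker power in about log2(n) kron calls (a sketch of the general idea, not the PR's exact code):

```python
import numpy as np

def kron_power(a: np.ndarray, n: int) -> np.ndarray:
    """Return a ⊗ a ⊗ ... ⊗ a (n factors) using O(log n) Kronecker products."""
    result = np.array([[1.0]])        # 1x1 identity element for the Kronecker product
    base = a
    while n > 0:
        if n & 1:                     # include the current power when this bit is set
            result = np.kron(result, base)
        n >>= 1
        if n:                         # square the base only while more bits remain
            base = np.kron(base, base)
    return result

# Example: H ⊗ H ⊗ H for three qubits
h = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
h_3 = kron_power(h, 3)                # shape (8, 8)
```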
The old implementation used an n-dimensional grid for generating random phase vectors. While this brought uniformity for small numbers of qubits, it carried too much computational overhead for larger numbers. I've swapped this out for two different options of stratified sampling.
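As one concrete option, quasi-random Sobol sampling over the phase hypercube can be done with SciPy's QMC module. This is a generic sketch of the technique (the 0 to 2π phase range is an assumption), not the PR's stratified `_modified_LHC` routine:

```python
import numpy as np
from scipy.stats import qmc

def sobol_phases(n: int, n_samples: int, seed: int = 42) -> np.ndarray:
    """Draw n_samples quasi-random points in [0, 2*pi)^n from a scrambled Sobol sequence."""
    sampler = qmc.Sobol(d=n, scramble=True, seed=seed)
    unit_points = sampler.random(n_samples)   # uniform points in the unit hypercube
    return 2 * np.pi * unit_points            # rescale to the phase range

x_vecs = sobol_phases(n=5, n_samples=128)     # works for any n, unlike a full n-D grid
```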
The old implementation did labelling with parity() for n=2 and majority() for n=3. I've used an n-qubit Z matrix directly, which aligns better with the reference paper cited in the documentation and also removes the constraint of n being 2 or 3.
I've also vectorized the labelling loop, which has given some performance gain.
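A rough sketch of vectorized labelling against an n-qubit Z observable, assuming the prepared states are stored as rows of an (n_samples, 2^n) array (the storage layout and function name are illustrative assumptions):

```python
import numpy as np

def z_labels(states: np.ndarray, gap: float) -> np.ndarray:
    """Label each state by the sign of <psi|Z⊗...⊗Z|psi>; points inside the gap get 0."""
    dim = states.shape[1]
    # Z⊗...⊗Z is diagonal: entry k is (-1)**popcount(k), so no 2^n x 2^n matrix is built
    diag = np.array([(-1) ** bin(k).count("1") for k in range(dim)])
    exp_vals = np.einsum("si,i,si->s", states.conj(), diag, states).real
    return np.where(exp_vals > gap, 1, np.where(exp_vals < -gap, -1, 0))
```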