diff --git a/docs/Scaling of numerical variables.ipynb b/docs/Scaling of numerical variables.ipynb index a8a912e..86e94e9 100644 --- a/docs/Scaling of numerical variables.ipynb +++ b/docs/Scaling of numerical variables.ipynb @@ -1,764 +1,764 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "0c41daba-6432-4378-98e6-54cde5b9f03b", - "metadata": {}, - "source": [ - "### Background notes" - ] - }, - { - "cell_type": "markdown", - "id": "d5ca10d1-632d-4e50-ba45-4e4ffba47575", - "metadata": {}, - "source": [ - "In `exhibit` you can generate numerical values either from a uniform random distribution or from a normal distribution. These values can then be coerced to either floats or integers (as a hack to get discrete values like 0 and 1). It's unlikely you'd want to use generated values as is, so we apply linear scaling before returning the values to the user.\n", - "\n", - "To draw from a uniform distribution all we need is a starting value and the optional dispersion (noise) percentage to shift the final value around. This uniform value is still affected by feature weights, but with dispersion set to 0 you can get consistent values for the same row values. \n", - "\n", - "For normal distribution, you'd need the mean and the standard deviation. The two statistics, however, will change by the time you finish generating the dataset because for each row, the weights will affect the mean. Thus, you can use them as initial values, but will have to rescale at the end if you want to keep the original values. 
If the mean / std are commented out in the spec (user intending to scale to range, for example), then we'll use default mean of 1 (having it as zero will negate mean shifting by weights) and standard deviation of 1.\n", - "\n", - "Additionally, if the generated distribution has negative values, we will shift the whole set to be in the positive territory if the scaling is set to `target_sum`; other scaling options can be used with negative ranges / statistics.\n", - "\n", - "The scaling options are:\n", - "\n", - " - target range\n", - " - target statistic (`sum`, `mean`, `std`)\n", - " \n", - "The goal of the scaling options is to match user expectations and to preseve the shape of the data. Unfortunately, you can't have all matching statistics so by choosing to align to a certain metric, you leave others free-floating. This is so that we don't lose information from the weights and preseve the shape of the data.\n", - "\n", - "| scaling | min / max | sum | mean | standard deviation | weights |\n", - "| :- | :- | :- | :- | :- | :- |\n", - "| target range | preserved | free-floating | free-floating | free-floating | __*__ |\n", - "| target statistic (sum) | free-floating | preserved | free-floating | free-floating | preserved |\n", - "| target statistic (mean + std) | free-floating | free-floating | preserved | preserved | preserved |\n", - "\n", - "__\\*__ _if the ratio between target_min and target_max doesn't match the generated min and max, either the weights will change\n", - "or one of the target ranges will need to be adjusted. For example, assume you have 4 values influencing the weights - `0.05, 0.15, 0.3, 0.4`. You want to scale your generated data to between 10 and 150. Without scaling, the generated data can have values like `25, 75, 150, 200`. Now, the ratio `0.3 / 0.15` is the same as `150 / 75`. However, if we scale to between 10 and 150, we get `10, 50, 110, 150`. 
And here, we're dealing with intervals - `0.1 = 40` so values with weights 0.15 and 0.3 are separated by 60. Although this is not particularly intuitive, it's still preferable over adjusting the `target_max` like so `target_max = (target_min * generated.max()) / generated.min()` because often you have to have fixed min-max ranges and also adjusting the maximum in this way can lead to unexpected results, like pushing the maximum into the negative territory._\n", - "\n", - "When generating the specification from the original data, we populate all potential fields: `min`, `max`, `sum`, etc. Rather than creating separate functions for different statistics, we'll try to make sense of what we're given and issue a warning if an impossible or conflicting situation is given.\n", - "\n", - "For example:\n", - "\n", - "| min | max | sum | mean | standard deviation | outcome |\n", - "| :- | :- | :- | :- | :- |:- |\n", - "| given | missing | missing | missing | missing | derive the missing end of the range\n", - "| given | given | missing | missing | missing | valid\n", - "| given | given | given | given | given | warning - use sum as default\n", - "| missing | missing | given | given | given | warning - use sum as default\n", - "| missing | missing | missing | given | given | scale to both\n" - ] - }, - { - "cell_type": "markdown", - "id": "5c1ec07e-d3df-4bd3-a3d8-b7cee08294c0", - "metadata": {}, - "source": [ - "#### Scaling functions\n", - "Scaling floats works as expected. However, as soon as you ask for discrete values, we're at the mercy of rounding which is OK in the large registers, but will produce highly imbalanced datasets when values are low or when standard deviation is small. We try to compensate for it in both scaling to range and scaling to target sum, but if we only have mean / standard deviation to work with, the results will likely be quite imprecise." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "c5386e8b-ca06-46f4-b55f-d42501f5917f", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "cf4670b6-12dd-4a23-b1bd-61557ca96084", - "metadata": {}, - "outputs": [], - "source": [ - "def scale_to_mean_std(array, target_mean=100, target_std=20, discrete=False):\n", - " \n", - " arr = np.array(array)\n", - " \n", - " result = target_mean + (arr - arr.mean()) * target_std / arr.std()\n", - " \n", - " if discrete:\n", - " result = result.round()\n", - " \n", - " return result" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "fa835db8-eeed-4eec-9a07-a143098151f4", - "metadata": {}, - "outputs": [], - "source": [ - "def scale_to_range(array, target_min=None, target_max=None, discrete=False):\n", - " \n", - " X = np.array(array)\n", - " \n", - " # adjust for potential negative signs!\n", - " if not target_min:\n", - "\n", - " target_min = target_max - (abs(target_max) - abs(target_max * X.min() / X.max()))\n", - " \n", - " if not target_max:\n", - "\n", - " target_max = target_min + abs(target_min * X.max() / X.min()) - abs(target_min)\n", - "\n", - " if discrete:\n", - "\n", - " target_range = int(np.ceil(target_max) - np.floor(target_min))\n", - " bins = np.linspace(X.min(), X.max(), target_range + 2)\n", - " labels = np.arange(np.floor(target_min), np.ceil(target_max) + 1)\n", - "\n", - " result = pd.cut(X, bins=bins, right=True, include_lowest=True, labels=labels).to_numpy()\n", - " \n", - " return result\n", - " \n", - " result = (X - X.min()) / (X.max() - X.min()) * (target_max - target_min) + target_min\n", - "\n", - " return result" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "4d0823e7-a911-402e-b458-f2fecf879df0", - "metadata": {}, - "outputs": [], - "source": [ - "def scale_to_sum(array, 
target_sum=200, discrete=False):\n", - " \n", - " series = pd.Series(array)\n", - " \n", - " if any(series < 0):\n", - " series = series + abs(series.min())\n", - " \n", - " scaling_factor = target_sum / series.dropna().sum()\n", - " scaled_series = series * scaling_factor\n", - " \n", - " if discrete:\n", - " \n", - " row_diff = (target_sum - series.dropna().sum()) / len(series.dropna())\n", - " values = pd.Series(\n", - " np.where(\n", - " series + row_diff >= 0,\n", - " series + row_diff,\n", - " np.where(pd.isnull(series), np.NaN, 0)\n", - " )\n", - " )\n", - " \n", - " #how many rows will need to be rounded up to get to target\n", - " boundary = int(target_sum - np.floor(values).sum())\n", - "\n", - " #because values are limited at the lower end at zero, sometimes it's not possible\n", - " #to adjust them to a lower target_sum; we floor them and return\n", - " if boundary < 0:\n", - " return pd.Series(np.floor(values)).to_numpy()\n", - "\n", - " #if series has NAs, then the calcualtion will be off\n", - " clean_values = values.dropna() #keep original index\n", - "\n", - " #np.ceil and floor return Series so index is preserved\n", - " values.update(np.maximum(np.ceil(clean_values.iloc[0:boundary]), 1))\n", - " values.update(np.floor(clean_values.iloc[boundary:]))\n", - "\n", - " #return a series of ints or cast to float if there are any NAs\n", - " #see https://github.com/pandas-dev/pandas/issues/29618\n", - " #before migrating to the new Pandas null-aware int dtype\n", - " scaled_series = values if values.isna().any() else values.astype(int)\n", - " \n", - " return scaled_series" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "176afd13-c1a2-42ca-818b-fe36fb80b31a", - "metadata": {}, - "outputs": [], - "source": [ - "def transform_datasets(datasets, func, **kwargs):\n", - "\n", - " new_datasets = {}\n", - "\n", - " for i, dataset in datasets.items():\n", - " new_datasets[i] = (\n", - " func(dataset[0], **kwargs),\n", - " func(dataset[1], 
**kwargs)\n", - " )\n", - " \n", - " return new_datasets" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "1a0ca1a9-2f58-4d62-8d28-a5ee7715bce0", - "metadata": {}, - "outputs": [], - "source": [ - "def plot_original_quartet(datasets):\n", - " '''\n", - " From Matplotlib gallery: https://matplotlib.org/stable/gallery/specialty_plots/anscombe.html\n", - " '''\n", - "\n", - " fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 6),\n", - " gridspec_kw={'wspace': 0.08, 'hspace': 0.08})\n", - " \n", - " for ax, (label, (x, y)) in zip(axs.flat, datasets.items()):\n", - " ax.text(0.1, 0.9, label, fontsize=20, transform=ax.transAxes, va='top')\n", - " ax.tick_params(direction='in', top=True, right=True)\n", - " ax.plot(x, y, 'o')\n", - "\n", - "\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "6dc888d7-a2c1-4319-91d3-6aca0a966203", - "metadata": {}, - "outputs": [], - "source": [ - "def plot_scaled_quartets(float_dataset, discrete_dataset):\n", - "\n", - " fig, axs = plt.subplots(2, 4, sharex=True, sharey=True, figsize=(12, 6),\n", - " gridspec_kw={'wspace': 0.08, 'hspace': 0.08})\n", - " \n", - " colors = [\"firebrick\", \"teal\"]\n", - " \n", - " for i, j in enumerate(range(0, 4, 2)):\n", - " \n", - " if i == 0:\n", - " dataset = float_dataset\n", - " else: \n", - " dataset = discrete_dataset\n", - " \n", - " for ax, (label, (x, y)) in zip(axs[:, j:j+2].flat, dataset.items()):\n", - " ax.text(0.1, 0.9, label, fontsize=20, transform=ax.transAxes, va='top')\n", - " ax.tick_params(direction='in', top=True, right=True)\n", - " ax.plot(x, y, 'o', c=colors[i])\n", - " \n", - " axs[0, 0].set_title(\"Float\", loc=\"left\") \n", - " axs[0, 2].set_title(\"Discrete\", loc=\"left\") \n", - "\n", - " plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "b1b057d4-852e-4ba6-95ae-bcf193fdbec2", - "metadata": {}, - "source": [ - "#### Original Anscombe's Quartet datasets" - ] - }, - { - "cell_type": "code", 
- "execution_count": 8, - "id": "8f95cb45-ccad-4576-87de-02b860bdff6b", - "metadata": {}, - "outputs": [], - "source": [ - "x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]\n", - "y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]\n", - "y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]\n", - "y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]\n", - "x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]\n", - "y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]\n", - "\n", - "datasets = {\n", - " 'I': (x, y1),\n", - " 'II': (x, y2),\n", - " 'III': (x, y3),\n", - " 'IV': (x4, y4)\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "67dc8645-5d87-4ab5-8965-43d6ff63a892", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plot_original_quartet(datasets)" - ] - }, - { - "cell_type": "markdown", - "id": "e1172759-dcfc-4081-9163-f50f37665c4e", - "metadata": {}, - "source": [ - "#### Scaling to a given mean and standard deviation\n", - "No difference between float and discrete option" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "bf8b7c19-4739-4071-a465-9b3b88aa6039", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "mean_std_datasets_float = transform_datasets(datasets, scale_to_mean_std, discrete=False)\n", - "mean_std_datasets_discrete = transform_datasets(datasets, scale_to_mean_std, discrete=True)\n", - "\n", - "plot_scaled_quartets(mean_std_datasets_float, mean_std_datasets_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "3e80a982-14d8-4c78-8ea2-4554363383f1", - "metadata": {}, - "source": [ - "#### Scaling to given range\n", - "No difference between float and discrete option" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "47edeedd-c098-46dc-95e3-724242d43973", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "range_datasets_float = transform_datasets(datasets, scale_to_range, discrete=False, target_min=-20, target_max=200)\n", - "range_datasets_discrete = transform_datasets(datasets, scale_to_range, discrete=True, target_min=-20, target_max=200)\n", - "\n", - "plot_scaled_quartets(range_datasets_float, range_datasets_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "ecc30b65-54ed-409a-932e-4d605ac95732", - "metadata": {}, - "source": [ - "#### Scaling to given target sum\n", - "Discrete will change the shape of the distribution somewhat - particularly if the values are small" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "698268bc-e5fc-4a4d-86e2-220ff4f276a5", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "target_sum_datasets_float = transform_datasets(datasets, scale_to_sum, discrete=False)\n", - "target_sum_datasets_discrete = transform_datasets(datasets, scale_to_sum, discrete=True)\n", - "\n", - "plot_scaled_quartets(target_sum_datasets_float, target_sum_datasets_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "d7391d20-fb6e-4057-8e94-238a4a46be18", - "metadata": {}, - "source": [ - "#### Multivariate distribution\n", - "These examples are closer to the real use case where the final distribution is made up from a number of normal distributions with a varying mean.\n", - "\n", - "Code adapted from this SO [answer](https://stackoverflow.com/questions/47759577/creating-a-mixture-of-probability-distributions-for-sampling)." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "a6cb5d56-49d4-44fe-9bf7-eb4ed479f979", - "metadata": {}, - "outputs": [], - "source": [ - "def plot_distribution(float_dataset, discrete_dataset,):\n", - "\n", - " fig, axs = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12, 6),\n", - " gridspec_kw={'wspace': 0.08, 'hspace': 0.08})\n", - " \n", - " colors = [\"firebrick\", \"teal\"]\n", - " \n", - " for i in range(2):\n", - " \n", - " if i == 0:\n", - " dataset = float_dataset\n", - " else: \n", - " dataset = discrete_dataset\n", - " \n", - " axs[i].hist(dataset, bins=100, density=False, color=colors[i], alpha=0.5)\n", - " \n", - " \n", - " axs[0].set_title(\"Float\", loc=\"left\") \n", - " axs[1].set_title(\"Discrete\", loc=\"left\") \n", - "\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "7a91216a-b05d-4d74-937d-4ba9a8842455", - "metadata": {}, - "outputs": [], - "source": [ - "distributions = [\n", - " {\"type\": np.random.normal, \"kwargs\": {\"loc\": 20, \"scale\": 2}},\n", - " {\"type\": np.random.normal, \"kwargs\": {\"loc\": 10, \"scale\": 
3}},\n", - " {\"type\": np.random.normal, \"kwargs\": {\"loc\": 5, \"scale\": 1}},\n", - "]\n", - "coefficients = np.array([0.5, 0.2, 0.3])\n", - "coefficients /= coefficients.sum()\n", - "sample_size = 100000\n", - "\n", - "num_distr = len(distributions)\n", - "data = np.zeros((sample_size, num_distr))\n", - "for idx, distr in enumerate(distributions):\n", - " data[:, idx] = distr[\"type\"](size=(sample_size,), **distr[\"kwargs\"])\n", - "random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)\n", - "sample = data[np.arange(sample_size), random_idx]" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "4d7e667b-c281-44a1-a30a-1d87679b02fa", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plt.hist(sample, bins=100)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "1e759c95-33bc-43d6-9188-4929184936e9", - "metadata": {}, - "source": [ - "#### Scale to positive mean and standard distribution" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "257c5e10-2172-452f-b28c-7d681e227e6a", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "mean_std_float = scale_to_mean_std(sample, discrete=False)\n", - "mean_std_discrete = scale_to_mean_std(sample, discrete=True)\n", - "\n", - "plot_distribution(mean_std_float, mean_std_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "621bc30a-880c-4099-9c31-97fafff4ca52", - "metadata": {}, - "source": [ - "#### Scale to negative mean and standard distribution" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "2bb24e13-9c77-48d2-a749-3b2993b8f73a", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "mean_std_float = scale_to_mean_std(sample, target_mean=-20, discrete=False)\n", - "mean_std_discrete = scale_to_mean_std(sample, target_mean=-20, discrete=True)\n", - "\n", - "plot_distribution(mean_std_float, mean_std_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "e1b6f2bb-0bb3-4839-aee5-07d966924deb", - "metadata": {}, - "source": [ - "#### Scale to a given range\n", - "Both ranges are given" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "bcbd655e-6930-40cd-bcc3-2f0b26d42abb", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "range_float = scale_to_range(sample, discrete=False, target_min=-100, target_max=200)\n", - "range_discrete = scale_to_range(sample, discrete=True, target_min=-100, target_max=200)\n", - "\n", - "plot_distribution(range_float, range_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "8ea5f81c-d72c-4924-b031-a0d026fac47f", - "metadata": {}, - "source": [ - "##### Special case if only one of the ranges is given (negative min)" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "cc6c5ba5-1879-46d6-8366-48faeb237f90", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "range_float = scale_to_range(sample, discrete=False, target_min=-100)\n", - "range_discrete = scale_to_range(sample, discrete=True, target_min=-100)\n", - "\n", - "plot_distribution(range_float, range_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "15f0c099-9ca9-49ee-8535-9dcb2935e027", - "metadata": {}, - "source": [ - "##### Special case if only one of the ranges is given (small max)" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "d16d1590-0c9b-4b82-8196-770d0e30a238", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "range_float = scale_to_range(sample, discrete=False, target_max=10)\n", - "range_discrete = scale_to_range(sample, discrete=True, target_max=10)\n", - "\n", - "plot_distribution(range_float, range_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "de68a66e-a980-41f4-be8a-76c1c22a1665", - "metadata": {}, - "source": [ - "##### Special case for 0 and 1" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "fa96ae3d-d85f-4a9b-87e0-adfaf6f87bd7", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "range_float = scale_to_range(sample, discrete=False, target_min=0, target_max=1)\n", - "range_discrete = scale_to_range(sample, discrete=True, target_min=0, target_max=1)\n", - "\n", - "plot_distribution(range_float, range_discrete)" - ] - }, - { - "cell_type": "markdown", - "id": "3726793b-93d0-4d75-8eb2-4139b319ade8", - "metadata": {}, - "source": [ - "#### Scale to a target sum\n", - "Given a large enough sum (corresponding to the 10_000 samples!), it's OK. But at smaller values zeroes will dominate." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "2a363f73-cb18-477f-a390-0e0956f619c9", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "sum_float = scale_to_sum(sample, discrete=False, target_sum=1_000_000)\n", - "sum_discrete = scale_to_sum(sample, discrete=True, target_sum=1_000_000)\n", - "\n", - "plot_distribution(sum_float, sum_discrete)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0c41daba-6432-4378-98e6-54cde5b9f03b", + "metadata": {}, + "source": [ + "### Background notes" + ] + }, + { + "cell_type": "markdown", + "id": "d5ca10d1-632d-4e50-ba45-4e4ffba47575", + "metadata": {}, + "source": [ + "In `exhibit` you can generate numerical values either from a uniform random distribution or from a normal distribution. These values can then be coerced to either floats or integers (as a hack to get discrete values like 0 and 1). It's unlikely you'd want to use generated values as is, so we apply linear scaling before returning the values to the user.\n", + "\n", + "To draw from a uniform distribution all we need is a starting value and the optional dispersion (noise) percentage to shift the final value around. This uniform value is still affected by feature weights, but with dispersion set to 0 you can get consistent values for the same row values. \n", + "\n", + "For normal distribution, you'd need the mean and the standard deviation. The two statistics, however, will change by the time you finish generating the dataset because for each row, the weights will affect the mean. 
Thus, you can use them as initial values, but will have to rescale at the end if you want to keep the original values. If the mean / std are commented out in the spec (user intending to scale to range, for example), then we'll use a default mean of 1 (having it as zero will negate mean shifting by weights) and a standard deviation of 1.\n", + "\n", + "Additionally, if the generated distribution has negative values, we will shift the whole set to be in the positive territory if the scaling is set to `target_sum`; other scaling options can be used with negative ranges / statistics.\n", + "\n", + "The scaling options are:\n", + "\n", + " - target range\n", + " - target statistic (`sum`, `mean`, `std`)\n", + " \n", + "The goal of the scaling options is to match user expectations and to preserve the shape of the data. Unfortunately, you can't have all matching statistics so by choosing to align to a certain metric, you leave others free-floating. This is so that we don't lose information from the weights and can preserve the shape of the data.\n", + "\n", + "| scaling | min / max | sum | mean | standard deviation | weights |\n", + "| :- | :- | :- | :- | :- | :- |\n", + "| target range | preserved | free-floating | free-floating | free-floating | __*__ |\n", + "| target statistic (sum) | free-floating | preserved | free-floating | free-floating | preserved |\n", + "| target statistic (mean + std) | free-floating | free-floating | preserved | preserved | preserved |\n", + "\n", + "__\\*__ _if the ratio between target_min and target_max doesn't match the generated min and max, either the weights will change\n", + "or one of the target ranges will need to be adjusted. For example, assume you have 4 values influencing the weights - `0.05, 0.15, 0.3, 0.4`. You want to scale your generated data to between 10 and 150. Without scaling, the generated data can have values like `25, 75, 150, 200`. Now, the ratio `0.3 / 0.15` is the same as `150 / 75`. 
However, if we scale to between 10 and 150, we get `10, 50, 110, 150`. The ratios are no longer preserved; only the intervals are: a weight difference of `0.1` now corresponds to 40, so the values with weights 0.15 and 0.3 are separated by 60. Although this is not particularly intuitive, it's still preferable to adjusting `target_max` like so: `target_max = (target_min * generated.max()) / generated.min()`, because you often need fixed min-max ranges, and adjusting the maximum in this way can lead to unexpected results, like pushing the maximum into negative territory._\n", + "\n", + "When generating the specification from the original data, we populate all potential fields: `min`, `max`, `sum`, etc. Rather than creating separate functions for different statistics, we'll try to make sense of what we're given and issue a warning if the combination is impossible or conflicting.\n", + "\n", + "For example:\n", + "\n", + "| min | max | sum | mean | standard deviation | outcome |\n", + "| :- | :- | :- | :- | :- | :- |\n", + "| given | missing | missing | missing | missing | derive the missing end of the range |\n", + "| given | given | missing | missing | missing | valid |\n", + "| given | given | given | given | given | warning - use sum as default |\n", + "| missing | missing | given | given | given | warning - use sum as default |\n", + "| missing | missing | missing | given | given | scale to both |\n" + ] + }, + { + "cell_type": "markdown", + "id": "5c1ec07e-d3df-4bd3-a3d8-b7cee08294c0", + "metadata": {}, + "source": [ + "#### Scaling functions\n", + "Scaling floats works as expected. However, as soon as you ask for discrete values, we're at the mercy of rounding, which is fine for large values but will produce highly imbalanced datasets when values are low or when the standard deviation is small. We try to compensate for it in both scaling to range and scaling to target sum, but if we only have mean / standard deviation to work with, the results will likely be quite imprecise."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c5386e8b-ca06-46f4-b55f-d42501f5917f", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "cf4670b6-12dd-4a23-b1bd-61557ca96084", + "metadata": {}, + "outputs": [], + "source": [ + "def scale_to_mean_std(array, target_mean=100, target_std=20, discrete=False):\n", + " \n", + " arr = np.array(array)\n", + " \n", + " result = target_mean + (arr - arr.mean()) * target_std / arr.std()\n", + " \n", + " if discrete:\n", + " result = result.round()\n", + " \n", + " return result" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "fa835db8-eeed-4eec-9a07-a143098151f4", + "metadata": {}, + "outputs": [], + "source": [ + "def scale_to_range(array, target_min=None, target_max=None, discrete=False):\n", + " \n", + " X = np.array(array)\n", + " \n", + " # derive the missing end of the range; adjust for potential negative signs!\n", + " # compare against None so that a legitimate target of 0 is not treated as missing\n", + " if target_min is None:\n", + "\n", + " target_min = target_max - (abs(target_max) - abs(target_max * X.min() / X.max()))\n", + " \n", + " if target_max is None:\n", + "\n", + " target_max = target_min + abs(target_min * X.max() / X.min()) - abs(target_min)\n", + "\n", + " if discrete:\n", + "\n", + " target_range = int(np.ceil(target_max) - np.floor(target_min))\n", + " bins = np.linspace(X.min(), X.max(), target_range + 2)\n", + " labels = np.arange(np.floor(target_min), np.ceil(target_max) + 1)\n", + "\n", + " result = pd.cut(X, bins=bins, right=True, include_lowest=True, labels=labels).to_numpy()\n", + " \n", + " return result\n", + " \n", + " result = (X - X.min()) / (X.max() - X.min()) * (target_max - target_min) + target_min\n", + "\n", + " return result" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4d0823e7-a911-402e-b458-f2fecf879df0", + "metadata": {}, + "outputs": [], + "source": [ + "def scale_to_sum(array, 
target_sum=200, discrete=False):\n", + " \n", + " series = pd.Series(array)\n", + " \n", + " if any(series < 0):\n", + " series = series + abs(series.min())\n", + " \n", + " scaling_factor = target_sum / series.dropna().sum()\n", + " scaled_series = series * scaling_factor\n", + " \n", + " if discrete:\n", + " \n", + " row_diff = (target_sum - series.dropna().sum()) / len(series.dropna())\n", + " values = pd.Series(\n", + " np.where(\n", + " series + row_diff >= 0,\n", + " series + row_diff,\n", + " np.where(pd.isnull(series), np.nan, 0)\n", + " )\n", + " )\n", + " \n", + " #how many rows will need to be rounded up to get to target\n", + " boundary = int(target_sum - np.floor(values).sum())\n", + "\n", + " #because values are limited at the lower end at zero, sometimes it's not possible\n", + " #to adjust them to a lower target_sum; we floor them and return\n", + " if boundary < 0:\n", + " return pd.Series(np.floor(values)).to_numpy()\n", + "\n", + " #if series has NAs, then the calculation will be off\n", + " clean_values = values.dropna() #keep original index\n", + "\n", + " #np.ceil and floor return Series so index is preserved\n", + " values.update(np.maximum(np.ceil(clean_values.iloc[0:boundary]), 1))\n", + " values.update(np.floor(clean_values.iloc[boundary:]))\n", + "\n", + " #return a series of ints or cast to float if there are any NAs\n", + " #see https://github.com/pandas-dev/pandas/issues/29618\n", + " #before migrating to the new Pandas null-aware int dtype\n", + " scaled_series = values if values.isna().any() else values.astype(int)\n", + " \n", + " return scaled_series" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "176afd13-c1a2-42ca-818b-fe36fb80b31a", + "metadata": {}, + "outputs": [], + "source": [ + "def transform_datasets(datasets, func, **kwargs):\n", + "\n", + " new_datasets = {}\n", + "\n", + " for i, dataset in datasets.items():\n", + " new_datasets[i] = (\n", + " func(dataset[0], **kwargs),\n", + " func(dataset[1], 
**kwargs)\n", + " )\n", + " \n", + " return new_datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1a0ca1a9-2f58-4d62-8d28-a5ee7715bce0", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_original_quartet(datasets):\n", + " '''\n", + " From Matplotlib gallery: https://matplotlib.org/stable/gallery/specialty_plots/anscombe.html\n", + " '''\n", + "\n", + " fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 6),\n", + " gridspec_kw={'wspace': 0.08, 'hspace': 0.08})\n", + " \n", + " for ax, (label, (x, y)) in zip(axs.flat, datasets.items()):\n", + " ax.text(0.1, 0.9, label, fontsize=20, transform=ax.transAxes, va='top')\n", + " ax.tick_params(direction='in', top=True, right=True)\n", + " ax.plot(x, y, 'o')\n", + "\n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "6dc888d7-a2c1-4319-91d3-6aca0a966203", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_scaled_quartets(float_dataset, discrete_dataset):\n", + "\n", + " fig, axs = plt.subplots(2, 4, sharex=True, sharey=True, figsize=(12, 6),\n", + " gridspec_kw={'wspace': 0.08, 'hspace': 0.08})\n", + " \n", + " colors = [\"firebrick\", \"teal\"]\n", + " \n", + " for i, j in enumerate(range(0, 4, 2)):\n", + " \n", + " if i == 0:\n", + " dataset = float_dataset\n", + " else: \n", + " dataset = discrete_dataset\n", + " \n", + " for ax, (label, (x, y)) in zip(axs[:, j:j+2].flat, dataset.items()):\n", + " ax.text(0.1, 0.9, label, fontsize=20, transform=ax.transAxes, va='top')\n", + " ax.tick_params(direction='in', top=True, right=True)\n", + " ax.plot(x, y, 'o', c=colors[i])\n", + " \n", + " axs[0, 0].set_title(\"Float\", loc=\"left\") \n", + " axs[0, 2].set_title(\"Discrete\", loc=\"left\") \n", + "\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "b1b057d4-852e-4ba6-95ae-bcf193fdbec2", + "metadata": {}, + "source": [ + "#### Original Anscombe's Quartet datasets" + ] + }, + { + "cell_type": "code", 
+ "execution_count": 8, + "id": "8f95cb45-ccad-4576-87de-02b860bdff6b", + "metadata": {}, + "outputs": [], + "source": [ + "x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]\n", + "y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]\n", + "y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]\n", + "y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]\n", + "x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]\n", + "y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]\n", + "\n", + "datasets = {\n", + " 'I': (x, y1),\n", + " 'II': (x, y2),\n", + " 'III': (x, y3),\n", + " 'IV': (x4, y4)\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "67dc8645-5d87-4ab5-8965-43d6ff63a892", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "[base64-encoded PNG omitted: 2x2 grid of scatter plots of the original Anscombe's Quartet datasets I-IV]\n", + 
"text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plot_original_quartet(datasets)" + ] + }, + { + "cell_type": "markdown", + "id": "e1172759-dcfc-4081-9163-f50f37665c4e", + "metadata": {}, + "source": [ + "#### Scaling to a given mean and standard deviation\n", + "No difference between float and discrete option" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "bf8b7c19-4739-4071-a465-9b3b88aa6039", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAsIAAAFyCAYAAAD/MLwxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAA0A0lEQVR4nO3df3Sdd33g+fcHx1CZtoKMk4wUYzvFIT6AQ1u0bHtm3Q1VGGgPRml36IRmZtw0rba77PS0U+1AVmcHOLsa6FSdpmc70FWBku640Ey3jWIO5UfUYcl2wnBsfkT8MIsL2BFSwTRF0EYDjvnsH7pyZEU/7o9Hep57n/frHB3pfu9zn/u59378+PM89/sjMhNJkiSpbp5WdgCSJElSGSyEJUmSVEsWwpIkSaolC2FJkiTVkoWwJEmSaslCWJIkSbVUuUI4Ig5GREbEVWXHIkm9JCJ+NyL+17LjkKSqKLUQjogvR8RSRPztyg8wWOD+MyIOFbU/SaqyVcfUb0XENyLiP0XEL0XE0wAy85cy83/bwXhuiYi5nXo+SWpVFa4IH8vM7135AebLDkiSutixzPw+4ADwFuB1wDu244n85k5St6tCIbypiBiMiAci4rGIOBsRv7jqvpdExMONKx8LEfE7EfH0xn0faWz2qcbV5n9cyguQpBJk5mJmPgD8Y+B4RLwwIt4VEf87QETsjYj3No6fj0XEQytXjiPiORHxJxFxISL+OiJ+p9H+cxHxFxHxWxHxGPDGiHhGRExGxPmI+Gqj+0VfRDwT+DNgcNW3foMR8bSIeH1E/GVj3/dFxNUlvU2Saq7yhTDwbmCO5S4T/wj41xEx3LjvEvCrwF7gR4Fh4H8EyMwfa2zzosbV5j/a0aglqQIy82MsH0OPrrnr1xrt1wDXAf8LkBGxC3gvcA44CFwPvGfV4/5r4IvAtcAE8OvA84AfBA41tv9Xmfl3wE8A86u+9ZsHfhm4DfhvWT6u/w3w74p8zZLUrCoUwvc3rkh8IyLuX31HRDwH+G+A12Xmf8nMTwJvB/4pQGaezsyPZuYTmfll4P9k+eAqSXrSPLD2qutFYAA4kJkXM/OhzEzgJSwXqP9zZv5d49j7/67eV2b+H5n5BPBfgF8EfjUzH8vMbwH/Grh9k1j+e2A8M+cy89vAG4F/ZDcLSWWowoHntsx8cOVGRBxcdd8gsHJwXXEOGGps+zzg3zZu72H59Zze7oAlqctcDzy2pu03WC5CPxgRAFOZ+RbgOcC5RqG7nkdX/X0Ny8fe0419AASwa5NYDgB/GhHfXdV2ieWr0l/Z8pVIUoGqcEV4M/PA1RHxfava9vPkwfJtwBngxsz8fpa/2gskSQBExH/FciG8+qoumfmtzPy1zPwB4BjwLxrdzh4F9m9yhTZX/f11YAl4QWY+q/HT3xj4vHbbFY8CP7Fq+2dl5vdkpk
WwpB1X6UI4Mx8F/hPw5oj4noi4GbgLONHY5PuAbwJ/GxGHgf9hzS6+CvzATsUrSVUREd8fEa9kuX/vv8/M2TX3vzIiDsXypdxvsnxV9hLwMWABeEtEPLNx7P0H6z1HZn4X+D3gtyLi2sZ+r4+Ilzc2+Srw9yKif9XDfheYiIgDje2viYiRol63JLWi0oVww2tYHrAxD/wp8IbM/FDjvjHgZ4FvsXwwXjsg7o3AvY3+xz+zI9FKUrlORsS3WL7yOs5y97E719nuRuBB4G+Bh4G3ZuaHM/MSy1eIDwHnWR5Qt9msO68DzgIfjYhvNvZ5E0BmnmF5wPMXG8fhQeC3gQdY7pLxLeCjLA/Ak6QdF8tjIyRJkqR66YYrwpIkSVLhLIQlSZJUSxbCkiRJqiULYUmSJNWShbAkSZJqqbSV5fbu3ZsHDx7seD8XLlzgmmuu6TygAlUxJjAugNOnT389M5t+sl7OU6hmXFWMCaqdp9DbuVrFmMC4oLxc9b1vTRXjqkyeZmYpPy9+8YuzCEXtp0hVjCnTuDIzgVNpnl5WxbiqGFNmtfM0ezxXqxhTpnFllpervvetqWJcVclTu0ZIkiSpliyEJUmSVEtdXwiPjo6WHcJTVDEmMK4yVfU1VjGuKsYE1Y2raFV8nVWMCYyrTFV9jcbVvKrEVNoSy0NDQ3nq1KlSnlv1FRGnM3Oo2e3NU5Wh1TwFc1XlMFfVDTbL066/IixJkiS1w0JYkiRJtWQhLEmSpFqyEJYkSVItWQhLkiSpliyEJUmSVEsWwpIkSaolC2FJkiTVkoWwJEmSaslCWJIkSbVkISxJkqRa6vpCOCKIiLLDkDa1UZ6av6oac1LdwGOqitL1hbAkSZLUDgthSZIk1ZKFsCRJkmrJQliSJEm1ZCEsSZKkWrIQliRJUi1tWQhHxDsj4msR8el17huLiIyIvava7o6IsxHx+Yh4edEBS5IkSUVo5orwu4BXrG2MiOcALwPOr2p7PnA78ILGY94aEbsKiVSSJEkq0JaFcGZ+BHhsnbt+C/iXQK5qGwHek5nfzswvAWeBlxQRqCRJklSktvoIR8SrgK9k5qfW3HU98Oiq23ONNkmSJKlSrmr1ARGxBxgH/uF6d6/Tluu0ceHCBYaGhi7fHh0dZXR0tNVwpC1NTU0xNTW1cnPvZtuuZZ5qp3SSp2CuaueYq+oGzeZpZK5bp165UcRB4L2Z+cKIOALMAI837t4HzLPcBeJOgMx8c+NxHwDemJkPr93n0NBQnjp1qtnXs1lsNJ6z432p90XE6cwc2nrLZdudp+av1tNqnoLHVJWjrFz1mKpWbJanLXeNyMzZzLw2Mw9m5kGWuz/8cGb+FfAAcHtEPCMibgBuBD7WQeySJEnStmhm+rR3Aw8DN0XEXETctdG2mfkZ4D7gs8D7gddm5qWigt3gOT3zU+VtlKfmr6rGnFQ38JiqomzZRzgzX7PF/QfX3J4AJjoLS82am57mzOQkSwsL9A0McHhsjH0jI2WHJUmSVHktD5ZTdcxNT/PI+DiXlpYAWJqf55HxcQCLYUlqw4nZWcZnZji/uMj+/n4mhoe548iRssOStE1cYrmLnZmcvFwEr7i0tMSZycmSIpKk7nVidpbRkyc5t7hIAucWFxk9eZITs7NlhyZpm1gId7GlhYWW2iVJGxufmeHxixevaHv84kXGZ2ZKikjSdrMQ7mJ9AwMttUuSNnZ+cbGldkndz0K4ix0eG2NXX98Vbbv6+jg8NlZSRJLUvfb397fULqn7WQh3sX0jI9w8MUHf4CBE0Dc4yM0TEw6Uk6Q2TAwPs2f37iva9uzezcTwcEkRSdpuzhrR5faNjFj4SlIBVmaHcNYIqT4shCVJarjjyBELX6lG7BohSZKkWrIQliRJUi1ZCEuSJKmWLIQlSZJUSxbCumxuepoHjx7l5KFDPHj0KHPT02WHJEmStG2cNULAchH8yPg4l5aWAFian+eR8XEAp2eTpDadmJ11Ojapwr
wiLADOTE5eLoJXXFpa4szkZEkRSVJ3OzE7y+jJk5xbXCSBc4uLjJ48yYnZ2bJDk9RgISwAlhYWWmqXJG1ufGaGxy9evKLt8YsXGZ+ZKSkiSWtZCAuAvoGBltolSZs7v7jYUruknWchLAAOj42xq6/virZdfX0cHhsrKSJJ6m77+/tbape08yyEBSwPiLt5YoK+wUGIoG9wkJsnJhwoJ0ltmhgeZs/u3Ve07dm9m4nh4ZIikrTWlrNGRMQ7gVcCX8vMFzbafgM4BnwH+Evgzsz8RuO+u4G7gEvAL2fmB7YndBVt38iIha8kFWRldghnjZCqq5np094F/A7wB6vaPgTcnZlPRMSvA3cDr4uI5wO3Ay8ABoEHI+J5mXmp2LAlSaq+O44csfCVKmzLrhGZ+RHgsTVtH8zMJxo3Pwrsa/w9ArwnM7+dmV8CzgIvKTBeSZIkqRBF9BH+eeDPGn9fDzy66r65RpskSZJUKR2tLBcR48ATwImVpnU2y/Uee+HCBYaGhi7fHh0dZXR0tJNwpHVNTU0xNTW1cnNvK481T7c2Nz3NmclJlhYW6BsY4PDYmH3N29BJnoK5qp1jrm4/VyTsXLN5Gpnr1qlXbhRxEHjvymC5Rttx4JeA4cx8vNF2N0Bmvrlx+wPAGzPz4bX7HBoaylOnTjX7eqRCRMTpzBzaestl5unm1i7NDcvT7jnjSGdazVMwV1UOc7V5zRa3KysSrl6MZc/u3UwdO3bF9hbLzdssT9vqGhERrwBeB7xqpQhueAC4PSKeERE3ADcCH2vnOSSVa256mgePHuXkoUM8ePQoc9PTT9mm2aW5m9mXJPWqVpbbbmZFQpfvLs6WhXBEvBt4GLgpIuYi4i6WZ5H4PuBDEfHJiPhdgMz8DHAf8Fng/cBrnTFC6j4rV3qX5uchk6X5eR4ZH39KAdvM0tzN7kuSelUry203syKhy3cXp5lZI16TmQOZuTsz92XmOzLzUGY+JzN/sPHzS6u2n8jM52bmTZn5Z5vtW1I1NXult5mluZvdlyT1qlaW225mRUKX7y6OK8tJeopmrvRCc0tzN7svSepVrSy33cyKhC7fXRwL4QLZD1K9opkrvdDc0tzN7kta68TsLAfvuYenvelNHLznHvs/qmu1stz2HUeOMHXsGAf6+wngQH//UwbKuXx3cTqaPk1PWjt6fqUfJODoeVXOVlOeHR4bW3c2iNVXeldstTR3K/tyKjatWDtyfmUwEODIeFXOVjM4tLrc9lYrEja7P2eW2JqFcEE26wfpf+SqkmZO2lZ+F1GUNrsvTya12maDgfyPXFXS7Elb0cttb7U/TyabYyFcEPtBqls0e9K21ZXeVjSzL08mtZqDgdQtqnrSVtW4qsY+wgWxH6S6RVVP2qoal8rhYCB1i6qetFU1rqqxEC5IM6PnpSqo6klbVeNSORwMpG5R1ZO2qsZVNRbCBWlm9LxUBVU9aatqXCpHMyPnpSqo6klbVeOqGvsIF6jIPpXSdilyIFwd4lJ5ih5c5Ah6bYdWZ4Soe1xVYyFcUU4jpe1U1ZO2qsal7ucIem2nok/ailLVuKrErhEVtDKN1NL8PGRenkbKBTokqT2bjaCXVF8WwhW02TRS0mZc3VBanyPo1S5XOOxtdo2oIKeRUjvqsiCF3Ya0WrP9fvf393NunaLXEfTaTF261NS5/7xXhEuw1VU7p5FSO+rwTYLdhrTaSpFybnGR5MkiZb0rdo6gVzvq0KWmlX9HvchCeIc18x+500ipHXX4JqEOxb6a10qR4nRsakcdutTUodjfjF0jdlgzy8g6jZTa0TcwsHyCtU57r6hDsa/mtVqkOIJerapDl5o6FPub8YrwDmv2P/J9IyPc+tBDHDt7llsfesgiWFuqwzcJdhvSaq6cpe1Why41df93ZCG8w/yPXNulDqsb1qHYV/PqUKSoXHXoUlP3f0d2jdhhh8fGrhjZD/5HruL0+oIUdhvSaq6cpZ3Q611q6v7vyEJ4h/
kfudSZXi/21ZoyipQ6TzWl3tTrxf5mtiyEI+KdwCuBr2XmCxttVwN/BBwEvgz8TGb+TeO+u4G7gEvAL2fmB7Yl8i620/+RO++qJBWjLvPKSnXRTB/hdwGvWNP2emAmM28EZhq3iYjnA7cDL2g85q0RsauwaNUy512VpOLUfaopqddsWQhn5keAx9Y0jwD3Nv6+F7htVft7MvPbmfkl4CzwkmJCVTucd1WSilP3qaakXtPurBHXZeYCQOP3tY3264FHV20312hTSZx3VZKKU/eppqReU/T0abFOW6634YULFxgaGrr8MzU1VXAoAqdrA5iamrqcZ8DeVh5bpTzdamludbdO8hSqlau9rO5TTUFv5OqJ2VkO3nMPT3vTmzh4zz21WU64TprN08hct069cqOIg8B7Vw2W+zxwS2YuRMQA8OHMvKkxUI7MfHNjuw8Ab8zMh9fuc2hoKE+dOtX6K2uBg8Se7CO8drq2XptftlkRcTozh5rdfifytBl+jvXSap7CzuSqsyUs8314UlVzdTNrBzzC8slMr80PrCdtlqftTp/2AHAceEvj9/Sq9j+MiH8LDAI3Ah9r8zk6srZwWBkkBtSqcHC6tt7QzNLc0nZytoQn1XmqqV6w2YBHP9f6aWb6tHcDtwB7I2IOeAPLBfB9EXEXcB54NUBmfiYi7gM+CzwBvDYzL21T7JuycHiS8652P/t6q2wWD+oVDnjUalsWwpn5mg3uWrdDVGZOABOdBFUECwf1kr6BgeUp8NZpl3aCxYN6xf7+fs6tk7cOeKynogfLVUbRg8QcqKQyHR4bY1df3xVtLs2tnVT0bAkOVlJZHPCo1Xq2EC6ycHBRCpVt38gIN09M0Dc4CBH0DQ46UE47qsjiYaW/8bnFRZIn+xtbDGsn3HHkCFPHjnGgv58ADvT3O1CuxtodLFd5RQ4Ss7+xqsC+3irTSpFQxGwJ9jdW2RzwqBU9WwhDcYWD/Y0lqbjiwf7GkqqiZ7tGFMlFKSSpOK7OJqkqLISb4EAlSSqOg5UkVUVPd40oiotSSFJxiuxvLEmdsBBukgOVJKk4DlaSVAV2jZAkSVItWQhLkiSpliyEJUmSVEsWwpIkSaolB8tJJZubnnZGEkkqyInZWWckUdMshKUSzU1P88j4+OUlvJfm53lkfBzAYliSWnRidpbRkycvL+F9bnGR0ZMnASyGtS67RkglOjM5ebkIXnFpaYkzk5MlRSRJ3Wt8ZuZyEbzi8YsXGZ+ZKSkiVZ2FsFSipYWFltolSRs7v7jYUrtkISyVqG9goKV2SdLG9vf3t9QuWQhLJTo8Nsauvr4r2nb19XF4bKykiCSpe00MD7Nn9+4r2vbs3s3E8HBJEanqHCwnlWhlQJyzRkhS51YGxDlrhJrVUSEcEb8K/AKQwCxwJ7AH+CPgIPBl4Gcy8286inINp5tSL9k3MmL+qnROOaVecceRI+aumtZ214iIuB74ZWAoM18I7AJuB14PzGTmjcBM43ZhVqabWpqfh8zL003NTU8X+TSSVBsrU06dW1wkeXLKqROzs2WHJknbqtM+wlcBfRFxFctXgueBEeDexv33Ard1+BxXcLopSSqWU05Jqqu2C+HM/AowCZwHFoDFzPwgcF1mLjS2WQCuLSLQFU43JUnFcsopSXXVSdeIZ7N89fcGYBB4ZkT8k2Yff+HCBYaGhi7/TE1NNfU4p5tSq6ampi7nGbC3lce2m6dSqzrJU+gsV51ySq0oM1elZjWbp5GZbT1BRLwaeEVm3tW4/c+AHwGGgVsycyEiBoAPZ+ZNax8/NDSUp06davl51y5JC8vTTd08MeGAI20pIk5n5lCz27ebp1InWs1T6CxX1y5LC8tTTk0dO+agI21qp3NVasdmedpJH+HzwI9ExJ6ICJYL4M8BDwDHG9scBwodxbZvZISbJyboGxyECPoGBy2CJakDdxw5wtSxYxzo7yeAA/39FsGSaqHt6dMy8z9HxB8DHweeAD4BTAHfC9
wXEXexXCy/uohAV3O6KUkqllNOSaqjjuYRzsw3AG9Y0/xtlq8OS5IkSZXlEsuSJEmqJQthSZIk1ZKFsCRJkmqp9oXw3PQ0Dx49yslDh3jw6FGXapYkSaqJjgbLdbu1cxIvzc/zyPg4gLNSSFIbTszOMj4zw/nFRfb39zMxPOxsFJIqq9ZXhM9MTl6xMAfApaUlzkxOlhSReo3fOKhOVhbmOLe4SALnFhcZPXmSE7OzZYemHnFidpaD99zD0970Jg7ec4+5pY7VuhBeWlhoqV1qxco3Dkvz85B5+RsHi2H1qvGZmStWpwN4/OJFxmdmSopIvcQTLW2HWhfCfQMDLbVLrfAbB9XN+cXFltqlVniipe1Q60L48NgYu/r6rmjb1dfH4bGxkiJSL/EbB9XN/v7+ltqlVniipe1Q60J438gIN09M0Dc4CBH0DQ5y88SEA+VUCL9xUN1MDA+zZ/fuK9r27N7NxLCLjapznmhpO9S6EIblYvjWhx7i2Nmz3PrQQxbBKozfOKhu7jhyhKljxzjQ308AB/r7mTp2zFkjVAhPtLQdaj19mrSdVk6qzkxOsrSwQN/AAIfHxjzZ0o4pYyqzO44csfDVtljJK6fnU5EqVwjPTU9bOKhn7BsZMX9VipUR9iuDi1ZG2AMWDupanmipaJXqGuF0U5JUDEfYS9LWKlUIO92UJBXDEfaStLVKFcJONyVJxXCEvSRtrVKFsNNNSVIxHGEvSVurVCHsdFOSVAynMpOkrVVq1ginm5Kk4jjCXpI2V6lCGJxuSpIkSTujo64REfGsiPjjiDgTEZ+LiB+NiKsj4kMR8YXG72cXFawkSZJUlE77CP828P7MPAy8CPgc8HpgJjNvBGYatyVJkqRKabsQjojvB34MeAdAZn4nM78BjAD3Nja7F7itsxAlSZKk4nVyRfgHgAvA70fEJyLi7RHxTOC6zFwAaPy+toA4JUmSpEJ1UghfBfww8LbM/CHg72ihG8SFCxcYGhq6/DM1NdVBKNLGpqamLucZsLeVx5qn2imd5CmYq9o55qq6QbN5GpnZ1hNExN8HPpqZBxu3j7JcCB8CbsnMhYgYAD6cmTetffzQ0FCeOnWqreeW2hURpzNzqNntzVOVodU8BXNV5TBX1Q02y9O2rwhn5l8Bj0bESpE7DHwWeAA43mg7Dky3+xySJEnSdul0HuF/DpyIiKcDXwTuZLm4vi8i7gLOA6/u8DkkSZKkwnVUCGfmJ4H1LjW7mL0kSWrKidlZxmdmOL+4yP7+fiaGh10VUTui03mEpVqam57mwaNHOXnoEA8ePcrctD2AJKkdJ2ZnGT15knOLiyRwbnGR0ZMnOTE7W3ZoqgELYalFc9PTPDI+ztL8PGSyND/PI+PjFsOS1IbxmRkev3jxirbHL15kfGampIhUJxbCUovOTE5yaWnpirZLS0ucmZwsKSJJ6l7nFxdbapeKZCEstWhpYaGldknSxvb397fULhXJQlhqUd/AQEvtkqSNTQwPs2f37iva9uzezcSw4+61/SyEpRYdHhtjV1/fFW27+vo4PDZWUkSS1L3uOHKEqWPHONDfTwAH+vuZOnbMWSO0IzqdR1iqnX0jI8ByX+GlhQX6BgY4PDZ2uV2S1Jo7jhyx8FUpuv6KcBXXKK9iTGBcRdo3MsKtDz3EsbNnufWhh7Ysgqv6GqsYVxVjgurGVbQqvs4qxgTGVaaqvkbjal5VYrIQ3gZVjAmMq0xVfY1VjKuKMUF14ypaFV9nFWMC4ypTVV+jcTWvKjF1fSEsSZIktSMys5wnjrgAnCtgV3uBrxewnyJVMSYwLoADmXlNsxv3eJ5CNeOqYkxQ4TyFns/VKsYExgXl5arvfWuqGFcl8rS0QliSJEkqk10jJEmSVEsWwpIkSaolC2FJkiTVkoWwJEmSaslCWJIkSbVkISxJkqRashCWJElSLVkIS5IkqZYshCVJklRLFsKSJEmqpavKeuK9e/fmwYMHO97PhQsXuOaalp
Y533ZVjAmMC+D06dNf32i98fX0cp5CNeOqYkxQ7TyF3s7VKsYExgXl5arvfWuqGFdl8jQzS/l58YtfnEUoaj9FqmJMmcaVmQmcSvP0sirGVcWYMqudp9njuVrFmDKNK7O8XPW9b00V46pKnto1QpIkSbVkISxJkqRa6vpCeHR0tOwQnqKKMYFxlamqr7GKcVUxJqhuXEWr4uusYkxgXGWq6ms0ruZVJaZY7jqx84aGhvLUqVOlPHfdzE1Pc2ZykqWFBfoGBjg8Nsa+kZGywypFRJzOzKFmtzdPVYZW8xTM1Z10YnaW8ZkZzi8usr+/n4nhYe44cqTssEphrqobbJanpc0aoZ0xNz3NI+PjXFpaAmBpfp5HxscBalsMS1K7TszOMnryJI9fvAjAucVFRk+eBKhtMSx1s67vGqHNnZmcvFwEr7i0tMSZycmSIpKk7jU+M3O5CF7x+MWLjM/MlBSRpE54RbjHLS0stNQuaXvZVam7nV9cbKld0vbqtKuSV4R7XN/AQEvtkrbPSlelpfl5yLzcVWluerrs0NSk/f39LbVL2j4rXZXOLS6SPNlV6cTsbNP7sBDucYfHxtjV13dF266+Pg6PjZUUkVRfdlXqfhPDw+zZvfuKtj27dzMxPFxSRFJ9FdFVya4RPW7lK1e/ipXKZ1el7rfylauzRkjlK6KrkoVwDewbGbHwlSqgb2BguVvEOu3qHnccOWLhK1XA/v5+zq1T9LbSVcmuEZK0Q+yqJEnFKaKrkleEJWmH2FVJkopTRFclC2FJ2kF2VZKk4nTaVcmuEZIkSaqlri+EI4KI6Lhd2k5r8+5nf/ZniQje9ra3bfnYl73sZUQE999//zZGKC3zmKpu4DFVRen6QljqRqOjowD83u/93qbbffnLX2ZmZoaBgQFe+cpX7kRoktR1PKaqXRbCUgluueUWnve85/GJT3yCj3/84xtu9/a3v53M5M477+Sqq+zSL0nr8ZiqdlkISyX5xV/8RWDjKxiXLl3iXe96FxHBL/zCL+xkaJLUdTymqh0WwlJJjh8/ztOf/nT+8A//kMcff/wp97/vfe/jK1/5Crfeeis33HBDCRFKUvfwmKp2WAhLJbnmmmu47bbb+OY3v8l99933lPvf/va3A0/2fZMkbcxjqtphISyVaOWAvHKAXrGwsMD73vc+rrvuOkacc1aSmuIxVa3ashCOiHdGxNci4tPr3DcWERkRe1e13R0RZyPi8xHx8qIDlnrJj//4j/Pc5z6Xv/iLv+Bzn/vc5fbf//3f54knnuDnfu7n2L1m+UhJ0vo8pqpVzVwRfhfwirWNEfEc4GXA+VVtzwduB17QeMxbI2JXIZFKPWj1oI2VKxiZyTve8Q4HdEhSizymqlVbFsKZ+RHgsXXu+i3gXwK5qm0EeE9mfjszvwScBV5SRKBSr7rzzjvZvXs3f/AHf8B3vvMd/vzP/5wvfvGLvPSlL+XQoUNlhydJXcVjqlrRVh/hiHgV8JXM/NSau64HHl11e67RJmkD1113Ha961av4+te/zv3333956h8HdEhS6zymqhUtF8IRsQcYB/7Venev05brtHHhwgWGhoYu/0xNTbUaitSUqampy3kG7N1q+9V2Kk9X5r/8zd/8Te6//3727t3LT/3UT23Lc6maOslT8JiqndMNueoxVc3maWSuW6deuVHEQeC9mfnCiDgCzAArk/TtA+ZZ7gJxJ0BmvrnxuA8Ab8zMh9fuc2hoKE+dOtXKa9ooNhrP2VG76iEiTmfmULPbb3eershMnvvc5/KlL30JgF/7tV9jcnKy4+dVd2o1T8FjqspRVq56TFUrNsvTlq8IZ+ZsZl6bmQcz8yDL3R9+ODP/CngAuD0inhERNwA3Ah/rIHapFiKCu+666/LtlasZkqTWeUxVs5qZPu3dwMPATRExFxF3bbRtZn4GuA/4LPB+4LWZeamoYDd4znXPCFttl7ZTM3k3Pj5+ebubbrpphyKTruQxVd3AY6qKct
VWG2Tma7a4/+Ca2xPARGdhSZIkSdvLleUkSZJUSxbCkiRJqiULYUmSJNWShbAkSZJqyUJYkiRJtWQhLEmSpFqyEJYkSVItWQhLkiSpliyEJUmSVEsWwpIkSaolC2FJkiTVkoWwJEmSaslCWJIkSbVkISxJkqRashCWJElSLVkIS5IkqZYshCVJklRLFsKSJEmqJQthSZIk1ZKFsCRJkmppy0I4It4ZEV+LiE+vavuNiDgTEY9ExJ9GxLNW3Xd3RJyNiM9HxMu3KW5JkiSpI81cEX4X8Io1bR8CXpiZNwP/H3A3QEQ8H7gdeEHjMW+NiF2FRStJkiQVZMtCODM/Ajy2pu2DmflE4+ZHgX2Nv0eA92TmtzPzS8BZ4CUFxitJkiQV4qoC9vHzwB81/r6e5cJ4xVyj7SkuXLjA0NDQ5dujo6OMjo4WEI50pampKaamplZu7m3lseapdkoneQrmqnaOuapu0GyeRmZuubOIOAi8NzNfuKZ9HBgCfjozMyL+HfBwZv77xv3vAN6Xmf/32n0ODQ3lqVOnmnw5UjEi4nRmDm295TLzVGVoNU/BXFU5zFV1g83ytO0rwhFxHHglMJxPVtNzwHNWbbYPmG/3OSRJkqTt0tb0aRHxCuB1wKsy8/FVdz0A3B4Rz4iIG4AbgY91HqYkSZJUrC2vCEfEu4FbgL0RMQe8geVZIp4BfCgiAD6amb+UmZ+JiPuAzwJPAK/NzEvbFbwkSZLUri0L4cx8zTrN79hk+wlgopOgJEmSpO1WxKwR6hFz09OcmZxkaWGBvoEBDo+NsW9kpOyw1AI/Q6laTszOMj4zw/nFRfb39zMxPMwdR46UHZZa4GfY2yyEBSwXUI+Mj3NpaQmApfl5HhkfB7CQ6hJ+hlK1nJidZfTkSR6/eBGAc4uLjJ48CWAh1SX8DHtfW4Pl1HvOTE5eLqBWXFpa4szkZEkRqVV+hlK1jM/MXC6gVjx+8SLjMzMlRaRW+Rn2PgthAbC0sNBSu6rHz1CqlvOLiy21q3r8DHufhbAA6BsYaKld1eNnKFXL/v7+ltpVPX6Gvc9CWAAcHhtjV1/fFW27+vo4PDZWUkRqlZ+hVC0Tw8Ps2b37irY9u3czMTxcUkRqlZ9h77MQFrA8mOrmiQn6Bgchgr7BQW6emHCQVRfZNzLCvp/+aWLXLgBi1y72/fRP+xlKJbnjyBGmjh3jQH8/ARzo72fq2DEHWXWRO44c4fiLXsSu5TUT2BXB8Re9yM+whzhrhC7bNzJi0dTF5qanmfuTPyEvLa9hk5cuMfcnf8LVL36xn6tUkjuOHLFo6mInZme591Of4lImAJcyufdTn+If7N/v59ojvCIs9QhnjZCkYjlrRO+zEJZ6hLNGSFKxnDWi99k1QuoRfQMDLM3Pr9suqTmuIqbV9vf3c26dotdZI3qHV4S73Nz0NA8ePcrJQ4d48OhR5qanyw5JJXHWCKkzK6uInVtcJHlyFbETs7Nlh6aSOGtE77MQ7mIrS+ouzc9D5uUldS2G68mZP6TO2B9UaznzR++za0QX22xwlMVPPTnzh9Q++4NqPc780du8ItzFHByltewqI7XPVcS0nhOzsxy85x6e9qY3cfCee+wq02MshLuYS+pqNbvKSJ2xP6jWst9477MQrqhmruw5OEqrOY+wtLmtruzZH1Rr2W+899lHuIJWruytFDUrV/aAK/p/rvx9ZnKSpYUF+gYGODw2Zh/RmrKrjLSxlSt7K0XNypU94IpC1/6gWs1+473PQriCWhkE5+Aordj9rGdx8W/+Zt12qe42u7Jn4auNXN3Xx1+v+f94pV29YcuuERHxzoj4WkR8elXb1RHxoYj4QuP3s1fdd3dEnI2Iz0fEy7cr8F7mlT21JbO1dqlGvLInaT3N9BF+F/CKNW2vB2Yy80ZgpnGbiHg+cDvwgsZj3hoRuwqLtiYcBKd2XNzgP/SN2qVe0OyIfmeEUDseW+dq8Gbt6j5bFsKZ+RHgsTXNI8C9jb
/vBW5b1f6ezPx2Zn4JOAu8pJhQe8dWA+EcBKd2eAKlumllRL8zQqgdnkD1vnZnjbguMxcAGr+vbbRfDzy6aru5RpsampniyhXC1A5PoFQ3rYzod0YItcMTqN5X9GC5WKdt3Q6KFy5cYGho6PLt0dFRRkdHCw6nepodCOcguOJMTU0xNTW1cnNvK4/tpjx1FpHu1kmeQnflalFa7ffrjBDFqFOuruTL+MwM5xcX2d/fz8TwsHnUBZrN08gmBtJExEHgvZn5wsbtzwO3ZOZCRAwAH87MmyLiboDMfHNjuw8Ab8zMh9fuc2hoKE+dOtXaq+oBJw8dWn/wUgTHzp7d+YBqJiJOZ+bQ1lsuq2ueqlyt5inUM1cP3nMP59Ypeg/09/PlX/mVnQ+ohsxVdYPN8rTdrhEPAMcbfx8Hple13x4Rz4iIG4AbgY+1+Rw9yX6cklQMv7aW1Klmpk97N/AwcFNEzEXEXcBbgJdFxBeAlzVuk5mfAe4DPgu8H3htZl7aruCrxtXgJKk4rgQnabtt2Uc4M1+zwV3rnnJn5gQw0UlQ3cjV4FQFc9PT5pZ6givBqSpOzM7aR7iHubJcQVwNTmVr9mRM6gauBKcqaPaETN2r3T7CWsPV4FS2zU7GpG7jSnCqglam6FN3shAuiIPgVLal+fmW2qUqcyEDVcF6s5Js1q7uYyHcJFeDU9XFrvVXM9+oXaoyZ4RQFeyK9ZZH2Lhd3cc+wk1opu+lg+BUtry0/gQtG7VLZdpqAJILGagKLm2w1sJG7eo+FsJNcDU4dYO+wcF1u0H0DQ6WEI20MWeEULc40N+/4aIt6g12jWiCA+HUDQ6PjRFrvkqO3bvtnqPKcQCSusXE8DBPX9O97Om7dtlFp4dYCDfBgXCSVBxnhFA3yTXdINbeVnerfSHsanDqFWcmJ8k1V9ny4kWnT9OO2mo1OHBGCHWP8ZkZLn73u1e0Xfzud/32oofUuhBeGQS3ND8PmZcHwa0thveNjHDzxMRyX8sI+gYHuXliwv7AqhS78KhsK31/zy0ukjzZ93dtMeyMEOoWfnvR+2o9WM7V4NRL+gYG1h8sZxce7ZBmV4NzRgh1i/0bDJbz24veUetC2Cto6iWHx8aumOYP7MKjndXK1TNnhFA3mBgevmKGE/Dbi15T664RDoJTL7ELj8pm31/1mjuOHGHq2DEO9PcTLE+bNnXsmCdxPaSnrwjPTU9vusCFV9DUa+zCozJ59Uy9yG8velvPFsKuBidJxXI1OEm9pmcLYVeDk6TiuBqcpF7Us32EHQgnScVxNThJvahnC2EHwklScZxPVVIv6spC2NXgpPU1829DWs9WK8I5I4TqqpnVEtW9uq4QdjU4aX3N/tuQ1mpmRThXg1MdNbtaorpXR4VwRPxqRHwmIj4dEe+OiO+JiKsj4kMR8YXG72cXFSxsPghurX0jI9z60EMcO3uWWx96yCJYPa2VfxvSas30/3U+VdWRfeN7X9uzRkTE9cAvA8/PzKWIuA+4HXg+MJOZb4mI1wOvB15XSLQ4CE7aiP821K5m+/86I4Tqxr7xva/TrhFXAX0RcRWwB5gHRoB7G/ffC9zW4XNcwUFw0vp2b9BXc6N2aYX9f6X1Xb1mrNFW7eo+bRfCmfkVYBI4DywAi5n5QeC6zFxobLMAXNvKfrca7OMgOGkDEa21Sw32/5VUV510jXg2y1d/bwC+AfyHiPgnzT7+woULDA0NXb49OjrKT153navBqXBTU1NMTU2t3NzbymPXy9PR0dECoyvOxW98o6V2VUsneQrr5+ozf/RHm1rlzRXh1IrtyNWqHlcfWzPuYqt2VUezeRqZ2dYTRMSrgVdk5l2N2/8M+BFgGLglMxciYgD4cGbetPbxQ0NDeerUqSvaHjx6dHnE+xp9g4Pc+tBDbcUprRYRpzNzaOstl62Xp1Xlv5/e0WqewlNzde1KcLB8ldcBbipSEblaZQfvuYdz6/QHPtDfz5d/5Vd2PiC1ZbM87a
SP8HngRyJiT0QEywXw54AHgOONbY4DTc/d5GAfqX3XvvSlLbWrtznaXercT954Y0vt6j5td43IzP8cEX8MfBx4AvgEMAV8L3BfRNzFcrH86mb32TcwsP4VLQfCSVv62n/8jy21q7c52l3q3Pu+8IWW2tV9Opo1IjPfkJmHM/OFmflPM/PbmfnXmTmcmTc2fj/W7P4cCCe1z29UtJozQUid84Sy91VqZTlXg5Pa59SCWs2ZIKTOeULZ+ypVCIOrwUnt8hsVreZKcFLnPKHsfW33EZZULU4tqLVcCU7qjFML9j4LYamH7BsZsfCVpAJ5QtnbKtc1QpIkSdoJFsKSJEmqJQthSZIk1ZKFsCRJkmrJQliSJEm1ZCEsSZKkWrIQliRJUi1ZCEuSJKmWLIQlSZJUSxbCkiRJqiULYUmSJNWShbAkSZJqyUJYkiRJtWQhLEmSpFqyEJYkSVItdVQIR8SzIuKPI+JMRHwuIn40Iq6OiA9FxBcav59dVLCSJElSUTq9IvzbwPsz8zDwIuBzwOuBmcy8EZhp3JYkSZIqpe1COCK+H/gx4B0AmfmdzPwGMALc29jsXuC2zkKUJEmSitfJFeEfAC4Avx8Rn4iIt0fEM4HrMnMBoPH72gLilCRJkgrVSSF8FfDDwNsy84eAv6OFbhAXLlxgaGjo8s/U1FQHoUgbm5qaupxnwN5WHmueaqd0kqdgrmrnmKvqBs3maWRmW08QEX8f+GhmHmzcPspyIXwIuCUzFyJiAPhwZt609vFDQ0N56tSptp5baldEnM7MoWa3N09VhlbzFMxVlcNcVTfYLE/bviKcmX8FPBoRK0XuMPBZ4AHgeKPtODDd7nNIkiRJ2+WqDh//z4ETEfF04IvAnSwX1/dFxF3AeeDVHT6HJEmSVLiOCuHM/CSw3qXm4U72K0mSJG03V5aTJElSLVkIS5IkqZYshCVJklRLFsKSJEmqJQthSZIk1ZKFsCRJkmrJQliSJEm11PWFcBXXKK9iTGBcZarqa6xiXFWMCaobV9Gq+DqrGBMYV5mq+hqNq3lViclCeBtUMSYwrjJV9TVWMa4qxgTVjatoVXydVYwJjKtMVX2NxtW8qsTU9YWwJEmS1I7IzHKeOOICcK6AXe0Fvl7AfopUxZjAuAAOZOY1zW7c43kK1YyrijFBhfMUej5XqxgTGBeUl6u+962pYlyVyNPSCmFJkiSpTHaNkCRJUi1ZCEuSJKmWuqYQjoibIuKTq36+GRG/EhFvjIivrGr/yR2I5Z0R8bWI+PSqtqsj4kMR8YXG72evuu/uiDgbEZ+PiJfvYEy/ERFnIuKRiPjTiHhWo/1gRCytes9+dzti2iSuDT+znXivtltVcrWKebpJXObqDqtKnjZiMVc7i6ln8xSqk6vmaSFxVS9XM7PrfoBdwF8BB4A3AmM7/Pw/Bvww8OlVbf8GeH3j79cDv974+/nAp4BnADcAfwns2qGY/iFwVePvX18V08HV25XwXq37me3Ue7XDuVJarlYxTzeJy1wt8cdjavfkap3ztPGaPKY2F5fH1CZ/uuaK8BrDwF9mZhEjpFuWmR8BHlvTPALc2/j7XuC2Ve3vycxvZ+aXgLPAS3Yipsz8YGY+0bj5UWBf0c/bTlyb2JH3aoeVlqtVzNON4jJXS+cxtcm4ys7VmucpeExtKq6y83SjuDZRWq52ayF8O/DuVbf/p8bl/3eu/lpih12XmQsAjd/XNtqvBx5dtd1co22n/TzwZ6tu3xARn4iI/ycijpYQz3qfWVXeqyJVLVernqdgrpahankK5mqr6pCnUL1cNU9bV6lc7bpCOCKeDrwK+A+NprcBzwV+EFgAfrOcyDYU67Tt6Jx1ETEOPAGcaDQtAPsz84eAfwH8YUR8/w6GtNFnVvp7VaQuy9VKvPfm6s7rsjyFirz3FcvVns9T6LpcrcR7X7E8hQrmatcVwsBPAB/PzK8CZOZXM/NSZn4X+D3K+9rnqxExAND4/bVG+xzwnF
Xb7QPmdyqoiDgOvBK4IxsdcRpfPfx14+/TLPfFed5OxbTJZ1bqe7UNqpirlczTRjzmajmqmKdgrjatJnkK1cxV87QFVczVbiyEX8Oqr0VWErDhp4BPP+URO+MB4Hjj7+PA9Kr22yPiGRFxA3Aj8LGdCCgiXgG8DnhVZj6+qv2aiNjV+PsHGjF9cSdiajznRp9Zae/VNqlirlYuT8FcLVkV8xTM1VZiqkOeQjVz1TxtLa7q5ep2j8Yr8gfYA/w10L+q7f8CZoFHGm/kwA7E8W6WL+lfZPks5i7g7wEzwBcav69etf04y2ddnwd+YgdjOstyn5tPNn5+t7Htfwd8huURmh8Hju3we7XhZ7YT71VdcrWKeWquVuunCnlqrpqn3ZKr5mlv5qpLLEuSJKmWurFrhCRJktQxC2FJkiTVkoWwJEmSaslCWJIkSbVkISxJkqRashCWJElSLVkIS5IkqZYshCVJklRL/z8sMr/IpZjx+AAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "mean_std_datasets_float = transform_datasets(datasets, scale_to_mean_std, discrete=False)\n", + "mean_std_datasets_discrete = transform_datasets(datasets, scale_to_mean_std, discrete=True)\n", + "\n", + "plot_scaled_quartets(mean_std_datasets_float, mean_std_datasets_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "3e80a982-14d8-4c78-8ea2-4554363383f1", + "metadata": {}, + "source": [ + "#### Scaling to a given range\n", + "There is no difference between the float and discrete options." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "47edeedd-c098-46dc-95e3-724242d43973", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "range_datasets_float = transform_datasets(datasets, scale_to_range, discrete=False, target_min=-20, target_max=200)\n", + "range_datasets_discrete = transform_datasets(datasets, scale_to_range, discrete=True, target_min=-20, target_max=200)\n", + "\n", + "plot_scaled_quartets(range_datasets_float, range_datasets_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "ecc30b65-54ed-409a-932e-4d605ac95732", + "metadata": {}, + "source": [ + "#### Scaling to a given target sum\n", + "The discrete option changes the shape of the distribution somewhat, particularly when the values are small." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "698268bc-e5fc-4a4d-86e2-220ff4f276a5", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "target_sum_datasets_float = transform_datasets(datasets, scale_to_sum, discrete=False)\n", + "target_sum_datasets_discrete = transform_datasets(datasets, scale_to_sum, discrete=True)\n", + "\n", + "plot_scaled_quartets(target_sum_datasets_float, target_sum_datasets_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "d7391d20-fb6e-4057-8e94-238a4a46be18", + "metadata": {}, + "source": [ + "#### Mixture distribution\n", + "These examples are closer to the real use case, where the final distribution is a mixture of several normal distributions with varying means.\n", + "\n", + "Code adapted from this SO [answer](https://stackoverflow.com/questions/47759577/creating-a-mixture-of-probability-distributions-for-sampling)." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "a6cb5d56-49d4-44fe-9bf7-eb4ed479f979", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_distribution(float_dataset, discrete_dataset):\n", + "\n", + "    fig, axs = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12, 6),\n", + "                            gridspec_kw={'wspace': 0.08, 'hspace': 0.08})\n", + "\n", + "    colors = [\"firebrick\", \"teal\"]\n", + "\n", + "    for i in range(2):\n", + "\n", + "        if i == 0:\n", + "            dataset = float_dataset\n", + "        else:\n", + "            dataset = discrete_dataset\n", + "\n", + "        axs[i].hist(dataset, bins=100, density=False, color=colors[i], alpha=0.5)\n", + "\n", + "\n", + "    axs[0].set_title(\"Float\", loc=\"left\")\n", + "    axs[1].set_title(\"Discrete\", loc=\"left\")\n", + "\n", + "    plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "7a91216a-b05d-4d74-937d-4ba9a8842455", + "metadata": {}, + "outputs": [], + "source": [ + "distributions = [\n", + " {\"type\": np.random.normal, \"kwargs\": {\"loc\": 20, \"scale\": 2}},\n", + " {\"type\": np.random.normal, \"kwargs\": {\"loc\": 10, \"scale\":
3}},\n", + " {\"type\": np.random.normal, \"kwargs\": {\"loc\": 5, \"scale\": 1}},\n", + "]\n", + "coefficients = np.array([0.5, 0.2, 0.3])\n", + "coefficients /= coefficients.sum()\n", + "sample_size = 100000\n", + "\n", + "num_distr = len(distributions)\n", + "data = np.zeros((sample_size, num_distr))\n", + "for idx, distr in enumerate(distributions):\n", + " data[:, idx] = distr[\"type\"](size=(sample_size,), **distr[\"kwargs\"])\n", + "random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)\n", + "sample = data[np.arange(sample_size), random_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "4d7e667b-c281-44a1-a30a-1d87679b02fa", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD8CAYAAACfF6SlAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAVQklEQVR4nO3df6wd9Znf8fcnXkpQEhQoF+T4R023jlRAXadcuUhUFU3SxZtUNVmJlZG6uFokR8ioRIrUmPwT0pUltsqPFdVi1WkiTJuEWkpSrAR2l9CgNBLg2JTEGEKxiktubNlks1HMP24wT/84X8OROff63B8+954775c0OnOemTnnOxr7Od/7zHdmUlVIkrrlXYvdAEnS6Jn8JamDTP6S1EEmf0nqIJO/JHWQyV+SOmjo5J9kRZL/leS77f3lSR5P8nJ7vaxv3XuSHEnyUpKb++LXJznUlt2fJAu7O5KkYcym53838GLf+x3AE1W1HniivSfJNcAW4FpgE/BAkhVtm13ANmB9mzbNq/WSpDkZKvknWQ18HPjPfeHNwJ42vwe4pS/+cFWdrqpXgCPAxiQrgUur6qnqXVn2UN82kqQR+p0h1/tz4N8B7+uLXVVVxwGq6niSK1t8FfB033pTLfbbNn9ufEZXXHFFrVu3bshmSpIADh48+Muqmphu+XmTf5J/CZysqoNJbhriOwfV8WuG+KDv3EavPMTatWs5cODAEF8rSToryf+dafkwZZ8bgX+V5CjwMPDhJP8VONFKObTXk239KWBN3/argWMtvnpA/B2qandVTVbV5MTEtD9ckqQ5Om/yr6p7qmp1Va2jdyL3f1TVvwb2AVvbaluBR9r8PmBLkouTXE3vxO7+ViI6leSGNsrn9r5tJEkjNGzNf5D7gL1J7gBeBW4FqKrDSfYCLwBvANur6kzb5k7gQeAS4LE2SZJGLEv9ls6Tk5NlzV+SZifJwaqanG65V/hKUgeZ/CWpg0z+ktRBJn9J6iCTvyR10HyGemqJW7fje2/NH73v44vYEklLjcl/melP+JI0Hcs+ktRBJn9J6iCTvyR1kMlfkjrI5C9JHWTyl6QOMvlLUgeZ/CWpg0z+ktRBJn9J6iCTvyR10HmTf5J3J9mf5CdJDif5fIvfm+QXSZ5r08f6trknyZEkLyW5uS9+fZJDbdn97UHukqQRG+bGbqeBD1fV60kuAn6U5OyD179cVV/oXznJNcAW4FrgA8D3k3y
wPcR9F7ANeBp4FNiED3GXpJE7b8+/el5vby9q00xPfd8MPFxVp6vqFeAIsDHJSuDSqnqqek+Nfwi4ZV6tlyTNyVC3dE6yAjgI/APgL6rqmSR/ANyV5HbgAPDpqvpbYBW9nv1ZUy322zZ/blzz5G2cJc3WUCd8q+pMVW0AVtPrxV9Hr4Tzu8AG4Djwxbb6oDp+zRB/hyTbkhxIcuC1114bpomSpFmY1Wifqvo18CSwqapOtB+FN4GvABvbalPAmr7NVgPHWnz1gPig79ldVZNVNTkxMTGbJkqShjDMaJ+JJO9v85cAHwV+1mr4Z30CeL7N7wO2JLk4ydXAemB/VR0HTiW5oY3yuR14ZOF2RZI0rGFq/iuBPa3u/y5gb1V9N8l/SbKBXunmKPBJgKo6nGQv8ALwBrC9jfQBuBN4ELiE3igfR/pI0iJIb+DN0jU5OVkHDhxY7GYsabM94evD3KXlL8nBqpqcbrlX+EpSB5n8JamDTP6S1EEmf0nqIJO/JHWQyV+SOsjkL0kdZPKXpA4y+UtSB5n8JamDTP6S1EEmf0nqIJO/JHWQyV+SOsjkL0kdZPKXpA4y+UtSB5n8JamDhnmA+7uT7E/ykySHk3y+xS9P8niSl9vrZX3b3JPkSJKXktzcF78+yaG27P72IHdJ0ogN0/M/DXy4qn4P2ABsSnIDsAN4oqrWA0+09yS5BtgCXAtsAh5oD38H2AVsA9a3adPC7YqkC23dju+9NWm8nTf5V8/r7e1FbSpgM7CnxfcAt7T5zcDDVXW6ql4BjgAbk6wELq2qp6r31PiH+raRJI3QUDX/JCuSPAecBB6vqmeAq6rqOEB7vbKtvgr4ed/mUy22qs2fG5ckjdhQyb+qzlTVBmA1vV78dTOsPqiOXzPE3/kBybYkB5IceO2114ZpoiRpFmY12qeqfg08Sa9Wf6KVcmivJ9tqU8Cavs1WA8dafPWA+KDv2V1Vk1U1OTExMZsmSpKGMMxon4kk72/zlwAfBX4G7AO2ttW2Ao+0+X3AliQXJ7ma3ond/a00dCrJDW2Uz+1922iEPGkn6XeGWGclsKeN2HkXsLeqvpvkKWBvkjuAV4FbAarqcJK9wAvAG8D2qjrTPutO4EHgEuCxNkmSRuy8yb+qfgp8aED8b4CPTLPNTmDngPgBYKbzBZKkERim5y+pw6YrD/bHj9738VE1RwvE2ztIUgeZ/CWpgyz7SJo3S0Djx56/JHWQyV+SOsiyj6QFZQloPJj8x5RX50qaD5O/pHewc7H8WfOXpA4y+UtSB1n2kQRY6ukae/6S1EEmf0nqIJO/JHWQyV+SOsgTvpIuGK/2Xbrs+UtSBw3zAPc1SX6Q5MUkh5Pc3eL3JvlFkufa9LG+be5JciTJS0lu7otfn+RQW3Z/e5C7JGnEhin7vAF8uqqeTfI+4GCSx9uyL1fVF/pXTnINsAW4FvgA8P0kH2wPcd8FbAOeBh4FNuFD3CVp5M7b86+q41X1bJs/BbwIrJphk83Aw1V1uqpeAY4AG5OsBC6tqqeqqoCHgFvmuwOSpNmbVc0/yTrgQ8AzLXRXkp8m+VqSy1psFfDzvs2mWmxVmz83LkkasaFH+yR5L/At4FNV9Zsku4A/Baq9fhH4E2BQHb9miA/6rm30ykOsXbt22CZKmiVv6dBdQ/X8k1xEL/F/vaq+DVBVJ6rqTFW9CXwF2NhWnwLW9G2+GjjW4qsHxN+hqnZX1WRVTU5MTMxmfyRJQzhvz7+NyPkq8GJVfakvvrKqjre3nwCeb/P7gG8k+RK9E77rgf1VdSbJqSQ30Csb3Q78x4XbFc2F47Clbhqm7HMj8MfAoSTPtdhngduSbKBXujkKfBKgqg4n2Qu8QG+k0PY20gfgTuBB4BJ6o3wc6SNJi+C8yb+qfsTgev2jM2yzE9g5IH4AuG42DZQkLTyv8JWkDjL5S1IHmfwlqYNM/pLUQd7SWeoYL+wS2POXpE6y5y9pJLygcGmx5y9JHWTyl6Q
OMvlLUgeZ/CWpg0z+ktRBJn9J6iCTvyR1kMlfkjrI5C9JHeQVvmPEe7JoufBq38Vnz1+SOsjkL0kddN7kn2RNkh8keTHJ4SR3t/jlSR5P8nJ7vaxvm3uSHEnyUpKb++LXJznUlt2fZNCzgSVJF9gwPf83gE9X1T8EbgC2J7kG2AE8UVXrgSfae9qyLcC1wCbggSQr2mftArYB69u0aQH3RZI0pPMm/6o6XlXPtvlTwIvAKmAzsKettge4pc1vBh6uqtNV9QpwBNiYZCVwaVU9VVUFPNS3jSRphGY12ifJOuBDwDPAVVV1HHo/EEmubKutAp7u22yqxX7b5s+ND/qebfT+QmDt2rWzaaKkARwppnMNfcI3yXuBbwGfqqrfzLTqgFjNEH9nsGp3VU1W1eTExMSwTZQkDWmo5J/kInqJ/+tV9e0WPtFKObTXky0+Bazp23w1cKzFVw+IS5JGbJjRPgG+CrxYVV/qW7QP2NrmtwKP9MW3JLk4ydX0TuzubyWiU0luaJ95e982kqQRGqbmfyPwx8ChJM+12GeB+4C9Se4AXgVuBaiqw0n2Ai/QGym0varOtO3uBB4ELgEea5MkacTOm/yr6kcMrtcDfGSabXYCOwfEDwDXzaaBkqSF5xW+ktRBJn9J6iCTvyR1kLd01lu8za7UHfb8JamDTP6S1EEmf0nqIJO/JHWQyV+SOsjRPpIWlaPMFoc9f0nqIJO/JHWQZR9pmfLpXZqJPX9J6iCTvyR1kMlfkjrI5C9JHWTyl6QOGuYB7l9LcjLJ832xe5P8IslzbfpY37J7khxJ8lKSm/vi1yc51Jbd3x7iLklaBMP0/B8ENg2If7mqNrTpUYAk1wBbgGvbNg8kWdHW3wVsA9a3adBnSpJG4LzJv6p+CPxqyM/bDDxcVaer6hXgCLAxyUrg0qp6qqoKeAi4ZY5tliTN03xq/ncl+WkrC13WYquAn/etM9Viq9r8uXFJ0iKY6xW+u4A/Baq9fhH4E2BQHb9miA+UZBu9EhFr166dYxOXB6/SlHQhzKnnX1UnqupMVb0JfAXY2BZNAWv6Vl0NHGvx1QPi033+7qqarKrJiYmJuTRRkjSDOSX/VsM/6xPA2ZFA+4AtSS5OcjW9E7v7q+o4cCrJDW2Uz+3AI/NotyRpHs5b9knyTeAm4IokU8DngJuSbKBXujkKfBKgqg4n2Qu8ALwBbK+qM+2j7qQ3cugS4LE2SdJbvLf/6Jw3+VfVbQPCX51h/Z3AzgHxA8B1s2qdJOmC8JbO0jLiAAENy+SvgfzzW1revLePJHWQyV+SOsjkL0kdZPKXpA4y+UtSB5n8JamDTP6S1EEmf0nqIJO/JHWQV/hKWpK8yvzCMvlLY877+WguLPtIUgeZ/CWpg0z+ktRBJn9J6iCTvyR10HmTf5KvJTmZ5Pm+2OVJHk/ycnu9rG/ZPUmOJHkpyc198euTHGrL7m8PcpckLYJhev4PApvOie0Anqiq9cAT7T1JrgG2ANe2bR5IsqJtswvYBqxv07mfqSVq3Y7vvTVJWh6GeYD7D5OsOye8Gbipze8BngQ+0+IPV9Vp4JUkR4CNSY4Cl1bVUwBJHgJuAR6b9x5IHeQPseZrrhd5XVVVxwGq6niSK1t8FfB033pTLfbbNn9uXJLOy6t9F95CX+E7qI5fM8QHf0iyjV6JiLVr1y5My8aIvTpJF9pcR/ucSLISoL2ebPEpYE3fequBYy2+ekB8oKraXVWTVTU5MTExxyZKkqYz1+S/D9ja5rcCj/TFtyS5OMnV9E7s7m8lolNJbmijfG7v20aSNGLnLfsk+Sa9k7tXJJkCPgfcB+xNcgfwKnArQFUdTrIXeAF4A9heVWfaR91Jb+TQJfRO9HqyV5IWyTCjfW6bZtFHpll/J7BzQPwAcN2sWidJuiC8wleSOsjkL0kd5MNcNCuOt5aWB3v+ktRB9vwljRX/+lwYJn9pDHjVtxaayV+Lbtie3HQJ0N6fNHsmfy0p5yZ
4E7t0YZj8taQNU+6wBizNnslfczbbpGvdWlo6TP5acIvZEx/mB8a/DiSTvxaIvXppvJj8pSXKH1RdSCZ/XVAmMGlp8vYOktRBJn9J6iDLPuocrwuQTP6SxphXhM/dvJJ/kqPAKeAM8EZVTSa5HPhvwDrgKPBHVfW3bf17gDva+v+2qv5qPt8vzZd/BairFqLn/8+r6pd973cAT1TVfUl2tPefSXINsAW4FvgA8P0kH+x7wLu0qJbCD4GjozQqF6Lssxm4qc3vAZ4EPtPiD1fVaeCVJEeAjcBTF6ANY8f/9JJGab7Jv4C/TlLAf6qq3cBVVXUcoKqOJ7myrbsKeLpv26kWk5acpfBXgHQhzTf531hVx1qCfzzJz2ZYNwNiNXDFZBuwDWDt2rXzbKIk6VzzGudfVcfa60ngO/TKOCeSrARoryfb6lPAmr7NVwPHpvnc3VU1WVWTExMT82miJGmAOSf/JO9J8r6z88DvA88D+4CtbbWtwCNtfh+wJcnFSa4G1gP75/r9kqS5m0/Z5yrgO0nOfs43quovk/wY2JvkDuBV4FaAqjqcZC/wAvAGsN2RPpIn+xeS52qGN+fkX1X/B/i9AfG/AT4yzTY7gZ1z/U5pMZhQtBx5bx9J6iCTvyR1kPf2kWZhuvr8bMtB1vm12Ez+i8T//MuL5wU0bkz+kpYlf5BnZvKXFph/1WkceMJXkjrInr+kZc8S0DvZ85ekDjL5S1IHmfwlqYNM/pLUQSZ/SeogR/uMkOO/pcXnyJ8ee/6S1EEmf0nqIMs+kjqryyUgk/8FZp1fGg9d+yEYedknyaYkLyU5kmTHqL9fkjTinn+SFcBfAP8CmAJ+nGRfVb0wynZcaPb2pfHWhb8CRl322QgcaQ9/J8nDwGZgWSV/ScvHcv0hGHXyXwX8vO/9FPBPRtyGaU13kO3JS4LhcsG4/ECMOvlnQKzesVKyDdjW3r6e5KUL2qp3uiJ/xi9H/J0XwhXgfiwhy2U/YPnsy4LvR/5sIT9taIP24+/NtMGok/8UsKbv/Wrg2LkrVdVuYPeoGnWuJAeqanKxvn+huB9Ly3LZD1g++9Ll/Rj1aJ8fA+uTXJ3k7wBbgH0jboMkdd5Ie/5V9UaSu4C/AlYAX6uqw6NsgyRpES7yqqpHgUdH/b2ztGglpwXmfiwty2U/YPnsS2f3I1XvON8qSVrmvLGbJHWQyb/Pcrn1RJKjSQ4leS7JgcVuz2wk+VqSk0me74tdnuTxJC+318sWs43DmGY/7k3yi3ZcnkvyscVs4zCSrEnygyQvJjmc5O4WH6tjMsN+jOMxeXeS/Ul+0vbl8y0+q2Ni2adpt5743/TdegK4bRxvPZHkKDBZVWM3DjvJPwNeBx6qquta7D8Av6qq+9qP8mVV9ZnFbOf5TLMf9wKvV9UXFrNts5FkJbCyqp5N8j7gIHAL8G8Yo2Myw378EeN3TAK8p6peT3IR8CPgbuAPmcUxsef/trduPVFV/w84e+sJjVBV/RD41TnhzcCeNr+H3n/aJW2a/Rg7VXW8qp5t86eAF+ldqT9Wx2SG/Rg71fN6e3tRm4pZHhOT/9sG3XpiLP9x0PuH8NdJDrarpcfdVVV1HHr/iYErF7k983FXkp+2stCSLpWcK8k64EPAM4zxMTlnP2AMj0mSFUmeA04Cj1fVrI+Jyf9tQ916YkzcWFX/GPgDYHsrQWjx7QJ+F9gAHAe+uKitmYUk7wW+BXyqqn6z2O2ZqwH7MZbHpKrOVNUGendJ2Jjkutl+hsn/bUPdemIcVNWx9noS+A69ktY4O9FqtmdrtycXuT1zUlUn2n/aN4GvMCbHpdWVvwV8vaq+3cJjd0wG7ce4HpOzqurXwJPAJmZ5TEz+b1sWt55I8p52Qosk7wF+H3h+5q2WvH3A1ja/FXhkEdsyZ2f/YzafYAyOSzu5+FXgxar6Ut+isTom0+3HmB6TiSTvb/OXAB8FfsYsj4mjffq0YV5/ztu3nti5uC2avSR/n15
vH3pXcH9jnPYjyTeBm+jdpfAE8DngvwN7gbXAq8CtVbWkT6ZOsx830SsvFHAU+OTZGu1SleSfAv8TOAS82cKfpVcvH5tjMsN+3Mb4HZN/RO+E7gp6Hfi9VfXvk/xdZnFMTP6S1EGWfSSpg0z+ktRBJn9J6iCTvyR1kMlfkjrI5C9JHWTyl6QOMvlLUgf9fxLTEM+XfaDLAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.hist(sample, bins=100)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "1e759c95-33bc-43d6-9188-4929184936e9", + "metadata": {}, + "source": [ + "#### Scale to positive mean and standard distribution" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "257c5e10-2172-452f-b28c-7d681e227e6a", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAswAAAF1CAYAAAD8/Lw6AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAa70lEQVR4nO3df7CtV1kf8O9DovzSCDE3NCSBxBJjgVGUNGJt7Z3GTkKLhj+kxlZJW2oqpfXWaoXIVK/TRnF0tJcqWIo2oTrQjFVJGWmNae+oFUxvFIQQbokCyTVpEqVIsEpJfPrHfpPse3POuufcnHP2Pmd/PjN79rvXft+917uSu+Z71l7vequ7AwAArO1Ji64AAAAsM4EZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGdm1grqoLqqqr6vRF1wVgVVTVT1bVv1h0PQB20q4IzFX1sar6k6r69COPJM/ews/vqnreVn0ewG41198+WFWfrKrfqKpvq6onJUl3f1t3/8sdrM/+qjq2U98HsJZdEZgnX9fdn/fII8k9i64QwB71dd39+Umem+QNSV6b5Ke244v8SgjsBrspMA9V1bOr6qaq+kRV3VlV3zr33qVV9Z5ptOTeqvrxqvrc6b1fnXZ7/zR6/Y0LOQGAJdPdf9TdNyX5xiRXV9ULq+r6qvpXSVJVZ1XVu6a+9RNV9WuPjERX1flV9fNV9UBV/WFV/fhU/ner6n9U1Y9V1SeSHKyqJ1fVj1TVXVV13zTt46lV9fQk707y7LlfGJ9dVU+qqtdV1e9On31jVZ25oGYCVsCeCcxJ3p7kWGZTNb4hyQ9U1WXTew8n+Y4kZyX5qiSXJflHSdLdXzPt82XT6PV/3NFaAyy57r41s/71r5zw1ndO5fuSPCvJ9yTpqjotybuSfDzJBUnOTfKOueO+MsnvJTk7yXVJfijJFyd5UZLnTft/b3f/cZKXJrln7hfGe5J8e5KXJ/mrmfX5/yfJT2zlOQPM202B+RenUYxPVtUvzr9RVecn+ctJXtvdf9rd70vy1iTfkiTdfVt3v7e7H+rujyX5t5l1tABszD1JThzF/WySc5I8t7s/292/1t2d5NLMguw/7+4/nvrlX5//rO7+N939UJI/TfKtSb6juz/R3Q8m+YEkVw3q8g+TvL67j3X3Z5IcTPINpncA22U3dS4v7+5feeRFVV0w996zkzzS0T7i40kumfb94iQ/Or1+Wmbnfdt2VxhgDzk3ySdOKPvhzMLqL1dVkrylu9+Q5PwkH58C8Vruntvel1m/fNv0GUlSSU4b1OW5SX6hqv5sruzhzEa5f/+kZwKwSbtphHnkniRnVtXnz5U9J491nG9O8uEkF3X3GZn9bFgB4KSq6i9mFpjnR4nT3Q9293d29xcl+bok/2yaCnd3kucMRnx7bvsPkvxJkhd09zOmxxdMF3efuO8j7k7y0rn9n9HdT+luYRnYFnsiMHf33Ul+I8kPVt
VTqupLk7wqyc9Ou3x+kk8l+XRVfUmSV5/wEfcl+aKdqi/AblBVZ1TVyzKbf/wz3f2BE95/WVU9r2ZDw5/KbJT34SS3Jrk3yRuq6ulTv/zVa31Hd/9Zkn+X5Meq6uzpc8+tqsunXe5L8oVV9QVzh/1kkuuq6rnT/vuq6sqtOm+AE+2JwDz5pswuLrknyS8k+b7uvnl677uS/O0kD2bWMZ94Yd/BJDdM86P/1o7UFmB5/eeqejCzkdzXZzal7e+tsd9FSX4lyaeTvCfJm7r7cHc/nNmI8/OS3JXZhYGjFYhem+TOJO+tqk9Nn3lxknT3hzO7qPv3pj762UkOJbkps6kgDyZ5b2YXEgJsi5pdnwEAAKxlL40wAwDAlhOYAQBgQGAGAIABgRkAAAYEZgAAGFj6O/2dddZZfcEFFyy6GgALc9ttt/1Bd+/bzu/Q1wKrbtTXLn1gvuCCC3LkyJFFVwNgYarq49v9HfpaYNWN+lpTMgAAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYOD0RVcAAHaLg4cPP7a9f//C6gHsLCPMAAAwIDADAMCAwAwAAAMCMwAADAjMAAAwIDADAMCAZeUAYGB+KTlgNRlhBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAgQ0H5qo6rap+u6reNb0+s6purqqPTM/PnNv32qq6s6qOVtXlc+UvrqoPTO+9sapqa08HAAC21mZGmA8kuWPu9euS3NLdFyW5ZXqdqnp+kquSvCDJFUneVFWnTce8Ock1SS6aHlc8odoDwBI4ePjwow9g79lQYK6q85L8zSRvnSu+MskN0/YNSV4+V/6O7v5Md380yZ1JLq2qc5Kc0d3v6e5O8ra5YwAAYCltdIT5Xyf57iR/Nlf2rO6+N0mm57On8nOT3D2337Gp7Nxp+8Tyx6mqa6rqSFUdeeCBBzZYRQA2Q18LsDEnDcxV9bIk93f3bRv8zLXmJfeg/PGF3W/p7ku6+5J9+/Zt8GsB2Ax9LcDGnL6Bfb46yddX1d9I8pQkZ1TVzyS5r6rO6e57p+kW90/7H0ty/tzx5yW5Zyo/b41yAABYWicdYe7ua7v7vO6+ILOL+f5bd39zkpuSXD3tdnWSd07bNyW5qqqeXFUXZnZx363TtI0Hq+ol0+oYr5w7BgAAltJGRpjX84YkN1bVq5LcleQVSdLdt1fVjUk+lOShJK/p7oenY16d5PokT03y7ukBAABLa1OBubsPJzk8bf9hksvW2e+6JNetUX4kyQs3W0kAAFgUd/oDAIABgRkAAAYEZgAAGBCYAQBgQGAGAIABgRkAAAYEZgAAGHgiNy5hCR09dOjR7YsPHFhgTQAA9gaBeQ+YD8kAAGwtUzIAAGBAYAYAgAGBGQAABgRmAAAYEJgBAGBAYAYAgAGBGQAABgRmAAAYEJgBAGBAYAYAgAGBGQAABgRmAAAYEJgBAGDg9EVXAACWzcHDhxddBWCJGGEGAIABgRkAAAYEZgAAGBCYAQBgQGAGAIABq2TsUkcPHVp0FQAAVoIRZgAAGDDCDABbaH4N54P79y+sHiwv/4/sPgIzAMA2czOc3c2UDAAAGBCYAQBgQGDew44eOmQ1DQCAJ0hgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAG3OkPAGAbuLvf3iEwAwAsgfmAfXD//oXVg8czJQMAAAYEZgAAGDAlAwBgyZiesVyMMAMAwIDADAAAAwIzAAAMCMwAADAgMAMAwIDADAAAA5aVAwDYIm6HvTcZYQYAgAGBGQAABgRmAAAYEJgBAGBAYAYAgAGBGQAABgRmAAAYOGlgrqqnVNWtVf
X+qrq9qr5/Kj+zqm6uqo9Mz8+cO+baqrqzqo5W1eVz5S+uqg9M772xqmp7TgvYrKOHDj36AAAes5ER5s8k+Wvd/WVJXpTkiqp6SZLXJbmluy9Kcsv0OlX1/CRXJXlBkiuSvKmqTps+681Jrkly0fS4YutOBQAAtt5JA3PPfHp6+TnTo5NcmeSGqfyGJC+ftq9M8o7u/kx3fzTJnUkurapzkpzR3e/p7k7ytrljAABgKW1oDnNVnVZV70tyf5Kbu/s3kzyru+9Nkun57Gn3c5PcPXf4sans3Gn7xHIAAFhaGwrM3f1wd78oyXmZjRa/cLD7WvOSe1D++A+ouqaqjlTVkQceeGAjVQRgk/S1ABuzqVUyuvuTSQ5nNvf4vmmaRabn+6fdjiU5f+6w85LcM5Wft0b5Wt/zlu6+pLsv2bdv32aqCMAG6WsBNmYjq2Tsq6pnTNtPTfK1ST6c5KYkV0+7XZ3kndP2TUmuqqonV9WFmV3cd+s0bePBqnrJtDrGK+eOYRtZ/QAA4NSdvoF9zklyw7TSxZOS3Njd76qq9yS5sapeleSuJK9Iku6+vapuTPKhJA8leU13Pzx91quTXJ/kqUnePT0AAFjHwcOHH9vev39h9VhlJw3M3f07Sb58jfI/THLZOsdcl+S6NcqPJBnNfwYAgKXiTn/A45jCAwCP2ciUDGCPEooBnrj5KRPsTUaYAQBgQGAGAIABgRkAAAbMYQbWNT/H+eIDBxZYEwBYHCPMAAAwIDADAMCAKRnAhpieAcCqMsIMAAADRph3ETeZAADYeQIzrBh/eAHA5piSAQAAAwIzAAAMCMwAADBgDjOsAPOWAeDUCcwAAKfg4OHDi64CO8SUDAAAGDDCDADbZH4E8uD+/QurB/DEGGEGAIABgRkAAAZMyQCAuIALWJ/ADGza/DJ1Fx84sMCaAMD2MyUDAAAGBGYAABgwJQMAYJewVOFiGGEGAIABgRkAAAYEZgAAGDCHGfao+aXfAIBTZ4QZAAAGBGYAABgwJWPFuEMbAMDmGGEGAIABgRkAAAYEZgAAGBCYAQBgQGAGAIABgRkAAAYEZgAAGLAOM+whbocNAFvPCDMAAAwIzAAAMCAwAwDAgDnMwBMyP2/64gMHFlgTANgeRpgBAGDACDMAwC508PDhx7b3719YPVaBEWYAABgQmAEAYEBgBgCAAXOYAQAGzBXGCDMAAAwIzAAAMCAwAwDAgMAMAAADAjMAAAxYJWPJHT10aNFVgA2b///14gMHFlgTANg6RpgBAGBAYAYAgAGBGQAABgRmAAAYOGlgrqrzq+q/V9UdVXV7VR2Yys+sqpur6iPT8zPnjrm2qu6sqqNVdflc+Yur6gPTe2+sqtqe0wIAgK2xkRHmh5J8Z3f/hSQvSfKaqnp+ktcluaW7L0pyy/Q603tXJXlBkiuSvKmqTps+681Jrkly0fS4YgvPBQAAttxJA3N339vdvzVtP5jkjiTnJrkyyQ3Tbjckefm0fWWSd3T3Z7r7o0nuTHJpVZ2T5Izufk93d5K3zR0DAABLaVPrMFfVBUm+PMlvJnlWd9+bzEJ1VZ097XZukvfOHXZsKvvstH1iOfAEWKsbALbXhi/6q6rPS/KfkvzT7v7UaNc1ynpQvtZ3XVNVR6rqyAMPPLDRKgKwCfpagI3ZUGCuqs/JLCz/bHf//FR83zTNItPz/VP5sSTnzx1+XpJ7pvLz1ih/nO5+S3df0t2X7Nu3b6PnAsAm6GsBNmYjq2RUkp9Kckd3/+jcWzcluXravjrJO+fKr6qqJ1fVhZld3HfrNH3jwap6yfSZr5w7BgD2tIOHDz/6AHaXjcxh/uok35LkA1X1vqnse5K8IcmNVfWqJHcleUWSdPftVXVjkg9ltsLGa7r74em4Vye5PslTk7x7egAAwNI6aWDu7l/P2vOPk+SydY65Lsl1a5QfSfLCzVQQAAAWyZ3+AABgQGAGAIABgRkAAAYEZgAAGBCYAQBgYFO3xgYAWA
XWy2aewLzCjh469Oj2xQcOLLAmAADLy5QMAAAYEJgBAGBAYAYAgAGBGQAABlz0B8BKsxoCcDJGmAEAdrmDhw/7428bGWEGtoVlCwHYK4wwAwDAgMAMAAADpmTALjQ/3QEA2F5GmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYEBgBgCAAYEZAAAGBGYAABgQmAEAYOD0RVeAxzt66NCiqwAAwMQIMwAADAjMAAAwYEoGsO3mpxldfODAAmsCsL6Dhw8vugosKYEZAGAPmv8D4OD+/Qurx15gSgYAAAwYYYZdwuopALAYRpgBAGDACDNJXJQFALAeI8wAADBghBkAdpjVC2B3McIMAAADAjMAAAwIzAAAMCAwAwDAgMAMAAADAjMAAAwIzAAAMGAdZgBgZc2viQ3rMcIMAAADAjMAAAyYkgFL7OihQ4uuwpabP6eLDxxYYE0AYGOMMAMAwIDADAAAA6ZkAADscfOrgRzcv39h9ditjDADAMCAwAwAAAMCMwAADJjDDMDKcXc3YDOMMAMAwIARZgBYIKsXwPI76QhzVf10Vd1fVR+cKzuzqm6uqo9Mz8+ce+/aqrqzqo5W1eVz5S+uqg9M772xqmrrT4etcPTQoUcfAACrbiNTMq5PcsUJZa9Lckt3X5Tklul1qur5Sa5K8oLpmDdV1WnTMW9Ock2Si6bHiZ8JAABL56RTMrr7V6vqghOKr0yyf9q+IcnhJK+dyt/R3Z9J8tGqujPJpVX1sSRndPd7kqSq3pbk5Une/YTPAPYYI/sAsFxOdQ7zs7r73iTp7nur6uyp/Nwk753b79hU9tlp+8TyNVXVNZmNRuc5z3nOKVYRWHbzfxxcfODAAmuymvS1ABuz1Rf9rTUvuQfla+rutyR5S5Jccskl6+63lxhVBHbaKva1AKfiVAPzfVV1zjS6fE6S+6fyY0nOn9vvvCT3TOXnrVEOALCjrMPNZp3qOsw3Jbl62r46yTvnyq+qqidX1YWZXdx36zR948Gqesm0OsYr544BAIClddIR5qp6e2YX+J1VVceSfF+SNyS5sapeleSuJK9Iku6+vapuTPKhJA8leU13Pzx91KszW3HjqZld7OeCPwBYMOtAw8ltZJWMb1rnrcvW2f+6JNetUX4kyQs3VTsAYEsIxnDq3OkPAGCF+ONp8wRmACCJIAXrEZgBYA8RemHrCcwAsGS2O/RaVg02R2BmyJ3YAIBVJzADwC5nxBi2l8AMLAW/ZsByWSuEz08PMVeaVSIwA8ASWy+YGlWGnSMws3TmRxrn7eVRx/XOGWCekAyLITCzo/zsDgDsNgIzCyM8A7BTjM7zRAjMLAVTEgCAZSUws2GbHRHe6hFkI9IAwCIIzOwII8gAe5cl5tjrBGaesEWM/G4kgBuFBgC2gsDMKTFiDACsCoEZANgypmewFwnMsEBG6gFg+QnMbCkBEADYa5606AoAAMAyM8IMAGwL85mXn/9GGyMws2e50QnA6hEA2Q6mZAAAwIARZmDp+HUAgGUiMLMSBDAA4FQJzKycR8LzooKzpfdg55jPCmwFgXmBBCcAVoU/XtjNXPQHAAADAjMAAAyYksHKciEgALARAjMAsKvNz4+G7WBKBgAADAjMAAAwYEoG7ABLCJ66Ra+bDWw9S8wtJ/9d1icwQ1wACLAMBDaWlcAMACwd4ZllYg4zAAAMCMwAADBgSgacYL0L9DY7t9mFfgBbz5rLLILADBvkwkCAxRCSWTSBeYcZdQQAlp2LLo8nMAO7ghF+ABZFYIYnyK8GALC3CcxwCoRkAFgdAjMAe4oLxICtJjADu475zADsJDcuAQCAAYEZAAAGBGYAABgwhxkAgA1Z1RuaGGEGAIABgRkAAAZMyQB2NUvMAbDdBOYd4K5wAAC7l8AMwK7n7n7AdhKYAQDYtFVaMUNgBvYM85kB2A4CM7AnCc8AO2evjzYLzNvEhX4AAHuDwAzAruRCP2
Cn7HhgrqorkhxKclqSt3b3G3a6DtvJyDIsH9MzAHbOI3/M7qWpGTt6p7+qOi3JTyR5aZLnJ/mmqnr+TtYBAAA2Y6dHmC9Ncmd3/16SVNU7klyZ5EM7XI9Tst4olVFl2D2MNi+vjVw0ZBoG7B7r/XvdjSPPOx2Yz01y99zrY0m+cofrsKbNhl4hGXa/jfw7XuuPY0H7idtM8BWSYe/aLaG6unvnvqzqFUku7+5/ML3+liSXdvc/OWG/a5JcM728OMnRbajOWUn+YBs+d7fSHsfTHsfTHo9ZRFs8t7v3bfWH7lBfm/j/Z562OJ72OJ72ON5Ot8e6fe1OB+avSnKwuy+fXl+bJN39gztWicfqcqS7L9np711W2uN42uN42uMx2mLztNljtMXxtMfxtMfxlqk9dvSivyT/M8lFVXVhVX1ukquS3LTDdQAAgA3b0TnM3f1QVf3jJP81s2Xlfrq7b9/JOgAAwGbs+DrM3f1LSX5pp793DW9ZdAWWjPY4nvY4nvZ4jLbYPG32GG1xPO1xPO1xvKVpjx2dwwwAALvNTs9hBgCAXWUlAnNVPaOqfq6qPlxVd1TVV1XVmVV1c1V9ZHp+5qLruVOq6juq6vaq+mBVvb2qnrJK7VFVP11V91fVB+fK1j3/qrq2qu6sqqNVdfliar191mmPH57+vfxOVf1CVT1j7r2Va4+5976rqrqqzpor29PtsVn62+Ppb/W38/S3x9tN/e1KBOYkh5L8l+7+kiRfluSOJK9Lckt3X5Tklun1nldV5yb59iSXdPcLM7v48qqsVntcn+SKE8rWPP+a3br9qiQvmI55U81u8b6XXJ/Ht8fNSV7Y3V+a5H8luTZZ6fZIVZ2f5K8nuWuubBXaY7P0txP9bRL97Ymuj/523vXZJf3tng/MVXVGkq9J8lNJ0t3/r7s/mdktuW+YdrshycsXUb8FOT3JU6vq9CRPS3JPVqg9uvtXk3zihOL1zv/KJO/o7s9090eT3JnZLd73jLXao7t/ubsfml6+N8l50/ZKtsfkx5J8d5L5Cz/2fHtshv52Tfpb/e2j9LfH20397Z4PzEm+KMkDSf59Vf12Vb21qp6e5FndfW+STM9nL7KSO6W7fz/Jj2T2V9u9Sf6ou385K9oec9Y7/7Vu537uDtdt0f5+kndP2yvZHlX19Ul+v7vff8JbK9keA/rbOfrbdelv16e/XdL+dhUC8+lJviLJm7v7y5P8cfb2z19D01yxK5NcmOTZSZ5eVd+82FottVqjbGWWlqmq1yd5KMnPPlK0xm57uj2q6mlJXp/ke9d6e42yPd0eJ6G/naO/3bSV/vekv13u/nYVAvOxJMe6+zen1z+XWYd+X1WdkyTT8/0Lqt9O+9okH+3uB7r7s0l+Pslfyuq2xyPWO/9jSc6f2++8zH5S3fOq6uokL0vyd/qx9SdXsT3+fGaB5/1V9bHMzvm3qurPZTXbY0R/ezz97dr0tyfQ3z5qafvbPR+Yu/t/J7m7qi6eii5L8qHMbsl99VR2dZJ3LqB6i3BXkpdU1dOqqjJrjzuyuu3xiPXO/6YkV1XVk6vqwiQXJbl1AfXbUVV1RZLXJvn67v6/c2+tXHt09we6++zuvqC7L8is0/6KqW9ZufYY0d8+jv52bfrbOfrbxyx1f9vde/6R5EVJjiT5nSS/mOSZSb4ws6tzPzI9n7noeu5ge3x/kg8n+WCS/5DkyavUHknentl8ws9m9o/xVaPzz+znod9NcjTJSxdd/x1qjzszmyv2vunxk6vcHie8/7EkZ61Ke5xC++lvj28P/a3+9mTtob/dBf2tO/0BAMDAnp+SAQAAT4TADAAAAwIzAAAMCMwAADAgMAMAwIDADAAAAwIzAAAMCMwAADDw/wE1IpVZpXNjCQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "mean_std_float = scale_to_mean_std(sample, discrete=False)\n", + "mean_std_discrete = scale_to_mean_std(sample, discrete=True)\n", + "\n", + "plot_distribution(mean_std_float, mean_std_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "621bc30a-880c-4099-9c31-97fafff4ca52", + "metadata": {}, + "source": [ + "#### Scale to negative mean and standard deviation" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "2bb24e13-9c77-48d2-a749-3b2993b8f73a", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "mean_std_float = scale_to_mean_std(sample, target_mean=-20, discrete=False)\n", + "mean_std_discrete = scale_to_mean_std(sample, target_mean=-20, discrete=True)\n", + "\n", + "plot_distribution(mean_std_float, mean_std_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "e1b6f2bb-0bb3-4839-aee5-07d966924deb", + "metadata": {}, + "source": [ + "#### Scale to a given range\n", + "Both ranges are given" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "bcbd655e-6930-40cd-bcc3-2f0b26d42abb", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAswAAAF1CAYAAAD8/Lw6AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAiIElEQVR4nO3df7Cc1X3f8ffHhGBqmxiCYIR+GBzLNMA02KgKGVJXDWnANKlwJ07kTg1tmciheCInTmuIp40yqRLSxnZFE5Pg2AMkrolmYgfVA00wjcZxA1ZEigGBVcsGgyyNIHZci0xCQf72j32E15e9R3uv7r27e+/7NbNznz3P8+yew957+Ojsec6TqkKSJEnSYC8bdQUkSZKkcWZgliRJkhoMzJIkSVKDgVmSJElqMDBLkiRJDQZmSZIkqWFiA3OSs5NUku8YdV0kaalI8ltJ/v2o6yFJC2kiAnOSJ5L8TZJnjz6As+bw9SvJ6+bq9SRpUvX1t4eTfD3JnyX56SQvA6iqn66qX17A+qxPsn+h3k+SBpmIwNz5sap65dEHcGDUFZKkRerHqupVwGuAG4H3AB+ejzfyW0JJk2CSAnNTkrOS7EjytST7kvxU3751Se7rRksOJvmNJN/Z7ft0d9jnutHrnxxJAyRpzFTV/62qHcBPAlcnuSDJrUn+I0CS05N8sutbv5bkT4+ORCdZleTjSZ5J8tUkv9GV/8sk/yvJB5J8DdiS5KQkv57kySSHumkfJyd5BXA3cFbfN4xnJXlZkuuTfLF77e1JThvRfyZJS8CiCczAx4D99KZq/DjwK0ku7fYdAX4WOB34AeBS4N8AVNWbumO+rxu9/v0FrbUkjbmq2kWvf/0HU3a9uytfBpwJ/AJQSU4APgl8GTgbWAHc0Xfe9wNfAs4AtgK/BrweuBB4XXf8f6iqvwbeDBzo+4bxAPAzwJXAP6TX5/8V8Jtz2WZJ6jdJgfkPu1GMryf5w/4dSVYBPwi8p6r+tqoeBH4HeDtAVT1QVfdX1QtV9QTw2/Q6WknScA4AU0dxnweWA6+pquer6k+rqoB19ILsv62qv+765c/0v1ZV/deqegH4W+CngJ+tqq9V1WHgV4CNjbq8A3hvVe2vqueALcCPO71D0nyZpM7lyqr61NEnSc7u23cWcLSjPerLwNru2NcD7++e/x167X5gvissSYvICuBrU8r+M72w+sdJAG6pqhuBVcCXu0A8yFN928vo9csPdK8BEOCERl1eA3wiyTf7yo7QG+X+yjFbIkkzNEkjzC0HgNOSvKqvbDXf6jh
vBj4PrKmqU+h9bRgkSceU5O/TC8z9o8RU1eGqendVvRb4MeDnuqlwTwGrGyO+1bf9l8DfAOdX1au7x3d1F3dPPfaop4A39x3/6qp6eVUZliXNi0URmKvqKeDPgF9N8vIkfw+4Bvhod8irgG8Azyb5u8C1U17iEPDahaqvJE2CJKck+VF6849/r6oenrL/R5O8Lr2h4W/QG+U9AuwCDgI3JnlF1y9fMug9quqbwIeADyQ5o3vdFUku6w45BHx3ku/qO+23gK1JXtMdvyzJhrlqtyRNtSgCc+dt9C4uOQB8AvjFqrqn2/fzwD8HDtPrmKde2LcFuK2bH/0TC1JbSRpf/z3JYXojue+lN6XtXw04bg3wKeBZ4D7gg1W1s6qO0Btxfh3wJL0LA1srEL0H2Afcn+Qb3WueC1BVn6d3UfeXuj76LGAbsIPeVJDDwP30LiSUpHmR3vUZkiRJkgZZTCPMkiRJ0pwzMEuSJEkNBmZJkiSpYejAnOSEJP87ySe756cluSfJF7qfp/Yde0N3e+q9fVc6k+SiJA93+25K36KbkiRJ0jiayQjzZuCxvufXA/dW1Rrg3u45Sc6jd4em84HLgQ92t0mF3nrIm+hdWb2m2y9JkiSNraHu9JdkJfBPgK3Az3XFG4D13fZtwE56SwNtAO7oblf6eJJ9wLokTwCnVNV93WveDlwJ3N1679NPP73OPvvsYdsjSYvOAw888JdVtWw+38O+VtJS1+prh7019n8B/h29G4AcdWZVHQSoqoNHF5yndzeo+/uO29+VPd9tTy1/iSSb6I1Es3r1anbv3j1kNSVp8Uny5Xl6XftaSeq0+tpjTsno7vL0dFU9MOz7DSirRvlLC6tuqaq1VbV22bJ5HVSRpCXLvlaShjPMCPMlwD9NcgXwcuCUJL8HHEqyvBtdXg483R2/H1jVd/5Kenff299tTy2XJEmSxtYxR5ir6oaqWllVZ9O7mO9/VtW/oHdb0qu7w64G7uy2dwAbk5yU5Bx6F/ft6qZvHE5ycbc6xlV950iSJEljadg5zIPcCGxPcg3wJPBWgKrak2Q78CjwAnBdVR3pzrkWuBU4md7Ffs0L/iRJkqRRm1Fgrqqd9FbDoKq+Clw6zXFb6a2oMbV8N3DBTCspSZIkjYp3+pMkSZIaDMySJElSg4FZkiRJajAwS5IkSQ0GZkmSJKnBwCxJkiQ1GJglSZKkBgOzJEmS1HA8d/rTGNq7bduL2+du3jzCmkjS4rVl585vba9fP7J6SFoYjjBLkiRJDY4wLwL9o8qSpPnTP7IsaelwhFmSJElqMDBLkiRJDQZmSZIkqcHALEmSJDUYmCVJkqQGA7MkSZLUYGCWJEmSGgzMkiRJUoOBWZIkSWowMEuSJEkNBmZJkiSpwcAsSZIkNRiYJUmSpAYDsyRJktRgYJYkSZIaDMySJElSg4FZkiRJajAwS5IkSQ0GZkmSJKnBwCxJkiQ1HDMwJ3l5kl1JPpdkT5Jf6sq3JPlKkge7xxV959yQZF+SvUku6yu/KMnD3b6bkmR+miVJkiTNje8Y4pjngB+qqmeTnAh8Jsnd3b4PVNWv9x+c5DxgI3A+cBbwqSSvr6ojwM3AJuB+4C7gcuBuJEmSpDF1zMBcVQU82z09sXtU45QNwB1V9RzweJJ9wLokTwCnVNV9AEluB67EwDwre7dtG3UVJEmSloSh5jAnOSHJg8DTwD1V9dlu1zuTPJTkI0lO7cpWAE/1nb6/K1vRbU8tlyRJksbWUIG5qo5U1YXASnqjxRfQm17xPcCFwEHgfd3hg+YlV6P8JZJsSrI7ye5nnnlmmCpKkmbIvlaShjOjVTKq6uvATuDyqjrUBelvAh8C1nWH7QdW9Z22EjjQla8cUD7ofW6pqrVVtXbZsmUzqaIkaUj2tZI0nGFWyViW5NXd9snADwOfT7K877C3AI902zuAjUlOSnIOsAbYVVUHgcNJLu5Wx7gKuHPumiJJkiTNvWFWyVgO3JbkBHo
Be3tVfTLJ7ya5kN60iieAdwBU1Z4k24FHgReA67oVMgCuBW4FTqZ3sZ8X/EmSJGmsDbNKxkPAGwaUv71xzlZg64Dy3cAFM6yjJEmSNDLDjDBrQvUvPXfu5s0jrIkkTa4tO3cOvX/L+vXzWhdJo+GtsSVJkqQGA7MkSZLUYGCWJEmSGgzMkiRJUoOBWZIkSWowMEuSJEkNBmZJkiSpwcAsSZIkNRiYJUmSpAYDsyRJktRgYJYkSZIaDMySJElSg4FZkiRJajAwS5IkSQ0GZkmSJKnBwCxJkiQ1GJglSZKkBgOzJEmS1GBgliRJkhoMzJIkSVKDgVmSJElqMDBLkiRJDQZmSZIkqcHALEmSJDUYmCVJkqQGA7MkSZLUYGCWJEmSGgzMkiRJUoOBWZIkSWr4jlFXQNJ42Ltt24vb527ePMKaSJI0XhxhliRJkhqOGZiTvDzJriSfS7InyS915acluSfJF7qfp/adc0OSfUn2Jrmsr/yiJA93+25KkvlpliRJkjQ3hpmS8RzwQ1X1bJITgc8kuRv4Z8C9VXVjkuuB64H3JDkP2AicD5wFfCrJ66vqCHAzsAm4H7gLuBy4e85bJUnSCGzZufNb2+vXj6weGm/+nkyeY44wV8+z3dMTu0cBG4DbuvLbgCu77Q3AHVX1XFU9DuwD1iVZDpxSVfdVVQG3950jSZIkjaWhLvpLcgLwAPA64Der6rNJzqyqgwBVdTDJGd3hK+iNIB+1vyt7vtueWj7o/TbRG4lm9erVw7dGkjQ0+1pp9BxtngxDXfRXVUeq6kJgJb3R4gsahw+al1yN8kHvd0tVra2qtcuWLRumipKkGbKvlaThzGiVjKr6OrCT3tzjQ900C7qfT3eH7QdW9Z22EjjQla8cUC5JkiSNrWFWyViW5NXd9snADwOfB3YAV3eHXQ3c2W3vADYmOSnJOcAaYFc3feNwkou71TGu6jtH82zvtm0vPiRJkjS8YeYwLwdu6+YxvwzYXlWfTHIfsD3JNcCTwFsBqmpPku3Ao8ALwHXdChkA1wK3AifTWx3DFTIkSZI01o4ZmKvqIeANA8q/Clw6zTlbga0DyncDrfnPkiRJ0ljxTn+SJElSw1DLyklanKab095ffu7mzQtVHUmSxpIjzJIkSVKDgVmSJElqMDBLkiRJDc5hliRJGgPeJnt8GZglNXkBoCRpqTMwS5IkzbP+0WNNHucwS5IkSQ2OMEuSJI0Z5zOPFwOzpKE5n1mStBQZmCfIdHdlkyRJ0vxxDrMkSZLU4AiztMT4TYUkSTNjYJYkaQqXAJPUz8AsSZI0D/yH1+LhHGZJkiSpwRFmaQlw3rIkSbPnCLMkSZLU4AizJEnSHHHe8uLkCLMkSZLUYGCWJEmSGgzMkiRJUoOBWZIkSWrwoj9JkqQx1n8h4Zb160dWj6XMwCxpVvrXdj538+YR1kSSpPnllAxJkiSpwcAsSZIkNRiYJUmSpAYDsyRJktRgYJYkSZIajhmYk6xK8idJHkuyJ8nmrnxLkq8kebB7XNF3zg1J9iXZm+SyvvKLkjzc7bspSeanWZIkSdLcGGZZuReAd1fVXyR5FfBAknu6fR+oql/vPzjJecBG4HzgLOBTSV5fVUeAm4FNwP3AXcDlwN1z0xRJkiRp7h0zMFfVQeBgt304yWPAisYpG4A7quo54PEk+4B1SZ4ATqmq+wCS3A5ciYFZkrQIebMJafGY0Y1LkpwNvAH4LHAJ8M4kVwG76Y1C/xW9MH1/32n7u7Lnu+2p5ZLmQf+NRSRJ0uwNfdFfklcCfwC8q6q+QW96xfcAF9IbgX7f0UMHnF6N8kHvtSnJ7iS7n3nmmWGrKEmaAftaSRrOUIE5yYn0wvJHq+rjAFV1qKqOVNU3gQ8B67rD9wOr+k5fCRzoylcOKH+JqrqlqtZW1dply5bNpD2SpCHZ10rScIZZJSPAh4HHqur9feXL+w57C/BIt70D2JjkpCTnAGuAXd1c6MNJLu5e8yrgzjl
[remaining base64-encoded PNG data omitted]\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "range_float = scale_to_range(sample, discrete=False, target_min=-100, target_max=200)\n", + "range_discrete = scale_to_range(sample, discrete=True, target_min=-100, target_max=200)\n", + "\n", + "plot_distribution(range_float, range_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "8ea5f81c-d72c-4924-b031-a0d026fac47f", + "metadata": {}, + "source": [ + "##### Special case if only one of the ranges is given (negative min)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "cc6c5ba5-1879-46d6-8366-48faeb237f90", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "range_float = scale_to_range(sample, discrete=False, target_min=-100)\n", + "range_discrete = scale_to_range(sample, discrete=True, target_min=-100)\n", + "\n", + "plot_distribution(range_float, range_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "15f0c099-9ca9-49ee-8535-9dcb2935e027", + "metadata": {}, + "source": [ + "##### Special case if only one of the ranges is given (small max)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "d16d1590-0c9b-4b82-8196-770d0e30a238", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "range_float = scale_to_range(sample, discrete=False, target_max=10)\n", + "range_discrete = scale_to_range(sample, discrete=True, target_max=10)\n", + "\n", + "plot_distribution(range_float, range_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "de68a66e-a980-41f4-be8a-76c1c22a1665", + "metadata": {}, + "source": [ + "##### Special case for 0 and 1" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "fa96ae3d-d85f-4a9b-87e0-adfaf6f87bd7", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "range_float = scale_to_range(sample, discrete=False, target_min=0, target_max=1)\n", + "range_discrete = scale_to_range(sample, discrete=True, target_min=0, target_max=1)\n", + "\n", + "plot_distribution(range_float, range_discrete)" + ] + }, + { + "cell_type": "markdown", + "id": "3726793b-93d0-4d75-8eb2-4139b319ade8", + "metadata": {}, + "source": [ + "#### Scale to a target sum\n", + "Given a large enough sum (corresponding to the 10_000 samples!), it's OK. But at smaller values zeroes will dominate." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "2a363f73-cb18-477f-a390-0e0956f619c9", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "sum_float = scale_to_sum(sample, discrete=False, target_sum=1_000_000)\n", + "sum_discrete = scale_to_sum(sample, discrete=True, target_sum=1_000_000)\n", + "\n", + "plot_distribution(sum_float, sum_discrete)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/environment.yml b/environment.yml index 25fe430..bbe8705 100644 --- a/environment.yml +++ b/environment.yml @@ -1,17 +1,18 @@ +# note that this file is provided for convenience, see exact versions in setup.cfg name: exhibit_latest channels: - defaults dependencies: - - dill=0.3.8 - - numpy=1.26.4 - - pandas=2.2.2 - - pip=24.2 - - python=3.12.4 - - pyyaml=6.0.1 - - scipy=1.13.1 - - shapely=2.0.5 - - sqlalchemy=2.0.30 + - dill + - numpy + - pandas + - pip + - python + - pyyaml + - scipy + - shapely + - sqlalchemy - pip: - - h3==3.7.7 - - pyarrow==17.0.0 - - sql-metadata==2.12.0 + - h3 + - pyarrow + - sql-metadata diff --git a/exhibit/core/generate/missing.py b/exhibit/core/generate/missing.py index 8575c50..01388e5 100644 --- a/exhibit/core/generate/missing.py +++ b/exhibit/core/generate/missing.py @@ -93,7 +93,7 @@ def add_missing_data(self): miss_pct = self.spec_dict["columns"][col_name]["miss_probability"] rands = rng.random(size=self.nan_data.shape[0]) # pylint: disable=no-member col_type = self.spec_dict["columns"][col_name]["type"] - miss_value = pd.NaT if col_type == "date" else np.NaN + miss_value = pd.NaT if col_type == "date" else np.nan repl_column = self.nan_data[col_name] # numpy default type 
detection messes up date columns in Pandas @@ -123,7 +123,7 @@ def add_missing_data(self): self.nan_data.loc[:, list(cols)] = np.where( (rands < miss_pct)[..., None], - (np.NaN, ) * len(cols), + (np.nan, ) * len(cols), self.nan_data.loc[:, list(cols)] ) @@ -139,7 +139,7 @@ def add_missing_data(self): self.nan_data.loc[:, geo_cols] = np.where( (rands < miss_pct)[..., None], - (np.NaN, ) * len(geo_cols), + (np.nan, ) * len(geo_cols), self.nan_data.loc[:, geo_cols] ) @@ -147,7 +147,7 @@ def add_missing_data(self): make_null_idx = self._find_make_null_idx() for idx, col_name in make_null_idx: - self.nan_data.loc[idx, col_name] = np.NaN + self.nan_data.loc[idx, col_name] = np.nan #5) Re-introduce the saved no_nulls rows from the original data not_null_idx = self._find_not_null_idx() diff --git a/exhibit/core/generate/weights.py b/exhibit/core/generate/weights.py index c888a1f..46e8b55 100644 --- a/exhibit/core/generate/weights.py +++ b/exhibit/core/generate/weights.py @@ -1,315 +1,315 @@ -''' -Mini module for generating the weights table & related outputs -''' - -# Standard library imports -from collections import namedtuple - -# External library imports -import pandas as pd -import numpy as np - -# Exhibit import -from ..constants import MISSING_DATA_STR -from ..utils import exceeds_inline_limit, is_paired -from ..sql import query_exhibit_database - -# EXPORTABLE METHODS -# ================== -def generate_weights_table(spec_dict, target_cols): - ''' - Lookup table for weights - - Parameters - ---------- - spec_dict : dict - original user spec - target_cols: - a subset of columns meant for the weights_table - - Returns - ------- - dictionary where index levels are keys and - the weight column is the lookup value (as namedtuple) - - Weights and probabilities should be at least 0.001; - even if the original, non-anonymised data has a smaller - probability. 
- ''' - - tuple_list = [] - - #second element in the tuple is the column's equal weight - #in case we're fitting a distribution - Weights = namedtuple("Weights", ["weight", "equal_weight"]) - - num_cols = ( - set(spec_dict["metadata"]["numerical_columns"]) - - set(spec_dict["derived_columns"]) - ) - - for cat_col in target_cols: - - val_count = spec_dict["columns"][cat_col]["uniques"] - equal_weight = 1 / val_count - full_anon_flag = False - - #if column is put into exhibit DB, weights are always uniform - if exceeds_inline_limit(spec_dict, cat_col): - - full_anon_flag = True - ws_df = _generate_weights_dataframe_from_sql(cat_col, spec_dict, num_cols) - - else: - #meaning, there are original_values, including weights - ws_df = spec_dict["columns"][cat_col]["original_values"] - - #get weights and values, from whatever WS was created - for num_col in num_cols: - - #if numerical column is missing from original_values DF (can happen when - #spec is generated outside of CLI / manually), use equal_weights. - if num_col not in ws_df.columns: #pragma: no cover - ws_df[num_col] = equal_weight - - ws = ws_df[num_col].astype(float) - # because we might've taken the FULL anon_set (150 or more), - # we need to make sure the weights are correct! 
- if not full_anon_flag: - ws /= ws.sum() - ws_vals = ws_df[cat_col] - - for val, weight in zip(ws_vals, ws): - - tuple_list.append( - (num_col, cat_col, val, Weights(weight, equal_weight)) - ) - - #collect everything into output_df - output_df = pd.DataFrame(tuple_list, - columns=["num_col", "cat_col", "cat_value", "weights"]) - - #move the indexed dataframe to dict for perfomance - result = ( - output_df - .set_index(["num_col", "cat_col", "cat_value"]) - .to_dict(orient="index") - ) - - return result - -def generate_weights(df, cat_col, num_col, ew=False): - ''' - Weights are generated for a each value in each categorical column - where 1 means 100% of the numerical column is allocated to that value - - Parameters - ---------- - df : pd.DataFrame - source dataframe - cat_col : str - categorical column - num_col : str - numerical column - ew : Boolean - equal_weights parameter from CLI - - Returns - ------- - List of weights in ascending order of values rounded to 3 digits. - ''' - - # since weights are used at a per-row basis, we need to take an average of the - # numerical column per each categorical value to avoid over- and under-sized effects - # of duplicate rows. All NAs in numerical columns are treated as zeroes. 
- weights = ( - df - .fillna({cat_col:MISSING_DATA_STR}) - .groupby([cat_col], observed=True)[num_col].mean().fillna(0) - ) - - temp_output = weights.sort_index(kind="mergesort") - - if MISSING_DATA_STR not in temp_output: - temp_output = pd.concat([temp_output, pd.Series( - index=[MISSING_DATA_STR], - data=0 - )]) - - #pop and reinsert missing data placeholder at the end of the list - else: - cached = temp_output[temp_output.index.str.contains(MISSING_DATA_STR)] - temp_output = temp_output.drop(MISSING_DATA_STR) - temp_output = pd.concat([temp_output, cached]) - - #equalise the weights if equal_weights is True, except for Missing data - - temp_output = temp_output.transform(_weights_transform, weights=temp_output) - - if ew: - temp_output.iloc[:-1] = round(1 / (temp_output.shape[0] - 1), 3) - - #last item in the list must be Missing data weight for the num_col, - #regardless of whether Missing data is a value in cat_col - output = temp_output.to_list() - - return output - -def target_columns_for_weights_table(spec_dict): - ''' - Helper function to determine which columns should be used - in the weights table. - - Time columns and paired columns are excluded because they - don't in themselves contribute a different weight depending - on their value (time values are equal and paired columns have - the same weight as their parent columns). - - Parameters - ---------- - spec_dict : dict - original user specification - - Returns - ------- - A set of column names - ''' - - fixed_sql_sets = ["random", "mountains", "birds", "patients"] - - cat_cols = spec_dict["metadata"]["categorical_columns"] #includes linked - cat_cols_set = set(cat_cols) - - #drop columns, like(paired / regex columns) that we don't expect to have num. 
weights - for cat_col in cat_cols: - anon_set = spec_dict["columns"][cat_col]["anonymising_set"] - - # if we're missing original_values, there can be no weights - orig_vals = spec_dict["columns"][cat_col]["original_values"] - if orig_vals is None or (isinstance(orig_vals, pd.DataFrame) and orig_vals.empty): #pragma: no cover - cat_cols_set.remove(cat_col) - continue - - # skip the checks for custom functions - if callable(anon_set): - continue - if ( - is_paired(spec_dict, cat_col) or - # we keep the columns if they are in fixed sets or have custom SQL; - # because regex can be very variable, we assume that if anoymising set is not in - # fixed sets, and doesn't start with SELECT, it's regex and shouldn't have weights - (anon_set.split(".")[0] not in fixed_sql_sets and - anon_set.strip().upper()[:6] != "SELECT") - ): - cat_cols_set.remove(cat_col) - - return cat_cols_set - -# INNER MODULE METHODS -# ==================== -def _generate_weights_dataframe_from_sql(cat_col, spec_dict, num_cols): - ''' - Function to create a weights dataframe for a categorical column - whose values are drawn from exhibit DB. - - There are 4 of possible scenarios: - - random shuffle of existing values in a linked column - - random shuffle of existing values in a standalone column - - values drawn from an anonymising set for a linked column - - values drawn from an anonymising set for a standaline column - - Anonymising set for a linked group is often given just by its name, - like "mountains" which means we need to loop over ALL linked groups - and ALL linked columns within them to find the exact right linked column. 
- - ''' - - table_id = spec_dict["metadata"]["id"] - linked_groups = spec_dict.get("linked_columns", []) - anon_set = spec_dict["columns"][cat_col]["anonymising_set"] - val_count = spec_dict["columns"][cat_col]["uniques"] - - #determine the source of the data (table_name and sql_column) - if anon_set != "random": - - table_name, *sql_column = anon_set.split(".") - - #if column is part of linked group and set is multi-column - #table_name will still be equal to anon_set, but sql_column - #will have to depend on column's position in linked group - if not sql_column: - - for linked_group in linked_groups: - for i, col in enumerate(linked_group[1]): - if col == cat_col: - col_pos = i - - ws_df = pd.DataFrame( - data=( - query_exhibit_database(table_name) - .iloc[:, col_pos] - .drop_duplicates() - ) - ) - - #rename columns to match the source - ws_df.columns = [cat_col] - - else: - ws_df = pd.DataFrame( - data=query_exhibit_database(table_name, sql_column) - ) - #rename columns to match the source - ws_df.columns = [cat_col] - - else: - #two options: - #either column is part if a linked group which means - #the table_name is for the linked group, not column - #or column is saved into db under its own name - - for linked_group in linked_groups: - - # skip the zero-th linked group reserved for user defined linkage - if linked_group[0] == 0: - continue - - if cat_col in linked_group[1]: - - table_name = f"temp_{table_id}_{linked_group[0]}" - sql_column = cat_col.replace(" ", "$") - - ws_df = pd.DataFrame( - data=query_exhibit_database(table_name, sql_column) - ) - break - - else: - - table_name = f"temp_{table_id}_{cat_col.replace(' ', '$')}" - ws_df = pd.DataFrame( - data=query_exhibit_database(table_name) - ) - - #Finally, generate equal weights for the column and put into weights_df - for num_col in num_cols: - ws_df[num_col] = 1 / val_count - - return ws_df - -def _weights_transform(x, weights): - ''' - Transform weights values, including zeroes and NaNs, to - be 
betweeen 0.001 and 1. - - Vectorise this function! - ''' - - if x == 0: - return 0 - - if np.isnan(x): # pragma: no cover - return np.NaN - - return max(0.001, round(x / weights.sum(), 3)) +''' +Mini module for generating the weights table & related outputs +''' + +# Standard library imports +from collections import namedtuple + +# External library imports +import pandas as pd +import numpy as np + +# Exhibit import +from ..constants import MISSING_DATA_STR +from ..utils import exceeds_inline_limit, is_paired +from ..sql import query_exhibit_database + +# EXPORTABLE METHODS +# ================== +def generate_weights_table(spec_dict, target_cols): + ''' + Lookup table for weights + + Parameters + ---------- + spec_dict : dict + original user spec + target_cols: + a subset of columns meant for the weights_table + + Returns + ------- + dictionary where index levels are keys and + the weight column is the lookup value (as namedtuple) + + Weights and probabilities should be at least 0.001; + even if the original, non-anonymised data has a smaller + probability. 
+ ''' + + tuple_list = [] + + #second element in the tuple is the column's equal weight + #in case we're fitting a distribution + Weights = namedtuple("Weights", ["weight", "equal_weight"]) + + num_cols = ( + set(spec_dict["metadata"]["numerical_columns"]) - + set(spec_dict["derived_columns"]) + ) + + for cat_col in target_cols: + + val_count = spec_dict["columns"][cat_col]["uniques"] + equal_weight = 1 / val_count + full_anon_flag = False + + #if column is put into exhibit DB, weights are always uniform + if exceeds_inline_limit(spec_dict, cat_col): + + full_anon_flag = True + ws_df = _generate_weights_dataframe_from_sql(cat_col, spec_dict, num_cols) + + else: + #meaning, there are original_values, including weights + ws_df = spec_dict["columns"][cat_col]["original_values"] + + #get weights and values, from whatever WS was created + for num_col in num_cols: + + #if numerical column is missing from original_values DF (can happen when + #spec is generated outside of CLI / manually), use equal_weights. + if num_col not in ws_df.columns: #pragma: no cover + ws_df[num_col] = equal_weight + + ws = ws_df[num_col].astype(float) + # because we might've taken the FULL anon_set (150 or more), + # we need to make sure the weights are correct! 
+ if not full_anon_flag: + ws /= ws.sum() + ws_vals = ws_df[cat_col] + + for val, weight in zip(ws_vals, ws): + + tuple_list.append( + (num_col, cat_col, val, Weights(weight, equal_weight)) + ) + + #collect everything into output_df + output_df = pd.DataFrame(tuple_list, + columns=["num_col", "cat_col", "cat_value", "weights"]) + + #move the indexed dataframe to dict for perfomance + result = ( + output_df + .set_index(["num_col", "cat_col", "cat_value"]) + .to_dict(orient="index") + ) + + return result + +def generate_weights(df, cat_col, num_col, ew=False): + ''' + Weights are generated for a each value in each categorical column + where 1 means 100% of the numerical column is allocated to that value + + Parameters + ---------- + df : pd.DataFrame + source dataframe + cat_col : str + categorical column + num_col : str + numerical column + ew : Boolean + equal_weights parameter from CLI + + Returns + ------- + List of weights in ascending order of values rounded to 3 digits. + ''' + + # since weights are used at a per-row basis, we need to take an average of the + # numerical column per each categorical value to avoid over- and under-sized effects + # of duplicate rows. All NAs in numerical columns are treated as zeroes. 
+ weights = ( + df + .fillna({cat_col:MISSING_DATA_STR}) + .groupby([cat_col], observed=True)[num_col].mean().fillna(0) + ) + + temp_output = weights.sort_index(kind="mergesort") + + if MISSING_DATA_STR not in temp_output: + temp_output = pd.concat([temp_output, pd.Series( + index=[MISSING_DATA_STR], + data=0 + )]) + + #pop and reinsert missing data placeholder at the end of the list + else: + cached = temp_output[temp_output.index.str.contains(MISSING_DATA_STR)] + temp_output = temp_output.drop(MISSING_DATA_STR) + temp_output = pd.concat([temp_output, cached]) + + #equalise the weights if equal_weights is True, except for Missing data + + temp_output = temp_output.transform(_weights_transform, weights=temp_output) + + if ew: + temp_output.iloc[:-1] = round(1 / (temp_output.shape[0] - 1), 3) + + #last item in the list must be Missing data weight for the num_col, + #regardless of whether Missing data is a value in cat_col + output = temp_output.to_list() + + return output + +def target_columns_for_weights_table(spec_dict): + ''' + Helper function to determine which columns should be used + in the weights table. + + Time columns and paired columns are excluded because they + don't in themselves contribute a different weight depending + on their value (time values are equal and paired columns have + the same weight as their parent columns). + + Parameters + ---------- + spec_dict : dict + original user specification + + Returns + ------- + A set of column names + ''' + + fixed_sql_sets = ["random", "mountains", "birds", "patients"] + + cat_cols = spec_dict["metadata"]["categorical_columns"] #includes linked + cat_cols_set = set(cat_cols) + + #drop columns, like(paired / regex columns) that we don't expect to have num. 
weights + for cat_col in cat_cols: + anon_set = spec_dict["columns"][cat_col]["anonymising_set"] + + # if we're missing original_values, there can be no weights + orig_vals = spec_dict["columns"][cat_col]["original_values"] + if orig_vals is None or (isinstance(orig_vals, pd.DataFrame) and orig_vals.empty): #pragma: no cover + cat_cols_set.remove(cat_col) + continue + + # skip the checks for custom functions + if callable(anon_set): + continue + if ( + is_paired(spec_dict, cat_col) or + # we keep the columns if they are in fixed sets or have custom SQL; + # because regex can be very variable, we assume that if anoymising set is not in + # fixed sets, and doesn't start with SELECT, it's regex and shouldn't have weights + (anon_set.split(".")[0] not in fixed_sql_sets and + anon_set.strip().upper()[:6] != "SELECT") + ): + cat_cols_set.remove(cat_col) + + return cat_cols_set + +# INNER MODULE METHODS +# ==================== +def _generate_weights_dataframe_from_sql(cat_col, spec_dict, num_cols): + ''' + Function to create a weights dataframe for a categorical column + whose values are drawn from exhibit DB. + + There are 4 of possible scenarios: + - random shuffle of existing values in a linked column + - random shuffle of existing values in a standalone column + - values drawn from an anonymising set for a linked column + - values drawn from an anonymising set for a standaline column + + Anonymising set for a linked group is often given just by its name, + like "mountains" which means we need to loop over ALL linked groups + and ALL linked columns within them to find the exact right linked column. 
+ + ''' + + table_id = spec_dict["metadata"]["id"] + linked_groups = spec_dict.get("linked_columns", []) + anon_set = spec_dict["columns"][cat_col]["anonymising_set"] + val_count = spec_dict["columns"][cat_col]["uniques"] + + #determine the source of the data (table_name and sql_column) + if anon_set != "random": + + table_name, *sql_column = anon_set.split(".") + + #if column is part of linked group and set is multi-column + #table_name will still be equal to anon_set, but sql_column + #will have to depend on column's position in linked group + if not sql_column: + + for linked_group in linked_groups: + for i, col in enumerate(linked_group[1]): + if col == cat_col: + col_pos = i + + ws_df = pd.DataFrame( + data=( + query_exhibit_database(table_name) + .iloc[:, col_pos] + .drop_duplicates() + ) + ) + + #rename columns to match the source + ws_df.columns = [cat_col] + + else: + ws_df = pd.DataFrame( + data=query_exhibit_database(table_name, sql_column) + ) + #rename columns to match the source + ws_df.columns = [cat_col] + + else: + #two options: + #either column is part if a linked group which means + #the table_name is for the linked group, not column + #or column is saved into db under its own name + + for linked_group in linked_groups: + + # skip the zero-th linked group reserved for user defined linkage + if linked_group[0] == 0: + continue + + if cat_col in linked_group[1]: + + table_name = f"temp_{table_id}_{linked_group[0]}" + sql_column = cat_col.replace(" ", "$") + + ws_df = pd.DataFrame( + data=query_exhibit_database(table_name, sql_column) + ) + break + + else: + + table_name = f"temp_{table_id}_{cat_col.replace(' ', '$')}" + ws_df = pd.DataFrame( + data=query_exhibit_database(table_name) + ) + + #Finally, generate equal weights for the column and put into weights_df + for num_col in num_cols: + ws_df[num_col] = 1 / val_count + + return ws_df + +def _weights_transform(x, weights): + ''' + Transform weights values, including zeroes and NaNs, to + be 
betweeen 0.001 and 1. + + Vectorise this function! + ''' + + if x == 0: + return 0 + + if np.isnan(x): # pragma: no cover + return np.nan + + return max(0.001, round(x / weights.sum(), 3)) \ No newline at end of file diff --git a/exhibit/core/tests/test_reference.py b/exhibit/core/tests/test_reference.py index ac886e5..46059a9 100644 --- a/exhibit/core/tests/test_reference.py +++ b/exhibit/core/tests/test_reference.py @@ -385,7 +385,7 @@ def test_reference_inpatient_il10_random_data(self): replace=False) linked_cols = ["hb_code", "hb_name", "loc_code", "loc_name"] - test_dataframe.loc[rand_idx, linked_cols] = (np.NaN, np.NaN, np.NaN, np.NaN) + test_dataframe.loc[rand_idx, linked_cols] = (np.nan, np.nan, np.nan, np.nan) # Gives us ~10% chance of missing data rand_idx2 = rng.choice( @@ -394,7 +394,7 @@ def test_reference_inpatient_il10_random_data(self): replace=False) na_cols = ["sex"] - test_dataframe.loc[rand_idx2, na_cols] = np.NaN + test_dataframe.loc[rand_idx2, na_cols] = np.nan # modify CLI namespace fromdata_namespace = { @@ -457,7 +457,7 @@ def test_reference_inpatient_il50_random_data(self): replace=False) linked_cols = ["hb_code", "hb_name", "loc_code", "loc_name"] - test_dataframe.loc[rand_idx, linked_cols] = (np.NaN, np.NaN, np.NaN, np.NaN) + test_dataframe.loc[rand_idx, linked_cols] = (np.nan, np.nan, np.nan, np.nan) # modify CLI namespace fromdata_namespace = { @@ -525,7 +525,7 @@ def test_reference_inpatient_il10_mountains_data(self): replace=False) linked_cols = ["loc_code", "loc_name"] - test_dataframe.loc[rand_idx, linked_cols] = (np.NaN, np.NaN) + test_dataframe.loc[rand_idx, linked_cols] = (np.nan, np.nan) # modify CLI namespace fromdata_namespace = { @@ -605,7 +605,7 @@ def test_reference_inpatient_il50_mountains_data(self): replace=False) linked_cols = ["hb_code", "hb_name", "loc_code", "loc_name"] - test_dataframe.loc[rand_idx, linked_cols] = (np.NaN, np.NaN, np.NaN, np.NaN) + test_dataframe.loc[rand_idx, linked_cols] = (np.nan, np.nan, 
np.nan, np.nan) # modify CLI namespace fromdata_namespace = {