From 16883ef7acdb2fc6abba50b6d81dc601c5398360 Mon Sep 17 00:00:00 2001 From: Allen Downey Date: Sat, 16 Nov 2024 11:13:34 -0500 Subject: [PATCH] Revisions --- examples/zipf.ipynb | 1697 ++++++------------------------------------- 1 file changed, 214 insertions(+), 1483 deletions(-) diff --git a/examples/zipf.ipynb b/examples/zipf.ipynb index f7acab96..a74bbf41 100644 --- a/examples/zipf.ipynb +++ b/examples/zipf.ipynb @@ -1,14 +1,5 @@ { "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can order print and ebook versions of *Think Bayes 2e* from\n", - "[Bookshop.org](https://bookshop.org/a/98697/9781492089469) and\n", - "[Amazon](https://amzn.to/334eqGo)." - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -27,79 +18,29 @@ "\n", "To answer the first question, we'll do some Bayesian statistics.\n", "My solution is based on a model that's not very realistic, so we should not take the result too seriously, but it demonstrates some interesting methods, I think.\n", - "And as you'll see, there is a connection to Zipf's law, [which I wrote about last week](https://www.allendowney.com/blog/2024/11/10/zipfs-law/)." + "And as you'll see, there is a connection to Zipf's law, [which I wrote about last week](https://www.allendowney.com/blog/2024/11/10/zipfs-law/).\n", + "\n", + "Since last week's post was at the beginner level, I should warn you that this one is more advanced -- in rapid succession, it involves the beta distribution, the $t$ distribution, the negative binomial, and the binomial." 
] }, { - "cell_type": "code", - "execution_count": 1, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 1;\n", - " var nbb_unformatted_code = \"%load_ext nb_black\";\n", - " var nbb_formatted_code = \"%load_ext nb_black\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], "source": [ - "%load_ext nb_black" + "This post is based on *Think Bayes 2e*, which is available from\n", + "[Bookshop.org](https://bookshop.org/a/98697/9781492089469) and\n", + "[Amazon](https://amzn.to/334eqGo)." 
] }, { "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 2;\n", - " var nbb_unformatted_code = \"try:\\n import empiricaldist\\nexcept ImportError:\\n !pip install empiricaldist\";\n", - " var nbb_formatted_code = \"try:\\n import empiricaldist\\nexcept ImportError:\\n !pip install empiricaldist\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": 1, + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [], "source": [ "try:\n", " import empiricaldist\n", @@ -109,37 +50,13 @@ }, { "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 3;\n", - " var nbb_unformatted_code = \"# download thinkdsp.py\\n\\nfrom os.path import basename, exists\\n\\n\\ndef download(url):\\n filename = basename(url)\\n if not exists(filename):\\n from urllib.request import urlretrieve\\n\\n local, _ = urlretrieve(url, filename)\\n print(\\\"Downloaded \\\" + local)\\n\\n\\ndownload(\\\"https://github.com/AllenDowney/ThinkBayes2/raw/master/soln/utils.py\\\")\";\n", - " var nbb_formatted_code = \"# download thinkdsp.py\\n\\nfrom os.path import basename, exists\\n\\n\\ndef download(url):\\n filename = basename(url)\\n if not exists(filename):\\n from urllib.request import urlretrieve\\n\\n local, _ = urlretrieve(url, filename)\\n print(\\\"Downloaded \\\" + 
local)\\n\\n\\ndownload(\\\"https://github.com/AllenDowney/ThinkBayes2/raw/master/soln/utils.py\\\")\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": 2, + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [], "source": [ "# download thinkdsp.py\n", "\n", @@ -160,37 +77,13 @@ }, { "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 4;\n", - " var nbb_unformatted_code = \"import numpy as np\\nimport pandas as pd\\nimport matplotlib.pyplot as plt\\n\\nfrom empiricaldist import Pmf\\nfrom utils import decorate\\n\\nplt.rcParams[\\\"figure.dpi\\\"] = 75\\nplt.rcParams[\\\"figure.figsize\\\"] = [6, 3.5]\";\n", - " var nbb_formatted_code = \"import numpy as np\\nimport pandas as pd\\nimport matplotlib.pyplot as plt\\n\\nfrom empiricaldist import Pmf\\nfrom utils import decorate\\n\\nplt.rcParams[\\\"figure.dpi\\\"] = 75\\nplt.rcParams[\\\"figure.figsize\\\"] = [6, 3.5]\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": 3, + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [], 
"source": [ "import numpy as np\n", "import pandas as pd\n", @@ -220,74 +113,18 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 5;\n", - " var nbb_unformatted_code = \"download(\\\"https://www.corpusdata.org/coca/samples/coca-samples-text.zip\\\")\";\n", - " var nbb_formatted_code = \"download(\\\"https://www.corpusdata.org/coca/samples/coca-samples-text.zip\\\")\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "download(\"https://www.corpusdata.org/coca/samples/coca-samples-text.zip\")" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 6;\n", - " var nbb_unformatted_code = \"import zipfile\\n\\n\\ndef generate_lines(zip_path=\\\"coca-samples-text.zip\\\"):\\n with zipfile.ZipFile(zip_path, \\\"r\\\") as zip_file:\\n file_list = zip_file.namelist()\\n for file_name in file_list:\\n with zip_file.open(file_name) as file:\\n lines = file.readlines()\\n for line in lines:\\n yield (line.decode(\\\"utf-8\\\"))\";\n", - " var nbb_formatted_code = \"import zipfile\\n\\n\\ndef generate_lines(zip_path=\\\"coca-samples-text.zip\\\"):\\n with zipfile.ZipFile(zip_path, \\\"r\\\") as zip_file:\\n file_list = zip_file.namelist()\\n for file_name in file_list:\\n with zip_file.open(file_name) as file:\\n lines = 
file.readlines()\\n for line in lines:\\n yield (line.decode(\\\"utf-8\\\"))\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "import zipfile\n", "\n", @@ -311,37 +148,9 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 7;\n", - " var nbb_unformatted_code = \"import re\\nfrom collections import Counter\\n\\ncounter = Counter()\\n\\npattern = r\\\"[ /\\\\n]+|--\\\"\\n\\nfor line in generate_lines():\\n words = re.split(pattern, line)[1:]\\n counter.update(word.lower() for word in words if word)\";\n", - " var nbb_formatted_code = \"import re\\nfrom collections import Counter\\n\\ncounter = Counter()\\n\\npattern = r\\\"[ /\\\\n]+|--\\\"\\n\\nfor line in generate_lines():\\n words = re.split(pattern, line)[1:]\\n counter.update(word.lower() for word in words if word)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "import re\n", "from collections import Counter\n", @@ -364,7 +173,7 @@ }, { "cell_type": "code", - "execution_count": 8, + 
"execution_count": 7, "metadata": {}, "outputs": [ { @@ -373,41 +182,13 @@ "(188086, 11503819)" ] }, - "execution_count": 8, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 8;\n", - " var nbb_unformatted_code = \"num_words = counter.total()\\nlen(counter), num_words\";\n", - " var nbb_formatted_code = \"num_words = counter.total()\\nlen(counter), num_words\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ - "num_words = counter.total()\n", - "len(counter), num_words" + "len(counter), counter.total()" ] }, { @@ -419,37 +200,9 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 9;\n", - " var nbb_unformatted_code = \"for s in list(counter.keys()):\\n if not s[0].isalpha() or not s[-1].isalpha():\\n del counter[s]\";\n", - " var nbb_formatted_code = \"for s in list(counter.keys()):\\n if not s[0].isalpha() or not s[-1].isalpha():\\n del counter[s]\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - 
"output_type": "display_data" - } - ], + "outputs": [], "source": [ "for s in list(counter.keys()):\n", " if not s[0].isalpha() or not s[-1].isalpha():\n", @@ -465,7 +218,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -474,36 +227,9 @@ "(151414, 8889694)" ] }, - "execution_count": 10, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 10;\n", - " var nbb_unformatted_code = \"num_words = counter.total()\\nlen(counter), num_words\";\n", - " var nbb_formatted_code = \"num_words = counter.total()\\nlen(counter), num_words\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -520,7 +246,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -548,36 +274,9 @@ " ('we', 47694)]" ] }, - "execution_count": 11, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 11;\n", - " var nbb_unformatted_code = \"counter.most_common(20)\";\n", - " var nbb_formatted_code = \"counter.most_common(20)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - 
" break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -593,7 +292,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -602,36 +301,9 @@ "(72159, 0.811715228893143)" ] }, - "execution_count": 12, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 12;\n", - " var nbb_unformatted_code = \"singletons = [word for (word, freq) in counter.items() if freq == 1]\\nlen(singletons), len(singletons) / counter.total() * 100\";\n", - " var nbb_formatted_code = \"singletons = [word for (word, freq) in counter.items() if freq == 1]\\nlen(singletons), len(singletons) / counter.total() * 100\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -648,68 +320,40 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "array(['leakylibrary', 'vegans-people', 'aurelle', 'dispelling', 'abijah',\n", - " 'nation-wide', 'zenker', 'dwigmore.com', 'snake-charmers',\n", - " 'andorganization', 'religion-are', 'www.indulgedecor.com',\n", - " 'self-enchantment', 're-asserted', 'subvertive', 'guzzles',\n", - " 'pdp-4270hd', 'interrogatory', 'peregoy', \"everything'sback\",\n", - " 'celexa', 'ychou', \"qu'a\", 'up.according', 'emission-reduction',\n", - " 'needthese', 'smker', 'dry-salt', 'kouhei-kun', 
'mtv-hyped',\n", - " 'downspout', 'favelas', 'machinists', 'non-aspirin',\n", - " 'counter-heteronormative', 'mordoh', 'roulon', 'treecovered',\n", - " 'alabamians', 'blems', 'collusive', 'concensus', 'unenjoyable',\n", - " 'restrictiveness', 'flugsicherung', 'timberline', 'yoo-suk',\n", - " 'tania', 'yedu', 'panter', 'dramatica', 'praddock',\n", - " 'goldman-fergusons', 'snootful', 'diesel-electric', 'commerice',\n", - " 'progrssive', 'maltreating', 'white-clapboard', 'agricutural',\n", - " 'then-popular', \"sorry-i'm-late\", 'gamble-all', 'squeamishness',\n", - " 'edisons', 'meatside', 'alto-based', 'engine-tuning', 'folk-music',\n", - " 'x-i', 'hickam', 'octavius', 'participacin', 'perceivedness',\n", - " 'everywhere-and', 'switchblades', 'bocks', 'carbohydr', 'padfield',\n", - " 'ctb', 'expropriate', 'yeton', 'hualapai', 'unsinkable',\n", - " 'out-island', 'dallal', 'prostate-specific', 'paramecium',\n", - " 'one-to-nothing', 'irishamerican', 'zhezha', 'people.theyre',\n", - " 'dunphy', 'reclassify', 'southwind', 'roarsandscreeches',\n", - " 'dove-gray', 'lololol', 'www.votersfirstohio.com', 'bastide'],\n", + "array(['xcor', 'metress', 'commonspace', 'attilan', 'nutritus',\n", + " 'under-estimated', 'danci', 'thoughness', 'gmulder', 'multigrade',\n", + " 'tazzarine', 'well-remembered', 'snapchat', 'yt',\n", + " \"everything'sback\", 'moonclown', 'maschek', 'infront', 'meowing',\n", + " 'unhorses', 'waitressed', 'getbuckyballs.com', 'eye-rolling',\n", + " 'right.follow', 'al-maliki', 'where-it', 'candelabras',\n", + " 'trillion-dollar', 'poltically', 'way-stone', 'end-of-empire',\n", + " 'antiforgiveness', 'noncommunicative', 'astronomical-sized',\n", + " 'ms-like', 'colicky', 'mightly', 'lynsey', 'fifield',\n", + " 'consummately', 'oursega', 'steplewski', 'businessleaders',\n", + " 'pacifies', 'post-linsanity', 'high-born', 'okay.bye', 'mini-camp',\n", + " 'than-expected', 'x-lab', 'www.visitturin2006.com',\n", + " 'decreasing-there', 'kleiner', 'cosher', 
'drm', 'castleman',\n", + " 'treelet', 'ostapowicz', 'gerrymander', 'kibitzing', 'resequenced',\n", + " 'goat-man', 'drenaje', 'tionist', 'betrothed', 'cannondale.com',\n", + " 'praddock', 'napoleanic', 'tiltfilter', 'gowe', 'marchev', 'fugly',\n", + " 'mouthit', 'blumberg', 'langone', 'self-superiority', 'etsu',\n", + " 'friesian', 'blixt', 'couzens', 'firestorms', 'headachy',\n", + " 'bisected', 'reel-to-reel', 'obscurant', 'with.the', 'yelp.com',\n", + " 'frostily', 'hemsehus', 'pa-lin', 'sun-blocking', 'baerga',\n", + " 'rightinfrontofyou', 'guangxi', 'jeiky', 'babbacombe',\n", + " 'deborahlchamberlain', 'counterbalancing', 'baupoint', 'gowned'],\n", " dtype='" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -734,74 +378,18 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 13, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 14;\n", - " var nbb_unformatted_code = \"freqs = np.array(sorted(counter.values(), reverse=True))\";\n", - " var nbb_formatted_code = \"freqs = np.array(sorted(counter.values(), reverse=True))\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "freqs = np.array(sorted(counter.values(), reverse=True))" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 14, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 15;\n", - " var nbb_unformatted_code = \"n = 
len(freqs)\\nranks = range(1, n + 1)\";\n", - " var nbb_formatted_code = \"n = len(freqs)\\nranks = range(1, n + 1)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "n = len(freqs)\n", "ranks = range(1, n + 1)" @@ -816,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 15, "metadata": {}, "outputs": [ { @@ -828,33 +416,6 @@ }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 16;\n", - " var nbb_unformatted_code = \"plt.plot(ranks, freqs)\\n\\ndecorate(\\n title=\\\"Zipf plot\\\", xlabel=\\\"Rank\\\", ylabel=\\\"Frequency\\\", xscale=\\\"log\\\", yscale=\\\"log\\\"\\n)\";\n", - " var nbb_formatted_code = \"plt.plot(ranks, freqs)\\n\\ndecorate(\\n title=\\\"Zipf plot\\\", xlabel=\\\"Rank\\\", ylabel=\\\"Frequency\\\", xscale=\\\"log\\\", yscale=\\\"log\\\"\\n)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -875,7 +436,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 16, "metadata": {}, "outputs": [ { @@ -884,36 +445,9 @@ "-5.664633515191604" ] }, - 
"execution_count": 17, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 17;\n", - " var nbb_unformatted_code = \"rise = np.log10(freqs[-1]) - np.log10(freqs[0])\\nrise\";\n", - " var nbb_formatted_code = \"rise = np.log10(freqs[-1]) - np.log10(freqs[0])\\nrise\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -923,7 +457,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -932,36 +466,9 @@ "5.180166032638616" ] }, - "execution_count": 18, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 18;\n", - " var nbb_unformatted_code = \"run = np.log10(ranks[-1]) - np.log10(ranks[0])\\nrun\";\n", - " var nbb_formatted_code = \"run = np.log10(ranks[-1]) - np.log10(ranks[0])\\nrun\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -971,7 +478,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 18, 
"metadata": {}, "outputs": [ { @@ -980,36 +487,9 @@ "-1.0935235433575892" ] }, - "execution_count": 19, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 19;\n", - " var nbb_unformatted_code = \"rise / run\";\n", - " var nbb_formatted_code = \"rise / run\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -1032,118 +512,17 @@ "Given the number of times each word appear in the corpus, we can compute the rates, which is the number of times we expect each word to appear in a sample of a given size, and the inverse rates, which are the number of words we need to see before we expect a given word to appear.\n", "\n", "We will find it most convenient to work with the distribution of inverse rates on a log scale.\n", - "Here are the inverse rates:" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 20;\n", - " var nbb_unformatted_code = \"def describe(seq):\\n return pd.Series(seq).describe()\";\n", - " var nbb_formatted_code = \"def describe(seq):\\n return pd.Series(seq).describe()\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " 
break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "def describe(seq):\n", - " return pd.Series(seq).describe()" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 21;\n", - " var nbb_unformatted_code = \"num_words = counter.total()\\nrates = np.array(freqs) / num_words\";\n", - " var nbb_formatted_code = \"num_words = counter.total()\\nrates = np.array(freqs) / num_words\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "num_words = counter.total()\n", - "rates = np.array(freqs) / num_words" + "The first step is to use the observed frequencies to estimate word rates -- we'll estimate the rate at which each word would appear in a random sample. \n", + "\n", + "We'll do that by creating a beta distribution that represents the posterior distribution of word rates, given the observed frequencies (see [this section of *Think Bayes*](https://allendowney.github.io/ThinkBayes2/chap18.html#the-conjugate-prior)) -- and then drawing a random sample from the posterior.\n", + "So words that have the same frequency will not generally have the same inferred rate." 
] }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 19, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 70;\n", - " var nbb_unformatted_code = \"from scipy.stats import beta\\n\\nnp.random.seed(17)\\nalphas = freqs + 1\\nbetas = num_words - freqs + 1\\ninferred_rates = beta(alphas, betas).rvs()\";\n", - " var nbb_formatted_code = \"from scipy.stats import beta\\n\\nnp.random.seed(17)\\nalphas = freqs + 1\\nbetas = num_words - freqs + 1\\ninferred_rates = beta(alphas, betas).rvs()\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "from scipy.stats import beta\n", "\n", @@ -1154,46 +533,43 @@ ] }, { - "cell_type": "code", - "execution_count": 71, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 71;\n", - " var nbb_unformatted_code = \"# rates = np.where(freqs <= 500, inferred_rates, rates)\";\n", - " var nbb_formatted_code = \"# rates = np.where(freqs <= 500, inferred_rates, rates)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": 
"display_data" - } - ], "source": [ - "# rates = np.where(freqs <= 500, inferred_rates, rates)" + "Now we can compute the inverse rates, which are the number of words we have to sample before we expect to see each word once." ] }, { "cell_type": "code", - "execution_count": 72, + "execution_count": 20, "metadata": {}, + "outputs": [], + "source": [ + "inverse_rates = 1 / inferred_rates" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "tags": [ + "remove-cell" + ] + }, + "outputs": [], + "source": [ + "def describe(seq):\n", + " return pd.Series(seq).describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "tags": [ + "remove-cell" + ] + }, "outputs": [ { "data": { @@ -1209,40 +585,12 @@ "dtype: float64" ] }, - "execution_count": 72, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 72;\n", - " var nbb_unformatted_code = \"inverse_rates = 1 / inferred_rates\\ndescribe(inverse_rates)\";\n", - " var nbb_formatted_code = \"inverse_rates = 1 / inferred_rates\\ndescribe(inverse_rates)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ - "inverse_rates = 1 / inferred_rates\n", "describe(inverse_rates)" ] }, @@ -1255,8 +603,21 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": 23, "metadata": {}, + "outputs": [], + "source": [ + "mags = np.log10(inverse_rates)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "tags": [ 
+ "remove-cell" + ] + }, "outputs": [ { "data": { @@ -1272,40 +633,12 @@ "dtype: float64" ] }, - "execution_count": 73, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 73;\n", - " var nbb_unformatted_code = \"mags = np.log10(inverse_rates)\\ndescribe(mags)\";\n", - " var nbb_formatted_code = \"mags = np.log10(inverse_rates)\\ndescribe(mags)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ - "mags = np.log10(inverse_rates)\n", "describe(mags)" ] }, @@ -1313,43 +646,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "From the `empiricaldist` library, we'll use the `Surv` object, which represents survival functions, but we'll use a variation of the survival function which is the probability that a randomly-chosen value is greater than or equal to a given quantity.\n", + "To represent the distribution of these magnitudes, we'll use a `Surv` object, which represents survival functions, but we'll use a variation of the survival function which is the probability that a randomly-chosen value is greater than or equal to a given quantity.\n", "The following function computes this version of a survival function, which is called a tail probability." 
] }, { "cell_type": "code", - "execution_count": 74, + "execution_count": 25, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 74;\n", - " var nbb_unformatted_code = \"from empiricaldist import Surv\\n\\n\\ndef make_surv(seq):\\n \\\"\\\"\\\"Make a non-standard survival function, P(X>=x)\\\"\\\"\\\"\\n pmf = Pmf.from_seq(seq)\\n surv = pmf.make_surv() + pmf\\n\\n # correct for numerical error\\n surv.iloc[0] = 1\\n return Surv(surv)\";\n", - " var nbb_formatted_code = \"from empiricaldist import Surv\\n\\n\\ndef make_surv(seq):\\n \\\"\\\"\\\"Make a non-standard survival function, P(X>=x)\\\"\\\"\\\"\\n pmf = Pmf.from_seq(seq)\\n surv = pmf.make_surv() + pmf\\n\\n # correct for numerical error\\n surv.iloc[0] = 1\\n return Surv(surv)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "from empiricaldist import Surv\n", "\n", @@ -1373,92 +678,11 @@ }, { "cell_type": "code", - "execution_count": 75, + "execution_count": 26, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
probs
8.9617090.000020
8.9926840.000013
9.1333760.000007
\n", - "
" - ], - "text/plain": [ - "8.961709 0.000020\n", - "8.992684 0.000013\n", - "9.133376 0.000007\n", - "Name: , dtype: float64" - ] - }, - "execution_count": 75, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 75;\n", - " var nbb_unformatted_code = \"surv = make_surv(mags)\\nsurv.tail()\";\n", - " var nbb_formatted_code = \"surv = make_surv(mags)\\nsurv.tail()\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ - "surv = make_surv(mags)\n", - "surv.tail()" + "surv = make_surv(mags)" ] }, { @@ -1470,7 +694,7 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 27, "metadata": {}, "outputs": [ { @@ -1482,33 +706,6 @@ }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 76;\n", - " var nbb_unformatted_code = \"surv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(xlabel=\\\"Inverse rate (log10 words per appearance)\\\", ylabel=\\\"Tail probability\\\")\";\n", - " var nbb_formatted_code = \"surv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(xlabel=\\\"Inverse rate (log10 words per appearance)\\\", ylabel=\\\"Tail probability\\\")\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) 
{\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -1527,7 +724,7 @@ }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 28, "metadata": {}, "outputs": [ { @@ -1539,33 +736,6 @@ }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 77;\n", - " var nbb_unformatted_code = \"surv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(xlabel=\\\"Inverse rate (words per appearance)\\\", yscale=\\\"log\\\")\";\n", - " var nbb_formatted_code = \"surv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(xlabel=\\\"Inverse rate (words per appearance)\\\", yscale=\\\"log\\\")\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -1590,44 +760,16 @@ "\n", "To estimate the frequency of rare words, we will need to model the tail behavior of this distribution and extrapolate it beyond the data.\n", "So let's fit a $t$ distribution and see how it looks.\n", - "I'll use code from Chapter 8 of *Probably Overthinking It*, which is all about these long-tailed distributions.\n", + "I'll use code from [Chapter 8 of *Probably Overthinking It*](https://allendowney.github.io/ProbablyOverthinkingIt/longtail.html), which is all about these long-tailed distributions.\n", "\n", "The following function makes a `Surv` object that represents a $t$ distribution 
with the given parameters." ] }, { "cell_type": "code", - "execution_count": 78, + "execution_count": 29, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 78;\n", - " var nbb_unformatted_code = \"from scipy.stats import t as t_dist\\n\\n\\ndef truncated_t_sf(qs, df, mu, sigma):\\n ps = t_dist.sf(qs, df, mu, sigma)\\n surv_model = Surv(ps / ps[0], qs)\\n return surv_model\";\n", - " var nbb_formatted_code = \"from scipy.stats import t as t_dist\\n\\n\\ndef truncated_t_sf(qs, df, mu, sigma):\\n ps = t_dist.sf(qs, df, mu, sigma)\\n surv_model = Surv(ps / ps[0], qs)\\n return surv_model\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "from scipy.stats import t as t_dist\n", "\n", @@ -1647,37 +789,9 @@ }, { "cell_type": "code", - "execution_count": 79, + "execution_count": 30, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 79;\n", - " var nbb_unformatted_code = \"from scipy.optimize import least_squares\\n\\n\\ndef fit_truncated_t(df, surv):\\n \\\"\\\"\\\"Given df, find the best values of mu and sigma.\\\"\\\"\\\"\\n low, high = surv.qs.min(), surv.qs.max()\\n qs_model = np.linspace(low, high, 1000)\\n ps = np.linspace(0.01, 0.8, 20)\\n qs = surv.inverse(ps)\\n\\n def error_func_t(params, df, surv):\\n mu, sigma = params\\n surv_model = truncated_t_sf(qs_model, df, mu, sigma)\\n\\n error = surv(qs) - surv_model(qs)\\n return error\\n\\n pmf = surv.make_pmf()\\n 
pmf.normalize()\\n params = pmf.mean(), pmf.std()\\n res = least_squares(error_func_t, x0=params, args=(df, surv), xtol=1e-3)\\n assert res.success\\n return res.x\";\n", - " var nbb_formatted_code = \"from scipy.optimize import least_squares\\n\\n\\ndef fit_truncated_t(df, surv):\\n \\\"\\\"\\\"Given df, find the best values of mu and sigma.\\\"\\\"\\\"\\n low, high = surv.qs.min(), surv.qs.max()\\n qs_model = np.linspace(low, high, 1000)\\n ps = np.linspace(0.01, 0.8, 20)\\n qs = surv.inverse(ps)\\n\\n def error_func_t(params, df, surv):\\n mu, sigma = params\\n surv_model = truncated_t_sf(qs_model, df, mu, sigma)\\n\\n error = surv(qs) - surv_model(qs)\\n return error\\n\\n pmf = surv.make_pmf()\\n pmf.normalize()\\n params = pmf.mean(), pmf.std()\\n res = least_squares(error_func_t, x0=params, args=(df, surv), xtol=1e-3)\\n assert res.success\\n return res.x\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "from scipy.optimize import least_squares\n", "\n", @@ -1686,7 +800,7 @@ " \"\"\"Given df, find the best values of mu and sigma.\"\"\"\n", " low, high = surv.qs.min(), surv.qs.max()\n", " qs_model = np.linspace(low, high, 1000)\n", - " ps = np.linspace(0.01, 0.8, 20)\n", + " ps = np.linspace(0.1, 0.8, 20)\n", " qs = surv.inverse(ps)\n", "\n", " def error_func_t(params, df, surv):\n", @@ -1713,37 +827,9 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 31, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 80;\n", - " var 
nbb_unformatted_code = \"from scipy.optimize import minimize\\n\\n\\ndef minimize_df(df0, surv, bounds=[(1, 1e3)], ps=None):\\n low, high = surv.qs.min(), surv.qs.max()\\n qs_model = np.linspace(low, high * 1.2, 2000)\\n\\n if ps is None:\\n t = surv.ps[0], surv.ps[-5]\\n low, high = np.log10(t)\\n ps = np.logspace(low, high, 30, endpoint=False)\\n\\n qs = surv.inverse(ps)\\n\\n def error_func_tail(params):\\n (df,) = params\\n print(df)\\n mu, sigma = fit_truncated_t(df, surv)\\n surv_model = truncated_t_sf(qs_model, df, mu, sigma)\\n\\n errors = np.log10(surv(qs)) - np.log10(surv_model(qs))\\n return np.sum(errors**2)\\n\\n params = (df0,)\\n res = minimize(error_func_tail, x0=params, bounds=bounds, tol=1e-3, method=\\\"Powell\\\")\\n assert res.success\\n return res.x\";\n", - " var nbb_formatted_code = \"from scipy.optimize import minimize\\n\\n\\ndef minimize_df(df0, surv, bounds=[(1, 1e3)], ps=None):\\n low, high = surv.qs.min(), surv.qs.max()\\n qs_model = np.linspace(low, high * 1.2, 2000)\\n\\n if ps is None:\\n t = surv.ps[0], surv.ps[-5]\\n low, high = np.log10(t)\\n ps = np.logspace(low, high, 30, endpoint=False)\\n\\n qs = surv.inverse(ps)\\n\\n def error_func_tail(params):\\n (df,) = params\\n print(df)\\n mu, sigma = fit_truncated_t(df, surv)\\n surv_model = truncated_t_sf(qs_model, df, mu, sigma)\\n\\n errors = np.log10(surv(qs)) - np.log10(surv_model(qs))\\n return np.sum(errors**2)\\n\\n params = (df0,)\\n res = minimize(error_func_tail, x0=params, bounds=bounds, tol=1e-3, method=\\\"Powell\\\")\\n assert res.success\\n return res.x\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - 
"output_type": "display_data" - } - ], + "outputs": [], "source": [ "from scipy.optimize import minimize\n", "\n", @@ -1776,7 +862,7 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 32, "metadata": {}, "outputs": [ { @@ -1793,70 +879,43 @@ "35.407411894884405\n", "22.26495001595599\n", "14.142461878928419\n", - "25.283073250551638\n", - "19.7236020807297\n", - "22.167510053888318\n", - "21.970382380734854\n", - "21.93000366412791\n", - "21.932973281119267\n", - "21.93330665994396\n", - "21.933640038763702\n", - "18.866613319887918\n", + "25.697947062229677\n", + "20.707337220921392\n", + "22.572143210805486\n", + "22.390897998699558\n", + "22.387674578312218\n", + "22.38696883830132\n", + "22.386635466210475\n", + "22.38630209411468\n", + "19.77327093242095\n", "382.58404523885497\n", "618.4159547611448\n", - "236.83190952228986\n", - "146.75213571656514\n", - "91.07977380572476\n", - "56.67236191084038\n", - "35.407411894884405\n", - "22.264950015955982\n", - "14.142461878928419\n", - "25.283073250516892\n", - "19.723601830671786\n", - "22.167510080319087\n", - "21.97038323398375\n", - "21.93000166466634\n", - "21.932975865105618\n", - "21.933309198443858\n", - "21.932642531767378\n" + "236.8319095222899\n", + "146.75213571656516\n", + "91.0797738057248\n", + "56.672361910840394\n", + "35.40741189488441\n", + "22.264950015955993\n", + "14.142461878928424\n", + "25.697947054112568\n", + "20.7073371943388\n", + "22.572143293741753\n", + "22.390897870539447\n", + "22.38767398796015\n", + "22.386983906149368\n", + "22.386650572810865\n", + "22.38731723948787\n" ] }, { "data": { "text/plain": [ - "array([21.93297587])" + "array([22.38698391])" ] }, - "execution_count": 81, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 81;\n", - " var nbb_unformatted_code = \"df = minimize_df(25, surv)\\ndf\";\n", - " var 
nbb_formatted_code = \"df = minimize_df(25, surv)\\ndf\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -1866,45 +925,18 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "(array([21.93297587]), 6.430927981834091, 0.48910465307252876)" + "(array([22.38698391]), 6.430702047528606, 0.490849531484811)" ] }, - "execution_count": 82, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 82;\n", - " var nbb_unformatted_code = \"mu, sigma = fit_truncated_t(df, surv)\\ndf, mu, sigma\";\n", - " var nbb_formatted_code = \"mu, sigma = fit_truncated_t(df, surv)\\ndf, mu, sigma\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -1921,84 +953,29 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 34, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 83;\n", - " var nbb_unformatted_code = \"low, high = surv.qs.min(), surv.qs.max()\\nqs = 
np.linspace(low, 11, 2000)\\nsurv_model = truncated_t_sf(qs, df, mu, sigma)\";\n", - " var nbb_formatted_code = \"low, high = surv.qs.min(), surv.qs.max()\\nqs = np.linspace(low, 11, 2000)\\nsurv_model = truncated_t_sf(qs, df, mu, sigma)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "low, high = surv.qs.min(), surv.qs.max()\n", - "qs = np.linspace(low, 11, 2000)\n", + "qs = np.linspace(low, 10, 2000)\n", "surv_model = truncated_t_sf(qs, df, mu, sigma)" ] }, { "cell_type": "code", - "execution_count": 84, + "execution_count": 35, "metadata": {}, "outputs": [ { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 84;\n", - " var nbb_unformatted_code = \"surv_model.plot(color=\\\"gray\\\", alpha=0.4, label=\\\"model\\\")\\nsurv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(xlabel=\\\"Inverse rate (log10 words per appearance)\\\", ylabel=\\\"Tail probability\\\")\";\n", - " var nbb_formatted_code = \"surv_model.plot(color=\\\"gray\\\", alpha=0.4, label=\\\"model\\\")\\nsurv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(xlabel=\\\"Inverse rate (log10 words per appearance)\\\", ylabel=\\\"Tail probability\\\")\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2011,52 +988,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "With the y-axis on a linear scale, we can see that the model fits the data reasonably well, except for a range in the middle of the distribution -- the words that are not common or rare.\n", + "With the y-axis on a linear scale, we can see that the model fits the data reasonably well, except for a range between 5 and 6 -- that is for words that appear about 1 time in a million.\n", "\n", - "And here's what the model looks like on a log-y scale." + "Here's what the model looks like on a log-y scale." ] }, { "cell_type": "code", - "execution_count": 85, + "execution_count": 36, "metadata": {}, "outputs": [ { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 85;\n", - " var nbb_unformatted_code = \"surv_model.plot(color=\\\"gray\\\", alpha=0.4, label=\\\"model\\\")\\nsurv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(\\n xlabel=\\\"Inverse rate (log10 words per appearance)\\\",\\n ylabel=\\\"Tail probability\\\",\\n yscale=\\\"log\\\",\\n)\";\n", - " var nbb_formatted_code = \"surv_model.plot(color=\\\"gray\\\", alpha=0.4, label=\\\"model\\\")\\nsurv.plot(marker=\\\".\\\", ms=1, lw=0.2, label=\\\"data\\\")\\ndecorate(\\n xlabel=\\\"Inverse rate (log10 words per appearance)\\\",\\n ylabel=\\\"Tail probability\\\",\\n yscale=\\\"log\\\",\\n)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2090,45 +1040,18 @@ }, { "cell_type": "code", - "execution_count": 86, + "execution_count": 37, "metadata": {}, "outputs": [ { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 86;\n", - " var nbb_unformatted_code = \"prior = surv_model.make_pmf()\\nprior.plot(label=\\\"prior\\\")\\ndecorate(\\n xlabel=\\\"Inverse rate (log10 words per appearance)\\\",\\n ylabel=\\\"Density\\\",\\n)\";\n", - " var nbb_formatted_code = \"prior = surv_model.make_pmf()\\nprior.plot(label=\\\"prior\\\")\\ndecorate(\\n xlabel=\\\"Inverse rate (log10 words per appearance)\\\",\\n ylabel=\\\"Density\\\",\\n)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2149,37 +1072,9 @@ }, { "cell_type": "code", - "execution_count": 87, + "execution_count": 38, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 87;\n", - " var nbb_unformatted_code = \"ps = 1 / np.power(10, prior.qs)\";\n", - " var nbb_formatted_code = \"ps = 1 / np.power(10, prior.qs)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "ps = 1 / np.power(10, prior.qs)" ] @@ -2188,13 +1083,13 @@ "cell_type": "markdown", 
"metadata": {}, "source": [ - "Now suppose that in a given day, you read or hear 10,000 words in a context where you would notice if you heard a new word for the first time.\n", + "Now suppose that in a given day, you read or hear 10,000 words in a context where you would notice if you heard a word for the first time.\n", "Here's the number of words you would hear in 50 years." ] }, { "cell_type": "code", - "execution_count": 88, + "execution_count": 39, "metadata": {}, "outputs": [ { @@ -2203,36 +1098,9 @@ "182500000" ] }, - "execution_count": 88, + "execution_count": 39, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 88;\n", - " var nbb_unformatted_code = \"words_per_day = 10_000\\ndays = 50 * 365\\nk = days * words_per_day\\nk\";\n", - " var nbb_formatted_code = \"words_per_day = 10_000\\ndays = 50 * 365\\nk = days * words_per_day\\nk\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2246,48 +1114,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, what's the probability that you fail to hear a word in `k` attempts and then hear it on the next attempt?\n", - "We can answer that with the negative binomial distribution, which computes the probability of getting the `n`th success after `k` failures." 
+ "Now, what's the probability that you fail to encounter a word in `k` attempts and then encounter it on the next attempt?\n", + "We can answer that with the negative binomial distribution, which computes the probability of getting the `n`th success after `k` failures, for a given probability -- or in this case, for a sequence of possible probabilities." ] }, { "cell_type": "code", - "execution_count": 89, + "execution_count": 40, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 89;\n", - " var nbb_unformatted_code = \"from scipy.stats import nbinom\\n\\nn = 1\\n\\nlikelihood = nbinom.pmf(k, n, ps)\";\n", - " var nbb_formatted_code = \"from scipy.stats import nbinom\\n\\nn = 1\\n\\nlikelihood = nbinom.pmf(k, n, ps)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "from scipy.stats import nbinom\n", "\n", "n = 1\n", - "\n", "likelihood = nbinom.pmf(k, n, ps)" ] }, @@ -2300,45 +1139,18 @@ }, { "cell_type": "code", - "execution_count": 90, + "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "1.3468326547799861e-11" + "1.3581072166387401e-11" ] }, - "execution_count": 90, + "execution_count": 41, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 90;\n", - " var nbb_unformatted_code = \"posterior = prior * likelihood\\nposterior.normalize()\";\n", - " var nbb_formatted_code = \"posterior = prior * 
likelihood\\nposterior.normalize()\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2355,45 +1167,18 @@ }, { "cell_type": "code", - "execution_count": 91, + "execution_count": 42, "metadata": {}, "outputs": [ { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 91;\n", - " var nbb_unformatted_code = \"prior.plot(alpha=0.5, label=\\\"prior\\\")\\nposterior.plot(label=\\\"posterior\\\")\\ndecorate(\\n xlabel=\\\"Inverse rate (log10 words per appearance)\\\",\\n ylabel=\\\"Density\\\",\\n)\";\n", - " var nbb_formatted_code = \"prior.plot(alpha=0.5, label=\\\"prior\\\")\\nposterior.plot(label=\\\"posterior\\\")\\ndecorate(\\n xlabel=\\\"Inverse rate (log10 words per appearance)\\\",\\n ylabel=\\\"Density\\\",\\n)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2412,45 +1197,17 @@ "If you go 50 years without hearing a word, that suggests that it is a rare word, and the posterior distribution reflects that logic.\n", "\n", "The posterior distribution represents a range of possible values for the inverse rate of the word you heard.\n", - "Now we can use it to answer the question we started with: what is the probability of hearing the same word again on the same day -- that is, within the next 10,000 words you hear.\n", + "Now we can use it to answer the question we started with: what is the probability of hearing the same word again on the same day -- that is, within the next 10,000 words you hear?\n", "\n", - "To answer that, we can use the survival function of the binomial distribution to compute the probability of more than 0 successes in the next `n_pred` attempts.\n", + "To answer that, we can use the survival function of the [binomial 
distribution](https://allendowney.github.io/ThinkBayes2/chap18.html?highlight=binomial#binomial-likelihood) to compute the probability of more than 0 successes in the next `n_pred` attempts.\n", "We'll compute this probability for each of the `ps` that correspond to the inverse rates in the posterior." ] }, { "cell_type": "code", - "execution_count": 92, + "execution_count": 43, "metadata": {}, - "outputs": [ - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 92;\n", - " var nbb_unformatted_code = \"from scipy.stats import binom\\n\\nn_pred = words_per_day\\nps_pred = binom.sf(0, n_pred, ps)\";\n", - " var nbb_formatted_code = \"from scipy.stats import binom\\n\\nn_pred = words_per_day\\nps_pred = binom.sf(0, n_pred, ps)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "from scipy.stats import binom\n", "\n", @@ -2462,50 +1219,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "And we can use the probabilities in the posterior to compute the expected value." + "And we can use the probabilities in the posterior to compute the expected value -- by the law of total probability, the result is the probability of hearing the same word again within a day." 
] }, { "cell_type": "code", - "execution_count": 93, + "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "(0.00016010294670308533, 6245.981230155142)" + "(0.00016009921991573168, 6246.1266240169725)" ] }, - "execution_count": 93, + "execution_count": 44, "metadata": {}, "output_type": "execute_result" - }, - { - "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 93;\n", - " var nbb_unformatted_code = \"p = np.sum(posterior * ps_pred)\\np, 1 / p\";\n", - " var nbb_formatted_code = \"p = np.sum(posterior * ps_pred)\\np, 1 / p\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" } ], "source": [ @@ -2517,9 +1247,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The chance of hearing the same word again within a day is about 1 in 6000.\n", - "With all of the assumptions we made in this calculation, there's no reason to be more precise than that.\n", + "The result is about 1 in 6000.\n", "\n", + "With all of the assumptions we made in this calculation, there's no reason to be more precise than that.\n", "And as I mentioned at the beginning, we should probably not take this conclusion too seriously.\n", "If you hear a word for the first time after 50 years, there's a good chance the word is \"having a moment\", which greatly increases the chance you'll hear it again.\n", "I can't think of why chartism might be in the news at the moment, but maybe this post will go viral and make it happen." 
@@ -2543,6 +1273,7 @@ } ], "metadata": { + "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python",