OpenShitsuChan

Open-source version of Shitsu-chan

How to start

Create a database
Configure project.config.json
Load a backup file from /assets/dump into your Postgres database
Run npm start
...and you are done!

Introduction

The user thinks of a character without telling the computer who it is. The computer then tries to guess the character by asking as few questions as possible. If the character is not in the knowledge base, the user can contribute by submitting the character they had in mind.

How?

The first thing we know is that, initially, each character has an equal chance of being chosen. With $T$ denoting the total number of characters,

$P(H) = \frac{1}{T} \quad (1)$

$P(\neg H) = 1 - \frac{1}{T} \quad (2)$

The second thing we would like to know is how to calculate the probability that a hypothesis H is true given the available evidence.

A simple way to express this idea mathematically is Bayes' theorem. For any two events H and E with $P(E) > 0$,

$P (H \land E) = P (H) P (E | H) = P (E) P (H | E)$

$P(H | E) = \frac{P(H) P(E | H)}{P(E)} \quad (3)$

Read as “probability of hypothesis H given the evidence E”.

We know that for any event X,

$P(X) + P(\neg X) = 1 \quad (4)$

Splitting an event $Y$ over $X$ and $\neg X$ in the same way gives the law of total probability:

$P (X) P (Y | X) + P (\neg X) P (Y | \neg X) = P (Y)$

We can then rewrite Bayes' theorem by expanding the denominator $P(E)$ with the law of total probability:

$P(H | E) = \frac{P(H) P(E | H)}{P(H) P(E | H) + P(\neg H) P(E | \neg H)} \quad (5)$
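
As a quick numerical check of equation (5), here is a minimal Python sketch. The function name `posterior` and the numbers are illustrative assumptions and are not taken from the project's source.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Equation (5): P(H | E) from the prior P(H) and the two likelihoods."""
    numerator = prior * p_e_given_h
    denominator = numerator + (1.0 - prior) * p_e_given_not_h
    if denominator == 0.0:
        # P(E) = 0: the evidence is impossible under both H and not H.
        raise ValueError("evidence has zero probability")
    return numerator / denominator


# Hypothetical numbers: 4 characters, so the prior is 1/4.
print(posterior(prior=0.25, p_e_given_h=0.9, p_e_given_not_h=0.3))
```

With these numbers the evidence is three times more likely under H than under not H, which lifts the posterior from 0.25 to roughly 0.5.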

In the more general case where we have several pieces of evidence,

$P(H | E_{1} \land E_{2} \land \dots \land E_{N}) = \frac{P(H) \prod_{k} P(E_{k} | H)}{P(H) \prod_{k} P(E_{k} | H) + P(\neg H) \prod_{k} P(E_{k} | \neg H)} \quad (6)$

This relies on the assumption that the pieces of evidence are conditionally independent, i.e. for events X and Y that are independent given Z: $P(X \land Y | Z) = P(X | Z) P(Y | Z)$
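
Equation (6) can be sketched in Python as follows. This is only an illustration under the conditional independence assumption; `posterior_multi` and the sample likelihoods are hypothetical names and values.

```python
import math
from typing import Sequence


def posterior_multi(
    prior: float,
    likelihoods_h: Sequence[float],
    likelihoods_not_h: Sequence[float],
) -> float:
    """Equation (6): posterior of H after N pieces of evidence.

    likelihoods_h[k] is P(E_k | H), likelihoods_not_h[k] is P(E_k | not H).
    """
    if len(likelihoods_h) != len(likelihoods_not_h):
        raise ValueError("both likelihood lists must have the same length")
    numerator = prior * math.prod(likelihoods_h)
    denominator = numerator + (1.0 - prior) * math.prod(likelihoods_not_h)
    if denominator == 0.0:
        raise ValueError("evidence has zero probability")
    return numerator / denominator


# Two hypothetical answers, both more likely under H than under not H.
print(posterior_multi(0.25, [0.9, 0.8], [0.3, 0.4]))
```

With a single piece of evidence the function reduces to equation (5).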

Modeling

Computing $P (E_{k} | H)$

We want $P(E_{k} | H)$ to be close to 1, i.e. $P(E_{k} | H) = 1 - \epsilon$ with $\epsilon \geq 0$, when the user's answer matches ours. We can model this with a normalized distance function $d(x, y)$.

$P(E_{k} | H) = 1 - d(\mathrm{UserAns}[k], \mathrm{OurAns}[k])$

$P(E_{k} | H) = 1 - d(A_{k}, Q_{k}^{(H)}) \quad (7)$
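
A minimal Python sketch of equation (7), using the distance $d(x, y) = |x - y|$ that this README defines further down. The answer encoding (1.0 for "yes", 0.0 for "no", 0.5 for "don't know") is an assumption made for the example, as are the function names.

```python
def distance(x: float, y: float) -> float:
    """Normalized distance d(x, y) = |x - y| for x, y in [0, 1]."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("answers must lie in [0, 1]")
    return abs(x - y)


def likelihood_given_h(user_answer: float, expected_answer: float) -> float:
    """Equation (7): P(E_k | H) = 1 - d(A_k, Q_k^(H))."""
    return 1.0 - distance(user_answer, expected_answer)


print(likelihood_given_h(1.0, 1.0))  # answers match
print(likelihood_given_h(1.0, 0.0))  # answers are opposite
print(likelihood_given_h(0.5, 1.0))  # user is unsure
```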

Computing $P (E_{k} | \neg H)$

The idea is the same as above, except that we have to take all the other characters into account.

$P(E_{k} | \neg H) = 1 - \mathrm{avgDist}(\mathrm{OurAnsGivenNot}_{H}[k], \mathrm{UserAns}[k])$

$P(E_{k} | \neg H) = 1 - \frac{1}{\mathrm{Card}(C \setminus H)} \sum_{X \in C \setminus H} d(A_{k}, Q_{k}^{(X)}) \quad (8)$
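
Equation (8) averages the distance over every character except H. A possible Python sketch for a single question $k$ follows; the dictionary layout and the character names are hypothetical.

```python
def likelihood_given_not_h(
    user_answer: float,
    expected_answers: dict[str, float],
    hypothesis: str,
) -> float:
    """Equation (8): P(E_k | not H) for one question k.

    expected_answers maps each character X to Q_k^(X), a value in [0, 1].
    """
    others = [q for name, q in expected_answers.items() if name != hypothesis]
    if not others:
        raise ValueError("need at least one character other than the hypothesis")
    average_distance = sum(abs(user_answer - q) for q in others) / len(others)
    return 1.0 - average_distance


# Hypothetical question "has red hair?": only character "A" does.
expected = {"A": 1.0, "B": 0.0, "C": 0.0}
print(likelihood_given_not_h(1.0, expected, hypothesis="A"))
print(likelihood_given_not_h(1.0, expected, hypothesis="B"))
```

When the user answers "yes", the evidence is impossible if the character is not "A" in this toy data, while it keeps a probability of 0.5 under "not B".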

Finally..

From $(1)$ , $(2)$ , $(6)$ , $(7)$ and $(8)$ we have,

$P(H | E_{1} \land E_{2} \land \dots \land E_{N}) = \frac{\frac{1}{T} \prod_{k} [1 - d(A_{k}, Q_{k}^{(H)})]}{\frac{1}{T} \prod_{k} [1 - d(A_{k}, Q_{k}^{(H)})] + (1 - \frac{1}{T}) \prod_{k} [1 - D^{(\neg H)}(k)]} \quad (9)$

Read as "probability of the hypothesis H given the evidence $E = E_{1}, E_{2}, \dots, E_{N}$".

H: hypothesis $\equiv$ character
$(E_{i})_{1 \leq i \leq N}$ : observable evidence, defined implicitly by the user's responses
$A_{k}$ : the user's answer to question $k$, a real number between 0 and 1.
$Q_{k}^{(H)}$ : correct answer to question $k$ given H, a real number between 0 and 1.
$d(x, y) := |x - y|$ (normalized distance for all $x, y \in [0, 1]$)
$D^{(\neg H)}(k) = \frac{1}{\mathrm{Card}(C \setminus H)} \sum_{X \in C \setminus H} d(A_{k}, Q_{k}^{(X)})$
$C$ : set of all characters
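
Putting the pieces together, equation (9) can be sketched in Python as below. The knowledge base, character names and question names are made up for the example, and the nested-dictionary layout is an assumption rather than the project's actual data model.

```python
import math


def posterior_for(
    hypothesis: str,
    knowledge: dict[str, dict[str, float]],
    answers: dict[str, float],
) -> float:
    """Equation (9): P(H | E_1 and ... and E_N).

    knowledge[X][k] is Q_k^(X); answers[k] is the user's answer A_k.
    """
    others = [name for name in knowledge if name != hypothesis]
    if not others:
        raise ValueError("need at least two characters")
    prior = 1.0 / len(knowledge)  # equation (1)
    # Product of 1 - d(A_k, Q_k^(H)) over the answered questions.
    match_h = math.prod(
        1.0 - abs(a - knowledge[hypothesis][k]) for k, a in answers.items()
    )
    # Product of 1 - D^(not H)(k) over the answered questions.
    match_not_h = math.prod(
        1.0 - sum(abs(a - knowledge[x][k]) for x in others) / len(others)
        for k, a in answers.items()
    )
    numerator = prior * match_h
    denominator = numerator + (1.0 - prior) * match_not_h
    return numerator / denominator if denominator > 0.0 else 0.0


knowledge = {
    "Alice": {"has_red_hair": 0.9, "wears_glasses": 0.1},
    "Bob": {"has_red_hair": 0.1, "wears_glasses": 0.9},
    "Carol": {"has_red_hair": 0.2, "wears_glasses": 0.2},
}
answers = {"has_red_hair": 1.0, "wears_glasses": 0.0}

scores = {name: posterior_for(name, knowledge, answers) for name in knowledge}
best = max(scores, key=scores.get)
print(best, scores)
```

Each score compares one character against "everyone else", so the scores of different characters are not expected to sum to 1.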

Finding the next optimal question

For our "AI" to look more "AI-ish", we have to solve this optimization problem:

"Maximize the hypothesis's probability as quickly as possible", i.e. "ask as few questions as possible".

First of all, how are we going to quantify information? Information theory has our back!

$\mathrm{probability} = 1 / 2^{\mathrm{BitsOfInformation}}$

$p = \frac{1}{2^{I}}$

For example, if $p = \frac{1}{2}$ then the event is worth 1 bit of information ($I = 1$): 50% one thing and 50% something else, just as a random bit has a 50% chance of being either 0 or 1.

The lower the probability the bigger the information value.

Here is an interesting example: almost 70% of the population has black hair, whereas only 2% has red hair. If we want to ask as few questions as possible about what a person looks like, asking whether the person has red hair brings much more information when the answer is yes. In that case, congratulations! We have just reduced our search space to 2% of the population.

$0.7 = \frac{1}{2^{I}} \Rightarrow I \approx 0.51$

$0.02 = \frac{1}{2^{I}} \Rightarrow I \approx 5.64$

In the second case, we have reduced our search space to 1/50 of the initial size of the population.
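
The two values above can be reproduced by solving $p = 1 / 2^{I}$ for $I$, which gives $I = -\log_{2} p$. A short Python check (the function name is an illustrative choice):

```python
import math


def information_bits(p: float) -> float:
    """Information value I (in bits) of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


print(round(information_bits(0.7), 2))   # 0.51 (black hair)
print(round(information_bits(0.02), 2))  # 5.64 (red hair)
print(information_bits(0.5))             # 1.0 (a fair coin flip)
```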

Let $I(Q)$ be the information value of the question $Q$ in our database.

$P(Q) = \frac{1}{2^{I(Q)}} \to I(Q) = -\log_{2} P(Q)$

$P(Q)$ tells us how likely it is that a character in our knowledge base matches question $Q$. Let's try to approximate this value with the few things we have at hand. Let $\mathrm{Poss}(Q)$ be the set of all characters that match $Q$.

$\mathrm{Poss}(Q) = \{X \in C \mid \mathrm{prob}(Q^{(X)}) > 1/2\} \quad (Q \in \mathrm{SetQ})$

This can be read as "every character $X$ whose probability associated with question $Q$ is greater than $0.5$ (a Yes)".

To approximate $P(Q)$ we need to be a little more cautious. What we are looking for is a quantity that captures $P(Q)$ and also satisfies:

$\sum_{Q \in \mathrm{SetQ}} P(Q) = 1$

In our model, we shall use:

$P(Q) := \frac{\mathrm{Card}(\mathrm{Poss}(Q))}{\sum_{X \in \mathrm{SetQ}} \mathrm{Card}(\mathrm{Poss}(X))}$

This quantity is a proportion, so it lies between 0 and 1 and sums to 1 over all questions, which makes it a suitable candidate.

First, let’s only consider the questions that correspond to at least 1 character.

$\mathrm{SetQ}_{\geq 1} = \{Q \in \mathrm{SetQ} \mid \mathrm{Poss}(Q) \neq \emptyset\}$

$\forall Q \in \mathrm{SetQ}_{\geq 1}, \quad I(Q) = -\log_{2} P(Q) = -\log_{2} \left(\frac{\mathrm{Card}(\mathrm{Poss}(Q))}{\sum_{X \in \mathrm{SetQ}_{\geq 1}} \mathrm{Card}(\mathrm{Poss}(X))}\right)$

If $\mathrm{Poss}(Q)$ is an empty set (a question that does not correspond to any character according to our definition), then there is no point in asking such a question in practice because it tells us nothing useful about our set of characters.
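
The definitions of $\mathrm{Poss}(Q)$, $P(Q)$ and $I(Q)$ can be sketched in Python as follows. The knowledge base is invented for the example, and the nested-dictionary layout is an assumption, not the project's actual storage format.

```python
import math


def question_information(knowledge: dict[str, dict[str, float]]) -> dict[str, float]:
    """Return I(Q) in bits for every question with a non-empty Poss(Q).

    knowledge[X][Q] is the probability that character X answers Yes to Q.
    """
    # Card(Poss(Q)): number of characters whose probability exceeds 1/2.
    counts: dict[str, int] = {}
    for answers in knowledge.values():
        for question, prob in answers.items():
            if prob > 0.5:
                counts[question] = counts.get(question, 0) + 1
    # Questions with an empty Poss(Q) never enter `counts`, so they are skipped.
    total = sum(counts.values())
    return {q: -math.log2(n / total) for q, n in counts.items()}


knowledge = {
    "Alice": {"has_red_hair": 0.9, "has_black_hair": 0.1, "wears_glasses": 0.1, "is_robot": 0.0},
    "Bob": {"has_red_hair": 0.1, "has_black_hair": 0.9, "wears_glasses": 0.9, "is_robot": 0.0},
    "Carol": {"has_red_hair": 0.2, "has_black_hair": 0.8, "wears_glasses": 0.2, "is_robot": 0.0},
}
info = question_information(knowledge)
print(info)
```

In this toy data `has_black_hair` matches two characters out of four matches in total, so it is worth 1 bit, while the rarer `has_red_hair` and `wears_glasses` are worth 2 bits each. `is_robot` matches nobody and is left out.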

