
Commit a9837c2

Updated report
1 parent 795fe93 commit a9837c2

File tree

5 files changed: +53 −7 lines changed


report/src/figures/mcts.png

60.9 KB

report/src/main.tex

Lines changed: 3 additions & 2 deletions
@@ -11,9 +11,9 @@
 \addbibresource{references.bib}
 
 \title{
-\rule{\linewidth}{1pt} \\[6pt]
+\rule{\linewidth}{0.5pt} \\[6pt]
 \huge Autonomous Software Agents \\ Project Report \\
-\rule{\linewidth}{1pt} \\[10pt]
+\rule{\linewidth}{2pt} \\[10pt]
 }
 \author{
 \begin{tabular}{c}
@@ -27,6 +27,7 @@
 \maketitle
 
 \input{sections/introduction.tex}
+\input{sections/background.tex}
 \input{sections/method.tex}
 
 \printbibliography

report/src/sections/background.tex

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+\section{Background}
+
+Before delving into the details of our work, we provide a brief overview of the key concepts it builds on.
+
+\subsection{Monte Carlo Tree Search}
+
+Monte Carlo Tree Search \parencite{mcts} is a heuristic search algorithm for decision processes that has been successfully applied to a variety of games, such as Go and Chess. The algorithm performs a guided random search that focuses on the more promising moves, expanding the search tree by randomly sampling the search space instead of exhaustively visiting it as a brute-force approach would. More specifically, the algorithm builds the game tree by iteratively repeating the following four steps (see Figure \ref{fig:mcts} for a visual representation):
+
+\begin{enumerate}
+\item \textbf{Selection}: starting from the root $R$, which represents the current game state, traverse the tree until a leaf $L$ is reached. A key aspect for the proper functioning of the algorithm is balancing the \textit{exploitation} of already promising paths with the \textit{exploration} of children with fewer simulations. The most widely used formula to compute this trade-off is the UCT (Upper Confidence Bound 1 applied to Trees)
+\begin{align*}
+UCT & = \underbrace{\frac{w_{i}}{n_{i}}}_{\text{exploitation term}} + c \underbrace{\sqrt{\frac{\ln N_{i}}{n_{i}}}}_{\text{exploration term}}
+\end{align*}
+where
+\begin{itemize}
+\item $w_i$ is the number of wins obtained in the subtree of $i$
+\item $n_i$ is the number of times node $i$ has been visited
+\item $N_i$ is the number of times the parent of node $i$ has been visited
+\item $c$ is the exploration parameter, usually set to $\sqrt{2}$
+\end{itemize}
+
+\item \textbf{Expansion}: if $L$ is not a terminal node (i.e. there are valid moves that can be performed from the game state in $L$), pick a node $C$ among its children that have not yet been expanded
+
+\item \textbf{Simulation}: starting from $C$, randomly choose valid moves until a terminal state is reached and the game is decided (i.e. win/loss/draw)
+
+\item \textbf{Backpropagation}: the result of the simulation is used to update the statistics (number of wins and number of visits) of all nodes along the path from $C$ to $R$, which are then used to compute the UCT values in the following iterations
+\end{enumerate}
+
+\begin{figure}
+\centering
+\includegraphics[width=\linewidth]{figures/mcts.png}
+\caption{MCTS example for a two-player game such as chess.}
+\label{fig:mcts}
+\end{figure}
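
To make the four MCTS steps enumerated in background.tex concrete, here is a minimal sketch of the loop in TypeScript. It is only an illustration of the technique, not code from this repository: the Game interface and every name in it are hypothetical, and the reward is backed up from a single perspective (as when scoring desires), whereas an adversarial two-player implementation would flip the result at each tree level.

// Minimal MCTS sketch: selection via UCT, expansion, random simulation,
// backpropagation. All names are hypothetical, not taken from the project.
interface Game<S, M> {
  moves(state: S): M[];            // valid moves from a state
  apply(state: S, move: M): S;     // state reached after playing a move
  isTerminal(state: S): boolean;
  reward(state: S): number;        // 1 = win, 0.5 = draw, 0 = loss
}

class Node<S, M> {
  wins = 0;    // w_i
  visits = 0;  // n_i
  children: Node<S, M>[] = [];
  untried: M[];
  constructor(public state: S, game: Game<S, M>, public parent?: Node<S, M>) {
    this.untried = game.moves(state);
  }
  // UCT = w_i / n_i + c * sqrt(ln(N_i) / n_i); unvisited children are tried first
  uct(c = Math.SQRT2): number {
    if (this.visits === 0) return Infinity;
    return this.wins / this.visits +
      c * Math.sqrt(Math.log(this.parent!.visits) / this.visits);
  }
}

function mcts<S, M>(game: Game<S, M>, rootState: S, iterations: number): Node<S, M> {
  const root = new Node(rootState, game);
  for (let i = 0; i < iterations; i++) {
    // 1. Selection: descend while the node is fully expanded and has children
    let node = root;
    while (node.untried.length === 0 && node.children.length > 0) {
      node = node.children.reduce((a, b) => (a.uct() > b.uct() ? a : b));
    }
    // 2. Expansion: add one not-yet-expanded child, if any
    if (node.untried.length > 0) {
      const move = node.untried.pop()!;
      const child = new Node(game.apply(node.state, move), game, node);
      node.children.push(child);
      node = child;
    }
    // 3. Simulation: play random valid moves until the game is decided
    let state = node.state;
    while (!game.isTerminal(state)) {
      const moves = game.moves(state);
      state = game.apply(state, moves[Math.floor(Math.random() * moves.length)]);
    }
    // 4. Backpropagation: update statistics from the expanded node up to the root
    const result = game.reward(state);
    for (let n: Node<S, M> | undefined = node; n; n = n.parent) {
      n.visits += 1;
      n.wins += result;
    }
  }
  return root;
}

After the chosen number of iterations, the most-visited (or highest-valued) child of the root is typically taken as the best move; in the report's setting the accumulated statistics serve as forward-looking scores for the agent's desires.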

report/src/sections/method.tex

Lines changed: 16 additions & 5 deletions
@@ -28,6 +28,7 @@ \section{Method}
 
 Before delving into the details of the implementation, some clarifications need to be made. While for the sake of simplicity we have presented the agent control loop as a sequential process, in practice most of the operations are performed concurrently. Indeed, this is a paramount requirement in a highly dynamic environment such as that of the Deliveroo game, where changes in the environment can happen even while the agent is reasoning, planning and acting. Thus, all perceptions, communications and updates are performed asynchronously and in an event-driven fashion, so that the belief base and the desires are updated as soon as new information is available.
 
+
 \subsection{Team Communication}
 
 As stated in the introduction, not all agents in the environment are adversarial. In fact, agents may form teams and cooperate with each other to better achieve their goals. To allow the formation of teams and the exchange of information among the agents (necessary for effective and productive cooperation), a communication protocol has been implemented.
@@ -44,6 +45,7 @@ \subsection{Team Communication}
 
 As for the observations, each time an agent updates its desires, it relays the new desires to the other agents in the team. In this way, the agents in the team can coordinate their actions and avoid interfering with each other (e.g. by not picking up the same parcels) to better achieve their goals. Finally, each time the position of an agent changes, it relays its new position to the other agents in the team.
 
+
 \subsection{Belief base}
 \label{sec:belief-base}
 
@@ -66,16 +68,25 @@ \subsection{Belief base}
 To assess whether an agent is moving randomly, a simple heuristic has been defined. If, from the first time the agent was observed to the last time it was observed, its observed score is below a certain threshold, the agent is considered a random agent. The threshold is defined as the expected reward a greedy agent would have obtained in the same time span. This is computed in the following way:
 
 \begin{equation*}
-\mathbb{E}_{\text{greedy}} = \frac{\mathbb{E}_{\texttt{distance}}}{\mu_{\texttt{distance}}} \cdot \frac{\mu_{\texttt{reward}}}{\texttt{numSmartAgents}}
+\mathbb{E}_{\text{greedy}}[r] = \frac{\mathbb{E}[dist]}{\mu_{\texttt{dist}}} \cdot \frac{\mu_{\texttt{reward}}}{n_{\texttt{smart}}}
 \end{equation*}
 
-where $\mathbb{E}_{\texttt{distance}}$ is the expected distance covered by the agent in the time span, $\mu_{\texttt{distance}}$ is the average distance between parcels in the map, $\mu_{\texttt{reward}}$ is the average reward of the parcels, and \texttt{numSmartAgents} is the number of agents in the environment that are not random agents.
+where $\mathbb{E}[dist]$ is the expected distance covered by the agent in the time span, $\mu_{\texttt{dist}}$ is the average distance between parcels in the map, $\mu_{\texttt{reward}}$ is the average reward of the parcels, and $n_{\texttt{smart}}$ is the number of agents in the environment that are not random agents.
 
 Finally, note that here we do not make any assumptions about the possible intentions of other agents. Indeed, modelling the behaviour of other agents is a complex and challenging task that may require learning-based approaches. Therefore, we preferred not to implement any partial model of the other agents' intentions, given that wrong assumptions about their actions can lead to suboptimal decisions.
 
-\subsection{Search}
 
-This section describes the search algorithm used to update the set of desires of the agent and to select the intention to pursue.
+\subsection{Desires}
+
+In the BDI model, the desires represent the goals that the agent wants to achieve. In the context of the Deliveroo game, at each time instant the goals of the agent are to move to a location where there are parcels or to deliver the currently held parcels (if any). Therefore, each time there is a change in the set of available parcels, the agent needs to update its desires accordingly.
+
+Since the long-term objective of the agent is to maximize its reward, not all desires are equally important and worth pursuing. It is therefore necessary to assign each desire a score that reflects its usefulness for the agent's long-term goal. We deemed it insufficient to score desires by their immediate reward alone, as this would make the agent short-sighted and blind to the future consequences of its actions. To avoid this, we implemented a forward-looking scoring function based on Monte Carlo Tree Search: with MCTS, the score of each desire depends not only on the immediate reward, but also on the expected future reward that the agent can obtain after pursuing the desire.
+
+\subsection{Selection}
+\label{sec:selection}
+
+The selection phase is the process by which the agent selects the desire to pursue as its current intention.
+
 
 \subsection{Planning}
 
@@ -89,7 +100,7 @@ \subsection{Planning}
 
 \begin{figure}
 \centering
-\includegraphics[width=0.49\textwidth]{sections/figures/path-bottleneck.png}
+\includegraphics[width=0.49\textwidth]{figures/path-bottleneck.png}
 \caption{An example of a path bottleneck. The tiles in yellow belong to the path bottleneck from the start position to the end tile.}
 \label{fig:path-bottleneck}
 \end{figure}
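
The random-agent heuristic added in the belief-base hunk above boils down to one comparison: the score an observed agent actually gained over the observation window versus the greedy expectation $\mathbb{E}_{\text{greedy}}[r]$. A hedged TypeScript sketch of that check follows; the AgentObservation shape, parameter names and the division-by-zero guard are assumptions for illustration, not the project's actual API.

// Flags an observed agent as random when its score gain over the observation
// window is below the expected reward of a greedy agent in the same span.
// All names here are hypothetical.
interface AgentObservation {
  firstSeenScore: number;   // score when the agent was first observed
  lastSeenScore: number;    // score at the most recent observation
  expectedDistance: number; // E[dist]: expected distance covered in the time span
}

function isRandomAgent(
  obs: AgentObservation,
  meanParcelDistance: number, // mu_dist: average distance between parcels
  meanParcelReward: number,   // mu_reward: average parcel reward
  numSmartAgents: number      // n_smart: non-random agents in the environment
): boolean {
  // E_greedy[r] = (E[dist] / mu_dist) * (mu_reward / n_smart)
  const expectedGreedyReward =
    (obs.expectedDistance / meanParcelDistance) *
    (meanParcelReward / Math.max(1, numSmartAgents)); // guard against n_smart = 0
  const observedGain = obs.lastSeenScore - obs.firstSeenScore;
  return observedGain < expectedGreedyReward;
}

In words: a greedy agent is expected to collect roughly one average-valued parcel per average inter-parcel distance travelled, shared among the non-random agents; an observed agent that stays well below that pace is treated as moving randomly.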
