From 71dc8c96642f09481281c9431b7e46c2fc646928 Mon Sep 17 00:00:00 2001
From: Patrick Liu
Date: Sat, 15 Jun 2024 00:40:39 +0800
Subject: [PATCH] Add Lingo-2

---
 README.md             | 10 ++++++++--
 paper_notes/lingo2.md | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 2 deletions(-)
 create mode 100644 paper_notes/lingo2.md

diff --git a/README.md b/README.md
index 2ec10d7..c7fa726 100755
--- a/README.md
+++ b/README.md
@@ -35,6 +35,9 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
 - [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)
 
 ## 2024-06 (0)
+- [LINGO-1: Exploring Natural Language for Autonomous Driving](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/) [[Notes](paper_notes/lingo1.md)] [Wayve, open-loop world model]
+- [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/) [[Notes](paper_notes/lingo2.md)] [Wayve, closed-loop world model]
+- [Enhancing End-to-End Autonomous Driving with Latent World Model](https://arxiv.org/abs/2406.08481)
 - [OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments](https://arxiv.org/abs/2312.09243) [Jiwen Lu]
 - [RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision](https://arxiv.org/abs/2309.09502) ICRA 2024
 - [EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision](https://arxiv.org/pdf/2311.02077) [Sanja, Marco, NV]
@@ -52,11 +55,14 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
 - [Enable Faster and Smoother Spatio-temporal Trajectory Planning for Autonomous Vehicles in Constrained Dynamic Environment](https://journals.sagepub.com/doi/abs/10.1177/0954407020906627) JAE 2020 [Joint optimization, search]
 - [Focused Trajectory Planning for Autonomous On-Road Driving](https://www.ri.cmu.edu/pub_files/2013/6/IV2013-Tianyu.pdf) IV 2013 [Joint optimization, Iteration]
 - [SSC: Safe Trajectory Generation for Complex Urban Environments Using Spatio-Temporal Semantic Corridor](https://arxiv.org/abs/1906.09788) RAL 2019 [Joint optimization, SSC, Wenchao Ding]
+- [MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving](https://ieeexplore.ieee.org/document/7139412) ICRA 2015
+- [MPDM2: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction](https://www.roboticsproceedings.org/rss11/p43.pdf) RSS 2015
+- [MPDM3: Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment](https://link.springer.com/article/10.1007/s10514-017-9619-z) Autonomous Robots 2017
 - [EUDM: Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching](https://arxiv.org/abs/2003.02746) ICRA 2020 [Wenchao Ding]
-- [MPDM: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction](https://www.roboticsproceedings.org/rss11/p43.pdf) RSS 2011
-- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) Nature 2016 [DeepMind]
+- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) Nature 2016 [DeepMind, MCTS]
 - [AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play](https://www.science.org/doi/full/10.1126/science.aar6404) Science 2017 [DeepMind]
 - [MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model](https://www.nature.com/articles/s41586-020-03051-4) Nature 2020 [DeepMind]
+- [Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/1610.03295) [Mobileye, desires and trajectory optimization]
 
 ## 2024-03 (11)
 - [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
diff --git a/paper_notes/lingo2.md b/paper_notes/lingo2.md
new file mode 100644
index 0000000..c78e2ef
--- /dev/null
+++ b/paper_notes/lingo2.md
@@ -0,0 +1,34 @@
+# [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/)
+
+_June 2024_
+
+tl;dr: First closed-loop world model that can output actions for autonomous driving.
+
+#### Overall impression
+This is perhaps the second world-model-driven autonomous driving system deployed in the real world, after FSDv12. Another example is [ApolloFM (from AIR Tsinghua, blog in Chinese)](https://mp.weixin.qq.com/s/8d1qXTm5v4H94HxAibp1dA).
+
+Wayve calls this model a VLAM (vision-language-action model). It improves upon the previous work of [Lingo-1](lingo1.md), an open-loop driving commentator, and [Lingo-1-X](https://wayve.ai/thinking/lingo-1-referential-segmentation/), which can output referential segmentations. Lingo-1-X extends the vision-language model into the VLX (vision-language-X) domain. Lingo-2 now officially dives into the new domain of decision making and includes action as the X output.
+
+The action output of Lingo-2's VLAM is a bit different from that of RT-2: Lingo-2 predicts trajectory waypoints (like ApolloFM) rather than control actions (as in FSD).
+
+The blog claims that this is a strong first indication of the alignment between explanations and decision-making. --> Lingo-2 does output driving behavior and textual predictions in real time, but I feel the "alignment" claim needs to be examined further.
+
+
+#### Key ideas
+- Why language?
+  - Language opens up new possibilities for accelerating learning by incorporating a description of driving actions and causal reasoning into the model’s training.
+  - Natural language interfaces could, in the future, allow users to engage in conversations with the driving model, making it easier for people to understand these systems and build trust.
+- Architecture (see the toy sketch at the end of the Notes section)
+  - The Wayve vision model processes camera images from consecutive timestamps into a sequence of tokens.
+  - Auto-regressive language model.
+  - Input: video tokens and additional variables (route, current speed, and speed limit) are fed into the language model.
+  - Output: a driving trajectory and commentary text.
+- The language input allows driving instructions to be given in natural language (turning left, turning right, or going straight at an intersection).
+
+#### Technical details
+- The E2E system relies on a photorealistic simulator. [Ghost Gym](https://wayve.ai/thinking/ghost-gym-neural-simulator/) creates photorealistic 4D worlds for training, testing, and debugging end-to-end AI driving models.
+
+#### Notes
+- The blog does not say whether the video tokenizer works better on latent embeddings from a vision model or on raw images directly (like VQ-GAN or MAGVIT). It would be interesting to see an ablation study on this.
+- If language is taken out of the training and inference process (by distilling it into a VA model), how much performance would Lingo-2 lose? It would be interesting to see an ablation on this as well.
+
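+To make the input/output interface described under "Architecture" concrete, below is a minimal, hypothetical PyTorch sketch of a VLAM-style model: video tokens plus route/speed/speed-limit conditioning feed an auto-regressive decoder that emits commentary-token logits and trajectory waypoints. All module names, dimensions, and heads are my own assumptions for illustration, not Wayve's implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ToyVLAM(nn.Module):
+    """Toy vision-language-action interface (hypothetical, not Wayve's code)."""
+
+    def __init__(self, d_model=256, vocab_size=32000, n_waypoints=10):
+        super().__init__()
+        self.n_waypoints = n_waypoints
+        # Conditioning on route, current speed, and speed limit (3 scalars).
+        self.cond_proj = nn.Linear(3, d_model)
+        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
+        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
+        self.text_head = nn.Linear(d_model, vocab_size)       # commentary tokens
+        self.traj_head = nn.Linear(d_model, n_waypoints * 2)  # (x, y) waypoints
+
+    def forward(self, video_tokens, text_embeds, route, speed, speed_limit):
+        # video_tokens: (B, T_v, d_model) from a vision tokenizer over consecutive frames
+        # text_embeds:  (B, T_t, d_model) embedded prompt / instruction tokens
+        cond = self.cond_proj(torch.stack([route, speed, speed_limit], dim=-1))
+        memory = torch.cat([video_tokens, cond.unsqueeze(1)], dim=1)
+        # Causal tgt_mask omitted for brevity; a real auto-regressive decoder needs it.
+        h = self.decoder(tgt=text_embeds, memory=memory)
+        commentary_logits = self.text_head(h)                 # textual driving commentary
+        waypoints = self.traj_head(h[:, -1]).view(-1, self.n_waypoints, 2)
+        return commentary_logits, waypoints
+
+
+model = ToyVLAM()
+video = torch.randn(1, 32, 256)   # e.g. 8 frames x 4 tokens per frame
+prompt = torch.randn(1, 8, 256)   # embedded instruction, e.g. "turn left at the intersection"
+route, speed, limit = torch.tensor([1.0]), torch.tensor([8.3]), torch.tensor([13.9])
+logits, waypoints = model(video, prompt, route, speed, limit)  # (1, 8, 32000), (1, 10, 2)
+```
+
+Predicting waypoints rather than low-level control actions keeps the output in the same space as ApolloFM-style planners; the text head is what a distilled VA model would drop in the ablation suggested above.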