Add Lingo-2
patrick-llgc committed Jun 14, 2024
1 parent 3540c4d commit 71dc8c9
Showing 2 changed files with 42 additions and 2 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -35,6 +35,9 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)

## 2024-06 (0)
- [LINGO-1: Exploring Natural Language for Autonomous Driving](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/) [[Notes](paper_notes/lingo1.md)] [Wayve, open-loop world model]
- [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/) [[Notes](paper_notes/lingo2.md)] [Wayve, closed-loop world model]
- [Enhancing End-to-End Autonomous Driving with Latent World Model](https://arxiv.org/abs/2406.08481)
- [OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments](https://arxiv.org/abs/2312.09243) [Jiwen Lu]
- [RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision](https://arxiv.org/abs/2309.09502) <kbd>ICRA 2024</kbd>
- [EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision](https://arxiv.org/pdf/2311.02077) [Sanja, Marco, NV]
@@ -52,11 +55,14 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Enable Faster and Smoother Spatio-temporal Trajectory Planning for Autonomous Vehicles in Constrained Dynamic Environment](https://journals.sagepub.com/doi/abs/10.1177/0954407020906627) <kbd>JAE 2020</kbd> [Joint optimization, search]
- [Focused Trajectory Planning for Autonomous On-Road Driving](https://www.ri.cmu.edu/pub_files/2013/6/IV2013-Tianyu.pdf) <kbd>IV 2013</kbd> [Joint optimization, Iteration]
- [SSC: Safe Trajectory Generation for Complex Urban Environments Using Spatio-Temporal Semantic Corridor](https://arxiv.org/abs/1906.09788) <kbd>RAL 2019</kbd> [Joint optimization, SSC, Wenchao Ding]
- [MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving](https://ieeexplore.ieee.org/document/7139412) <kbd>ICRA 2015</kbd>
- [MPDM2: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction](https://www.roboticsproceedings.org/rss11/p43.pdf) <kbd>RSS 2015</kbd>
- [MPDM3: Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment](https://link.springer.com/article/10.1007/s10514-017-9619-z) <kbd>Auton. Robots 2017</kbd>
- [EUDM: Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching](https://arxiv.org/abs/2003.02746) <kbd>ICRA 2020</kbd> [Wenchao Ding]
- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) <kbd>Nature 2016</kbd> [DeepMind, MCTS]
- [AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play](https://www.science.org/doi/full/10.1126/science.aar6404) <kbd>Science 2018</kbd> [DeepMind]
- [MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model](https://www.nature.com/articles/s41586-020-03051-4) <kbd>Nature 2020</kbd> [DeepMind]
- [Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/1610.03295) [Mobileye, desires and trajectory optimization]

## 2024-03 (11)
- [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
34 changes: 34 additions & 0 deletions paper_notes/lingo2.md
@@ -0,0 +1,34 @@
# [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/)

_June 2024_

tl;dr: The first closed-loop world model that can output actions for autonomous driving.

#### Overall impression
This is perhaps the second world-model-driven autonomous driving system deployed in the real world, after FSDv12. Another example is [ApolloFM (from AIR Tsinghua, blog in Chinese)](https://mp.weixin.qq.com/s/8d1qXTm5v4H94HxAibp1dA).

Wayve calls this model a VLAM (vision-language-action model). It improves upon the previous work of [Lingo-1](lingo1.md), an open-loop driving commentator, and [Lingo-1-X](https://wayve.ai/thinking/lingo-1-referential-segmentation/), which can output referential segmentations. Lingo-1-X extends the vision-language model into the VLX (vision-language-X) domain, and Lingo-2 now officially dives into the new domain of decision making, with action as the X output.

The action output from Lingo-2's VLAM is a bit different from that of RT-2: Lingo-2 predicts trajectory waypoints (like ApolloFM) rather than low-level actions (as in FSD), as sketched below.
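
To make the distinction concrete, here is a minimal, purely illustrative sketch of the two output parameterizations; the field names, units, and shapes are my assumptions, not taken from the blog.

```python
# Illustrative only: field names, units, and shapes are assumptions, not from the blog.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WaypointTrajectory:
    """Lingo-2 / ApolloFM style output: future poses in the ego frame."""
    waypoints: List[Tuple[float, float]]  # (x, y) in meters, one per future timestep
    dt: float = 0.25                      # assumed spacing between waypoints, seconds


@dataclass
class ControlAction:
    """FSD / RT-2 style output: low-level control commands."""
    steering: float      # normalized steering, e.g. in [-1, 1]
    acceleration: float  # m/s^2, negative for braking


# A downstream controller would track the waypoints, whereas control actions
# can be applied (almost) directly to the actuators.
```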

The blog claims that this is a strong first indication of alignment between explanations and decision-making. --> Lingo-2 does output driving behavior and textual predictions in real time, but I feel the "alignment" claim needs to be examined further.


#### Key ideas
- Why language?
    - Language opens up new possibilities for accelerating learning by incorporating a description of driving actions and causal reasoning into the model’s training.
    - Natural language interfaces could, in the future, allow users to engage in conversations with the driving model, making it easier for people to understand these systems and build trust.
- Architecture (see the sketch after this list)
    - The Wayve vision model processes camera images at consecutive timestamps into a sequence of tokens.
    - An auto-regressive language model consumes these tokens.
    - Input: video tokens and additional variables (route, current speed, and speed limit) are fed into the language model.
    - Output: a driving trajectory and commentary text.
    - The language input also allows driving instructions through natural language (turning left, right, or going straight at an intersection).
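
To pin down the described data flow, here is a minimal PyTorch-style sketch. It assumes already-embedded instruction tokens, a language model that returns (hidden states, text logits), and scalar conditioning variables injected as one extra token; all module names, shapes, and the trajectory head are my assumptions, not Wayve's implementation.

```python
# Minimal sketch of the data flow described above; not Wayve's implementation.
import torch
import torch.nn as nn


class LingoStyleVLAM(nn.Module):
    def __init__(self, vision_encoder, language_model, d_model=1024, n_waypoints=10):
        super().__init__()
        self.vision_encoder = vision_encoder     # camera frames -> video tokens
        self.language_model = language_model     # auto-regressive transformer (assumed interface)
        self.state_proj = nn.Linear(3, d_model)  # route, current speed, speed limit
        self.traj_head = nn.Linear(d_model, n_waypoints * 2)  # (x, y) waypoints
        self.n_waypoints = n_waypoints

    def forward(self, frames, route, speed, speed_limit, instruction_tokens):
        # 1. Tokenize consecutive camera frames into a sequence of video tokens.
        video_tokens = self.vision_encoder(frames)                 # [B, T_v, d_model]
        # 2. Embed the additional scalar variables as one conditioning token
        #    (a simplification; the blog does not describe how these are injected).
        state = torch.stack([route, speed, speed_limit], dim=-1)   # [B, 3]
        state_tokens = self.state_proj(state).unsqueeze(1)         # [B, 1, d_model]
        # 3. Run the auto-regressive language model over video tokens, the state
        #    token, and (already embedded) natural-language instruction tokens.
        hidden, text_logits = self.language_model(
            torch.cat([video_tokens, state_tokens, instruction_tokens], dim=1)
        )
        # 4. Decode both outputs: commentary text (LM head) and a driving
        #    trajectory (small regression head on the final hidden state).
        trajectory = self.traj_head(hidden[:, -1]).view(-1, self.n_waypoints, 2)
        return text_logits, trajectory
```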

#### Technical details
- The E2E system relies on a photorealistic simulator. [Ghost Gym](https://wayve.ai/thinking/ghost-gym-neural-simulator/) creates photorealistic 4D worlds for training, testing, and debugging Wayve's end-to-end AI driving models (a rough closed-loop rollout sketch follows).
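
A rough illustration of what closed-loop use of such a simulator could look like; the `sim` API (reset/render/step/evaluate) and the metrics are hypothetical placeholders, not Ghost Gym's actual interface.

```python
# Hypothetical closed-loop rollout in a neural simulator; the sim API is assumed.
def closed_loop_rollout(sim, driving_model, max_steps=200):
    state = sim.reset()                       # sample a 4D scene / scenario
    metrics = []
    for _ in range(max_steps):
        frames = sim.render(state)            # photorealistic camera frames
        trajectory, _commentary = driving_model(frames, state)
        state = sim.step(trajectory)          # the sim advances on the model's output,
                                              # closing the loop
        metrics.append(sim.evaluate(state))   # e.g. collisions, off-route distance
        if state.done:
            break
    return metrics
```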

#### Notes
- The blog did not say whether the video tokenizer works better on latent embeddings from a vision model or by directly tokenizing the raw images (like VQ-GAN or MAGVIT). It would be interesting to see an ablation study on this; a rough sketch of the two options follows this list.
- If language is taken out of the training and inference process (by distilling into a VA model), how much performance would Lingo-2 lose? It would be interesting to see an ablation on this as well.
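
A rough sketch contrasting the two tokenization options raised above, purely to make the question concrete; `vision_model`, `quantizer`, and `vq_tokenizer` are placeholders, not Wayve's actual components.

```python
# Two placeholder tokenization paths for video input to the language model.

def tokens_from_latents(frames, vision_model, quantizer):
    """Option A: tokenize continuous latent embeddings from a pretrained vision model."""
    latents = vision_model(frames)      # [B, T, N, d] continuous features
    return quantizer(latents)           # discretize/project into LM-ready tokens


def tokens_from_pixels(frames, vq_tokenizer):
    """Option B: tokenize raw frames directly with a VQ-GAN / MAGVIT-style codebook."""
    return vq_tokenizer.encode(frames)  # [B, T, H'*W'] discrete code indices
```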
