From 71dc8c96642f09481281c9431b7e46c2fc646928 Mon Sep 17 00:00:00 2001
From: Patrick Liu
Date: Sat, 15 Jun 2024 00:40:39 +0800
Subject: [PATCH] Add Lingo-2

---
 README.md             | 10 ++++++++--
 paper_notes/lingo2.md | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 2 deletions(-)
 create mode 100644 paper_notes/lingo2.md

diff --git a/README.md b/README.md
index 2ec10d7..c7fa726 100755
--- a/README.md
+++ b/README.md
@@ -35,6 +35,9 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
 - [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)
 
 ## 2024-06 (0)
+- [LINGO-1: Exploring Natural Language for Autonomous Driving](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/) [[Notes](paper_notes/lingo1.md)] [Wayve, open-loop world model]
+- [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/) [[Notes](paper_notes/lingo2.md)] [Wayve, closed-loop world model]
+- [Enhancing End-to-End Autonomous Driving with Latent World Model](https://arxiv.org/abs/2406.08481)
 - [OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments](https://arxiv.org/abs/2312.09243) [Jiwen Lu]
 - [RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision](https://arxiv.org/abs/2309.09502) ICRA 2024
 - [EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision](https://arxiv.org/pdf/2311.02077) [Sanja, Marco, NV]
@@ -52,11 +55,14 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
 - [Enable Faster and Smoother Spatio-temporal Trajectory Planning for Autonomous Vehicles in Constrained Dynamic Environment](https://journals.sagepub.com/doi/abs/10.1177/0954407020906627) JAE 2020 [Joint optimization, search]
 - [Focused Trajectory Planning for Autonomous On-Road Driving](https://www.ri.cmu.edu/pub_files/2013/6/IV2013-Tianyu.pdf) IV 2013 [Joint optimization, Iteration]
 - [SSC: Safe Trajectory Generation for Complex Urban Environments Using Spatio-Temporal Semantic Corridor](https://arxiv.org/abs/1906.09788) RAL 2019 [Joint optimization, SSC, Wenchao Ding]
+- [MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving](https://ieeexplore.ieee.org/document/7139412) ICRA 2015
+- [MPDM2: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction](https://www.roboticsproceedings.org/rss11/p43.pdf) RSS 2015
+- [MPDM3: Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment](https://link.springer.com/article/10.1007/s10514-017-9619-z) Autonomous Robots 2017
 - [EUDM: Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching](https://arxiv.org/abs/2003.02746) ICRA 2020 [Wenchao Ding]
-- [MPDM: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction](https://www.roboticsproceedings.org/rss11/p43.pdf) RSS 2011
-- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) Nature 2016 [DeepMind]
+- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) Nature 2016 [DeepMind, MCTS]
 - [AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play](https://www.science.org/doi/full/10.1126/science.aar6404) Science 2017 [DeepMind]
 - [MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model](https://www.nature.com/articles/s41586-020-03051-4) Nature 2020 [DeepMind]
+- [Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/1610.03295) [Mobileye, desires and trajectory optimization]
 
 ## 2024-03 (11)
 - [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
diff --git a/paper_notes/lingo2.md b/paper_notes/lingo2.md
new file mode 100644
index 0000000..c78e2ef
--- /dev/null
+++ b/paper_notes/lingo2.md
@@ -0,0 +1,34 @@
+# [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/)
+
+_June 2024_
+
+tl;dr: First closed-loop world model that can output actions for autonomous driving.
+
+#### Overall impression
+This is perhaps the second world-model-driven autonomous driving system deployed in the real world, after FSDv12. Another example is [ApolloFM (from AIR Tsinghua, blog in Chinese)](https://mp.weixin.qq.com/s/8d1qXTm5v4H94HxAibp1dA).
+
+Wayve calls this model a VLAM (vision-language-action model). It improves upon the previous work of [Lingo-1](lingo1.md), an open-loop driving commentator, and [Lingo-1-X](https://wayve.ai/thinking/lingo-1-referential-segmentation/), which can output referential segmentations. Lingo-1-X extends the vision-language model into the VLX (vision-language-X) domain. Lingo-2 now officially dives into the new domain of decision making and includes action as the X output.
+
+The action output of Lingo-2's VLAM is a bit different from that of RT-2: Lingo-2 predicts trajectory waypoints (like ApolloFM) rather than control actions (as in FSD).
+
+The blog claims that this is a strong first indication of the alignment between explanations and decision-making. --> Lingo-2 does output driving behavior and textual predictions in real time, but I feel the "alignment" claim needs to be examined further.
+
+
+#### Key ideas
+- Why language?
+  - Language opens up new possibilities for accelerating learning by incorporating a description of driving actions and causal reasoning into the model’s training.
+  - Natural language interfaces could, in the future, allow users to engage in conversations with the driving model, making it easier for people to understand these systems and build trust.
+- Architecture (see the toy sketch at the end of the Notes section)
+  - The Wayve vision model processes camera images from consecutive timestamps into a sequence of tokens.
+  - Auto-regressive language model.
+  - Input: video tokens and additional variables (route, current speed, and speed limit) are fed into the language model.
+  - Output: a driving trajectory and commentary text.
+- The language input allows driving instructions to be given in natural language (turning left, turning right, or going straight at an intersection).
+
+#### Technical details
+- The E2E system relies on a photorealistic simulator. [Ghost Gym](https://wayve.ai/thinking/ghost-gym-neural-simulator/) creates photorealistic 4D worlds for training, testing, and debugging end-to-end AI driving models.
+
+#### Notes
+- The blog does not say whether the video tokenizer works better on latent embeddings from a vision model or on raw images directly (like VQ-GAN or MAGVIT). It would be interesting to see an ablation study on this.
+- If language is taken out of the training and inference process (by distilling it into a VA model), how much performance would Lingo-2 lose? It would be interesting to see an ablation on this as well.
+
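+To make the input/output interface described under "Architecture" concrete, below is a minimal, hypothetical PyTorch sketch of a VLAM-style model: video tokens plus route/speed/speed-limit conditioning feed an auto-regressive decoder that emits commentary-token logits and trajectory waypoints. All module names, dimensions, and heads are my own assumptions for illustration, not Wayve's implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ToyVLAM(nn.Module):
+    """Toy vision-language-action interface (hypothetical, not Wayve's code)."""
+
+    def __init__(self, d_model=256, vocab_size=32000, n_waypoints=10):
+        super().__init__()
+        self.n_waypoints = n_waypoints
+        # Conditioning on route, current speed, and speed limit (3 scalars).
+        self.cond_proj = nn.Linear(3, d_model)
+        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
+        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
+        self.text_head = nn.Linear(d_model, vocab_size)       # commentary tokens
+        self.traj_head = nn.Linear(d_model, n_waypoints * 2)  # (x, y) waypoints
+
+    def forward(self, video_tokens, text_embeds, route, speed, speed_limit):
+        # video_tokens: (B, T_v, d_model) from a vision tokenizer over consecutive frames
+        # text_embeds:  (B, T_t, d_model) embedded prompt / instruction tokens
+        cond = self.cond_proj(torch.stack([route, speed, speed_limit], dim=-1))
+        memory = torch.cat([video_tokens, cond.unsqueeze(1)], dim=1)
+        # Causal tgt_mask omitted for brevity; a real auto-regressive decoder needs it.
+        h = self.decoder(tgt=text_embeds, memory=memory)
+        commentary_logits = self.text_head(h)                 # textual driving commentary
+        waypoints = self.traj_head(h[:, -1]).view(-1, self.n_waypoints, 2)
+        return commentary_logits, waypoints
+
+
+model = ToyVLAM()
+video = torch.randn(1, 32, 256)   # e.g. 8 frames x 4 tokens per frame
+prompt = torch.randn(1, 8, 256)   # embedded instruction, e.g. "turn left at the intersection"
+route, speed, limit = torch.tensor([1.0]), torch.tensor([8.3]), torch.tensor([13.9])
+logits, waypoints = model(video, prompt, route, speed, limit)  # (1, 8, 32000), (1, 10, 2)
+```
+
+Predicting waypoints rather than low-level control actions keeps the output in the same space as ApolloFM-style planners; the text head is what a distilled VA model would drop in the ablation suggested above.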