Limitations of the presented approach
The second stage of the pipeline is implemented on CPU. For better performance everything should run on GPU or even dedicated hardware.
We only detect the ego lane boundaries. However, the approach extends naturally to the left and right neighboring lanes.
The semantic segmentation approach will have problems when there are lane changes. A cleaner solution would be instance segmentation.
We created the lane boundary labels automatically using Carla’s high definition map. Creating labels for a camera installed in a real car is far more challenging. The options I know of are:
You can have humans manually label each image separately. This is done, for example, for NVIDIA’s PilotNet (Ref. [BCD+20]).
Similar to our virtual approach, you can create labels using a high definition map of a real highway (Ref. [BS19]). The challenge is to perfectly localize the vehicle within the map in order to get good labels. Furthermore, to train a lane detection system that works on different highways across the world, you will need high definition maps of lots of different highways.
If you have some system that reliably detects lane markings at short range, you can combine that with visual-inertial odometry to create good labels. Examples of such short-range lane-detection systems would be a lidar, or a camera-based lane-detection system that works well for the first, say, 5 meters, but is not so good further away. Once you have logged some seconds or minutes of driving, you can use visual-inertial odometry to obtain the vehicle trajectory and then stitch together a local map of lane boundaries. Subsequently, you can project those mapped lane boundaries into each image you have logged.
The inverse perspective mapping step relies on very good calibration parameters, i.e., on knowing the position and orientation of the camera with respect to the road. Since we are running simulations here, we know those parameters exactly. In the real world you need to calibrate your camera. Getting the camera intrinsics is typically no problem if you have a chessboard. Obtaining the camera extrinsics (orientation and height) is more challenging and might become another chapter of this book at some point.
We are assuming that the road is flat, which is obviously not true everywhere. We are also neglecting that the car dips or “nose-dives” a bit when braking. In this case, the vehicle’s forward axis is not parallel to the road, which is something we assumed in our derivations.
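To get a feeling for how much the flat-road and zero-pitch assumptions matter, here is a minimal sketch. It assumes a pinhole camera without roll or yaw, and the function name forward_distance as well as all numbers (focal length, principal point, camera height) are made up for illustration; they are not the parameters used in this chapter.

```python
import math


def forward_distance(v, fy, cy, height, pitch_rad):
    """Distance (in meters) at which the ray through pixel row v hits a flat road.

    Assumptions: pinhole camera, no roll or yaw, perfectly flat road.
    pitch_rad is the downward tilt of the optical axis relative to the road.
    """
    # Angle of the pixel ray below the optical axis
    angle_below_axis = math.atan2(v - cy, fy)
    # Angle of the pixel ray below the road-parallel direction
    angle_below_horizontal = pitch_rad + angle_below_axis
    if angle_below_horizontal <= 0:
        raise ValueError("Ray does not hit the road (at or above the horizon)")
    return height / math.tan(angle_below_horizontal)


# Hypothetical camera parameters, chosen only for illustration
fy, cy, height = 1000.0, 360.0, 1.3
v = 400.0  # image row of some lane boundary pixel

# Distance we compute when we assume zero pitch ...
assumed = forward_distance(v, fy, cy, height, pitch_rad=0.0)
# ... versus the distance if the car actually nose-dives by one degree
actual = forward_distance(v, fy, cy, height, pitch_rad=math.radians(1.0))
print(f"assumed pitch 0 deg: {assumed:.1f} m, actual pitch 1 deg: {actual:.1f} m")
```

With these made-up numbers, a single degree of unmodeled pitch moves the point from roughly 32.5 m to roughly 22.6 m. The error grows quickly for pixels close to the horizon, which is why good extrinsics matter so much for far-away lane boundary points.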
Comparison to literature
As our approach to lane detection in this chapter is heavily inspired by the baseline described in Ref. [GBN+19], we want to list the differences:
We are using an approach known as inverse perspective mapping to transform from pixel to road coordinates, since this allows us to fit a lane boundary model in meters. Describing the lane boundary in meters rather than pixels is necessary if we want to use the lane boundary model for control algorithms (see next chapter). Ref. [GBN+19] also transforms to a bird’s eye view, but they use a homography for that. The resulting coordinates are not in meters. Note that metric coordinates are not the aim of that paper, so this should not be seen as a criticism.
For the image segmentation, we are using an off-the-shelf neural network from the great PyTorch library segmentation_models_pytorch.
Our pipeline is similar to the baseline model in Ref. [GBN+19], not their actual model. Their actual model is an end-to-end neural network which fuses the two-step pipeline of the baseline model into one single neural network. This is advantageous, since it increases both accuracy and execution speed. Of course, creating an end-to-end network is also possible for our slightly modified approach, but we keep this as an exercise for the reader 😉.
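To make the first difference more concrete, here is a minimal sketch of a pixel-to-road mapping that returns coordinates in meters. It is not the implementation used in this chapter: it assumes a distortion-free pinhole camera without roll or yaw, a flat road, a road frame with x forward, y left and z up, and a camera frame with x right, y down and z forward. The function names and all parameter values are hypothetical.

```python
import math


def pixel_to_road(u, v, fx, fy, cx, cy, height, pitch_rad):
    """Map pixel (u, v) to road coordinates (x forward, y left) in meters.

    Assumptions: pinhole camera without distortion, no roll or yaw,
    flat road, camera mounted `height` meters above the road and
    tilted downward by `pitch_rad`.
    """
    # Normalized image coordinates (camera frame: x right, y down, z forward)
    xn = (u - cx) / fx
    yn = (v - cy) / fy
    # Ray direction expressed in the road frame (x forward, y left, z up)
    forward = math.cos(pitch_rad) - yn * math.sin(pitch_rad)
    left = -xn
    up = -yn * math.cos(pitch_rad) - math.sin(pitch_rad)
    if up >= 0:
        raise ValueError("Pixel is at or above the horizon")
    # Scale the ray so that it travels from the camera down to the road plane
    scale = height / -up
    return scale * forward, scale * left


def road_to_pixel(x, y, fx, fy, cx, cy, height, pitch_rad):
    """Inverse mapping: project the road point (x, y, 0) into the image."""
    # Vector from the camera to the road point, in the road frame
    dx, dy, dz = x, y, -height
    # Components of that vector along the camera axes (right, down, forward)
    right = -dy
    down = -dx * math.sin(pitch_rad) - dz * math.cos(pitch_rad)
    depth = dx * math.cos(pitch_rad) - dz * math.sin(pitch_rad)
    return cx + fx * right / depth, cy + fy * down / depth


# Hypothetical parameters, for illustration only
params = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
              height=1.3, pitch_rad=math.radians(5.0))
x, y = pixel_to_road(700.0, 500.0, **params)
print(f"x = {x:.2f} m (forward), y = {y:.2f} m (positive is left)")
```

Because the camera height enters the computation, the output has a metric scale. A bird’s eye view obtained from a homography that was chosen without this information only yields coordinates up to an unknown scale, which is fine for fitting a curve but not for feeding a controller.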
Comparison to a real ADAS system: openpilot
It is interesting to see how a real-world lane-detection system works. Luckily, there is one ADAS company that open-sources its software: comma.ai. As you can read in the source code of their product openpilot, their lane-detection system is designed roughly as follows:
Perform online calibration to estimate camera extrinsics
Apply a homography (warpPerspective) to the camera image in order to compute the image that you would get from a camera with default extrinsics. In the openpilot documentation this is referred to as the calibrated frame.
Train a neural net with the default-perspective images. The output of the neural network is the path the vehicle should take (somewhat close to the center between the lane boundaries). I am not totally sure, but based on their Medium article I think they create labels like this: Take recorded videos and estimate the vehicle trajectory using visual odometry. Then, for each image frame, transform this trajectory into the vehicle reference frame at that point in time and use this as a label.
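The label creation guessed at in the last step can be sketched as follows. This is not openpilot code, just an illustration of the coordinate transform under strong simplifications: poses are 2D (x, y, yaw) in a fixed world frame, whereas a real system would work with full 3D poses. The function name trajectory_in_vehicle_frame, the horizon parameter, and the toy odometry are all hypothetical.

```python
import math


def trajectory_in_vehicle_frame(poses, index, horizon):
    """Express future positions in the vehicle frame at time step `index`.

    poses: list of (x, y, yaw) in a fixed world frame, e.g. from visual odometry.
    Returns up to `horizon` points (x forward, y left) relative to poses[index].
    """
    x0, y0, yaw0 = poses[index]
    cos_yaw, sin_yaw = math.cos(yaw0), math.sin(yaw0)
    label = []
    for x, y, _ in poses[index + 1 : index + 1 + horizon]:
        dx, dy = x - x0, y - y0
        # Rotate the world-frame offset by -yaw0 into the vehicle frame
        label.append((cos_yaw * dx + sin_yaw * dy,
                      -sin_yaw * dx + cos_yaw * dy))
    return label


# Toy odometry: driving straight "north" (yaw = 90 degrees), 1 m per step
poses = [(0.0, float(i), math.pi / 2) for i in range(6)]

# One label per logged frame: the path driven after that frame was recorded
labels = [trajectory_in_vehicle_frame(poses, i, horizon=3)
          for i in range(len(poses))]
print(labels[0])
```

In this toy example every label lies on the forward axis of the vehicle, as expected for straight driving. Frames near the end of the log get shorter labels, so in practice you would probably discard them.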
If you want to read some more about lane detection, I recommend the following resources:
[BS19] Karsten Behrendt and Ryan Soussan. Unsupervised labeled lane marker dataset generation using maps. In Proceedings of the IEEE International Conference on Computer Vision. 2019.
[BCD+20] Mariusz Bojarski, Chenyi Chen, Joyjit Daw, Alperen Değirmenci, Joya Deri, Bernhard Firner, Beat Flepp, Sachin Gogri, Jesse Hong, Lawrence Jackel, Zhenhua Jia, BJ Lee, Bo Liu, Fei Liu, Urs Muller, Samuel Payne, Nischal Kota Nagendra Prasad, Artem Provodin, John Roach, Timur Rvachov, Neha Tadimeti, Jesper van Engelen, Haiguang Wen, Eric Yang, and Zongyi Yang. The NVIDIA PilotNet experiments. 2020. arXiv:2010.08776.
[GBN+19] Wouter Van Gansbeke, Bert De Brabandere, Davy Neven, Marc Proesmans, and Luc Van Gool. End-to-end lane detection through differentiable least-squares fitting. 2019. arXiv:1902.00293.