Abstract
To promote safety and fully consider human drivers' acceptance, precise decision-making is realized for automated vehicles in the lane-change scenario in this paper. More specifically, automated vehicles not only decide whether to change lanes but also decide specific microcosmic behaviors, such as the lane-change time and the expected acceleration. Thus, precise lane-change decisions are described with three parameters and learned by reinforcement learning. The rationality of such parameter-based precise decisions is shown in two aspects. First, different values of the decision parameters notably influence the planned trajectory, which means these microcosmic behaviors introduce significant uncertainty when they are not precisely decided in the decision-making layer. Second, the analysis of real traffic data (NGSIM) reveals changeable lane-change times and expected accelerations in lane-change behaviors, which are seldom explicitly considered in the decision-making layer of current research. The decision parameters, including the lane-change time and the expected acceleration, are learned with kernel-based least-squares policy iteration (KLSPI). Safety, the current driver's willingness, and the average human driving style are considered in the reward function. Simulation results demonstrate that using reinforcement learning (RL) to learn the decision parameters can realize more precise decisions, promote safety performance, and imitate human drivers' behaviors in the lane-change scenario.
Automatic control will be fully realized from the decision-making layer to the planning layer in automated vehicles.
In previous research, learning-based methods have attracted great attention for learning human-like driving behaviors.
As the automated driving system is expected to ensure safety performance as well as respect human willingness, describing decisions with parameters that correspond to physical measurements helps to make more precise decisions.
The precise decision-making method based on parameter description in the lane-change scenario is investigated in this paper. The overall framework is shown in Fig. 1.

Fig.1 Diagram of framework
The trajectory planning controller is described in this section. It is designed with nonlinear model predictive control under the parameter decision framework, and the straight-road scenario considered is shown in Fig. 2.

Fig.2 Diagram of the straight-road scenario
The nonlinear motion control model is established as:
$$
\left\{\begin{aligned}
\dot X &= v_x\cos\psi - v_y\sin\psi\\
\dot Y &= v_x\sin\psi + v_y\cos\psi\\
\dot\psi &= \omega\\
\dot v_x &= a\\
\dot v_y &= \frac{F_{yf}\cos\delta + F_{yr}}{m} - v_x\omega\\
\dot\omega &= \frac{l_f F_{yf}\cos\delta - l_r F_{yr}}{I_z}
\end{aligned}\right.\tag{1}
$$
where: the vector of states is $\xi=[X,\ Y,\ \psi,\ v_x,\ v_y,\ \omega]^{\mathrm T}$; $X$ and $Y$ are the positions, $\psi$ is the heading angle, $v_x$ and $v_y$ are the longitudinal and lateral velocities, and $\omega$ is the yaw rate; the vector of control inputs is $u=[\delta,\ a]^{\mathrm T}$, where $\delta$ is the steering-wheel angle and $a$ is the longitudinal acceleration; simple longitudinal dynamics are considered to simplify the motion control model; $m$ is the mass of the vehicle; $I_z$ is the moment of inertia of the vehicle around the z-axis; $l_f$ and $l_r$ are the distances from the center of gravity (CoG) to the front and rear axles, respectively. The linear tire model is considered, and the tire slip angles of the front and rear wheels can be linearized under the small-slip-angle assumption; the lateral tire forces $F_{yf}$ and $F_{yr}$ on the front and rear tires are written as:
$$
\begin{aligned}
F_{yf} &= C_f\alpha_f = C_f\left(\delta - \frac{v_y + l_f\omega}{v_x}\right)\\
F_{yr} &= C_r\alpha_r = -C_r\,\frac{v_y - l_r\omega}{v_x}
\end{aligned}\tag{2}
$$
where: $C_f$ and $C_r$ are the cornering stiffness values of the front and rear tires, respectively.
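As a concrete illustration, the following Python sketch integrates the motion model of Eqs. (1) and (2) with a forward-Euler step; the vehicle parameter values are placeholders, not the ones used in this paper.

```python
import numpy as np

# Illustrative vehicle parameters (placeholders, not the values used in the paper)
M   = 1500.0    # vehicle mass m [kg]
I_Z = 2500.0    # yaw moment of inertia I_z [kg*m^2]
L_F = 1.2       # distance from the CoG to the front axle l_f [m]
L_R = 1.4       # distance from the CoG to the rear axle l_r [m]
C_F = 8.0e4     # front cornering stiffness C_f [N/rad]
C_R = 8.0e4     # rear cornering stiffness C_r [N/rad]

def bicycle_step(state, control, dt=0.05):
    """One forward-Euler step of the motion model, Eqs. (1)-(2).

    state   = [X, Y, psi, vx, vy, omega]   (vx > 0 is assumed)
    control = [delta, a]                   (steering-wheel angle, acceleration)
    """
    state = np.asarray(state, dtype=float)
    X, Y, psi, vx, vy, omega = state
    delta, a = control

    # Linear tire model under the small-slip-angle assumption, Eq. (2)
    alpha_f = delta - (vy + L_F * omega) / vx
    alpha_r = -(vy - L_R * omega) / vx
    F_yf = C_F * alpha_f
    F_yr = C_R * alpha_r

    # Nonlinear motion model, Eq. (1), with simple longitudinal dynamics
    dX     = vx * np.cos(psi) - vy * np.sin(psi)
    dY     = vx * np.sin(psi) + vy * np.cos(psi)
    dpsi   = omega
    dvx    = a
    dvy    = (F_yf * np.cos(delta) + F_yr) / M - vx * omega
    domega = (L_F * F_yf * np.cos(delta) - L_R * F_yr) / I_Z

    return state + dt * np.array([dX, dY, dpsi, dvx, dvy, domega])
```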
The trajectory planning problem in the straight lane-change scenario with given decision parameters is formulated after discretization as:
$$
\begin{aligned}
\min_{u_0,\dots,u_{N-1}}\ & \sum_{k=0}^{N-1}\left[(Y_k-Y_{\mathrm T})^{\mathrm T}Q\,(Y_k-Y_{\mathrm T})+u_k^{\mathrm T}R\,u_k\right]\\
\text{s.t.}\ & \xi_{k+1}=f(\xi_k,u_k),\quad k=0,\dots,N-1\\
& Y_N=Y_{\mathrm T},\quad \psi_N=0,\quad v_{y,N}=0,\quad \omega_N=0
\end{aligned}\tag{3}
$$
where: $Q$ and $R$ are the weighting matrices; $Y_{\mathrm T}$ is the lateral offset to the center of the target lane, which is different in lane-keeping and lane-change; $L$ is the distance between two neighboring lanes. The lane-change time $T$ decides the predictive horizon $N$. As illustrated before, the terminal equality constraint ensures that the lane-change behavior is finished within the predictive horizon without a reference trajectory.
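A minimal direct-shooting version of problem (3) can be sketched as below, reusing the bicycle_step function above. scipy's SLSQP handles the terminal equality constraint; the weights, discretization step, and lane width are assumed values rather than the paper's settings, so this is only a sketch of how the decision parameters enter the optimization.

```python
import numpy as np
from scipy.optimize import minimize

DT = 0.1        # discretization step [s] (assumed)
LANE_W = 3.75   # distance L between neighboring lanes [m] (assumed)

def plan_lane_change(x0, y_target, T, a_exp, q_y=1.0, r_delta=10.0):
    """Sketch of problem (3): find a steering sequence that reaches the target
    lateral offset within the lane-change time T while the longitudinal motion
    follows the expected acceleration a_exp."""
    N = int(round(T / DT))    # the lane-change time decides the predictive horizon

    def rollout(deltas):
        states = [np.asarray(x0, dtype=float)]
        for d in deltas:
            states.append(bicycle_step(states[-1], [d, a_exp], DT))
        return np.array(states)

    def cost(deltas):
        traj = rollout(deltas)
        return q_y * np.sum((traj[1:, 1] - y_target) ** 2) + r_delta * np.sum(deltas ** 2)

    def terminal(deltas):
        # terminal equality: end on the target lane center with settled lateral motion
        xf = rollout(deltas)[-1]
        return np.array([xf[1] - y_target, xf[2], xf[4], xf[5]])

    sol = minimize(cost, np.zeros(N), method="SLSQP",
                   constraints=[{"type": "eq", "fun": terminal}],
                   options={"maxiter": 200})
    return rollout(sol.x)

# Example: left lane-change (Y_T = +L) in moderate mode at 15 m/s, no acceleration
# traj = plan_lane_change([0.0, 0.0, 0.0, 15.0, 0.0, 0.0], LANE_W, T=4.0, a_exp=0.0)
```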
Decisions in the lane-change scenario can be precisely described as lane-keeping, or lane-change to the right/left lane in fast/moderate/mild mode with acceleration/maintained velocity/deceleration. Examples represented with decision parameters are listed in the following table.
Description | $Y_{\mathrm T}$/m | $T$/s |
---|---|---|
Lane keeping | 0 | 1 |
Right lane-change in fast mode | -L | 3 |
Right lane-change in mild mode | -L | 5 |
Left lane-change in moderate mode | L | 4 |
The expected acceleration is not listed; its value decides whether the vehicle accelerates, maintains velocity, or decelerates.
Then, the influence of the decision parameters on the planned trajectory is shown through parallel simulations. In the parallel simulations, the left lane-change ($Y_{\mathrm T}=L$) is implemented with changeable lane-change time and expected acceleration at two initial longitudinal velocities, which involves different trajectory optimization problems.
The trajectories with different action durations, expected accelerations, and initial longitudinal velocities are shown in Fig. 3.

Fig.3 Trajectories at different values of decision parameters
As shown in Fig. 3, different values of the decision parameters notably change the planned trajectory, which confirms that these microcosmic behaviors should be decided explicitly in the decision-making layer.
Firstly, the parameter-based decisions and the corresponding optimized trajectories are compared with real traffic trajectories from NGSIM to verify the rationality of parameter-based decision-making. Three typical examples are shown in Fig. 4.

Fig.4 Comparison of actual trajectories from NGSIM (R in legend) and trajectories optimized with the trajectory planning controller (C in legend)
Meanwhile, the ranges of these decision parameters are analyzed with this real traffic dataset. The lane-change trajectories from NGSIM are selected and labeled with the method in reference [
Parameter | Minimum | Maximum | Mean |
---|---|---|---|
Action duration $T$/s | 2.5 | 7.5 | 4.5 |
Expected acceleration $a_{\mathrm{exp}}$/(m·s$^{-2}$) | -1.1 | 2.5 | 0.2 |
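These ranges can be obtained with a simple extraction routine; a rough sketch is given below, where each labeled lane-change is assumed to be resampled as arrays of time, lateral position relative to the original lane center, and longitudinal velocity. The variable names and the threshold heuristic are assumptions, not the labeling method cited above.

```python
import numpy as np

def lane_change_parameters(t, y_lat, v_long, lane_width=3.75, frac=0.1):
    """Rough extraction of the action duration T and the average acceleration
    from one labeled lane-change trajectory.

    t      : time stamps [s] (numpy array)
    y_lat  : lateral position relative to the original lane center [m]
    v_long : longitudinal velocity [m/s]
    """
    # lane-change start/end: the lateral offset leaves zero and settles near
    # the target lane center (simple threshold heuristic)
    moving  = np.abs(y_lat) > frac * lane_width
    settled = np.abs(np.abs(y_lat) - lane_width) < frac * lane_width
    t_start = t[np.argmax(moving)]
    t_end   = t[len(t) - 1 - np.argmax(settled[::-1])]
    T = t_end - t_start

    # average acceleration over the manoeuvre
    mask = (t >= t_start) & (t <= t_end)
    a_mean = (v_long[mask][-1] - v_long[mask][0]) / max(T, 1e-6)
    return T, a_mean
```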
In free traffic flow, there are eight potential surrounding positions around the host vehicle, which are numbered as shown in Fig. 5.

Fig.5 Diagram of host vehicle and its surrounding vehicles
(Table: ranges of the eight surrounding positions relative to the host vehicle)
Here, the distance in the lane direction and the time headway between the host vehicle and the surrounding vehicle in position i, together with the velocity of the host vehicle, define these ranges.
In these driving scenarios, the relative distances between the host vehicle and the surrounding vehicles on its current lane and on its target lane are calculated while the host vehicle is changing lanes. Fifteen driving scenarios in which the minimal distance falls below 4 m are regarded as potential emergency driving situations, which can be improved with a better decision while the driver's intention is still fully respected. Thus, the start time step of the lane-change, the action duration of the lane-change, and the average acceleration are recorded in these driving scenarios. The learning is carried out on these driving scenarios to obtain better safety situations without violating the driver's willingness. The driver's willingness includes the intention of changing lanes, the action duration, and the average acceleration of the lane-change.
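A minimal sketch of this scenario selection is given below, assuming each scenario provides the host and surrounding vehicle trajectories as position arrays over the lane-change interval; the function names are illustrative, and the 4 m value is the threshold mentioned above.

```python
import numpy as np

def min_gap_during_lane_change(host_xy, surrounding_xy_list):
    """Minimal distance between the host vehicle and any surrounding vehicle on
    the current/target lane over the lane-change interval.

    host_xy             : (N, 2) array of host positions
    surrounding_xy_list : list of (N, 2) arrays, one per surrounding vehicle
    """
    gaps = [np.linalg.norm(host_xy - s_xy, axis=1).min() for s_xy in surrounding_xy_list]
    return min(gaps) if gaps else np.inf

def is_potential_emergency(host_xy, surrounding_xy_list, threshold=4.0):
    """Flag a scenario whose minimal gap falls below the 4 m threshold."""
    return min_gap_during_lane_change(host_xy, surrounding_xy_list) < threshold
```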
The RL-based parameter lane-change decision-making problem is established, and the kernel-based least-squares policy iteration (KLSPI) algorithm proposed in reference [16] is adopted to learn the decision parameters.
The decision process for driving is modeled as a Markov decision process (MDP), which contains the design of the state space, the action space, and the reward function. The trajectory planning controller changes the state of the host vehicle with the action selected in the decision layer.
In the design of the state, to depict each vehicle in the potential positions 1 to 8, the relative velocity $\Delta v_i$, the acceleration $a_i$, the relative distance in the lane direction $\Delta d_i$, and the intention $I_i$ of the surrounding vehicle are considered. The intention of the surrounding vehicles is calculated using the method from reference [ ]. The state vector is expressed as:
$$
s=\left[\Delta v_1,\ a_1,\ \Delta d_1,\ I_1,\ \dots,\ \Delta v_8,\ a_8,\ \Delta d_8,\ I_8\right]^{\mathrm T}\tag{4}
$$
In the parameter-based decision framework, the decision is described with three parameters: the lateral offset, the lane-change time, and the expected acceleration. The action vector is expressed as:
$$
\boldsymbol a=\left[Y_{\mathrm T},\ T,\ a_{\mathrm{exp}}\right]^{\mathrm T}\tag{5}
$$
where: $Y_{\mathrm T}\in\{-L,\ 0,\ L\}$ is the target lateral offset, and $L$ is the distance between two neighboring lanes; $T$ is the lane-change time, or the lane-keeping time when no lane-change is executed; $a_{\mathrm{exp}}$ is the expected acceleration.
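For illustration, the state of Eq. (4) and the action of Eq. (5) could be assembled as follows; the feature ordering, the placeholder for empty positions, and the lane width value are assumptions.

```python
import numpy as np

LANE_W = 3.75   # distance L between neighboring lanes [m] (assumed)

def build_state(surrounding):
    """State vector of Eq. (4): for each of the eight potential positions, stack
    the relative velocity, acceleration, relative distance along the lane, and
    the estimated lane-change intention of that vehicle.

    `surrounding` maps position id (1-8) to (dv, a, dd, intention); empty
    positions are filled with a "far away" placeholder."""
    default = (0.0, 0.0, 100.0, 0.0)
    feats = []
    for pos in range(1, 9):
        feats.extend(surrounding.get(pos, default))
    return np.array(feats)

def build_action(direction, T, a_exp):
    """Action vector of Eq. (5): lateral offset Y_T in {-L, 0, +L}, lane-change
    (or lane-keeping) time T, and expected acceleration a_exp."""
    y_t = {"left": LANE_W, "keep": 0.0, "right": -LANE_W}[direction]
    return np.array([y_t, T, a_exp])
```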
The reward function is designed with consideration of safety $r_{\mathrm s}$, the intention of the driver $r_{\mathrm i}$, and the consistency with all drivers $r_{\mathrm c}$, which can be expressed as:
$$
r=r_{\mathrm s}+r_{\mathrm i}+r_{\mathrm c}\tag{6}
$$
The safety reward $r_{\mathrm s}$ evaluates the safety situation of the action taken compared with the original situation in NGSIM, the latter representing the driver's willingness considered in this paper. The safety reward is:
(7)
The relative distances between the host vehicle and the surrounding vehicles on its current lane and on its target lane are calculated during the lane-change process. Assuming there are several surrounding vehicles, the incremental equations for the action taken and for the original situation in NGSIM are calculated in the same way and expressed as:
(8)
where the emergency distance and the collision distance are used as thresholds. The rewards for the intention of the driver $r_{\mathrm i}$ and the consistency with all drivers $r_{\mathrm c}$ can be expressed as:
(9)
(10)
Here, the lane-change time of the current driver and the average lane-change time of all drivers, as well as the average acceleration of the current driver and the average acceleration of all drivers, serve as the reference values in Eqs. (9) and (10), respectively.
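Since the exact forms of Eqs. (7) to (10) are not reproduced here, the sketch below only illustrates the three reward components conceptually: a safety term that rewards enlarging the minimal gap relative to the recorded NGSIM behavior, and two terms that penalize deviation from the current driver's and the average driver's lane-change time and acceleration. The thresholds and functional forms are assumptions, not the paper's equations.

```python
import numpy as np

def safety_reward(min_dist_action, min_dist_ngsim, d_emergency=4.0, d_collision=1.0):
    """Illustrative safety term: reward improving the minimal distance to the
    surrounding vehicles compared with the original NGSIM behavior
    (thresholds and clipping are assumptions, not Eqs. (7)-(8))."""
    def margin(d):
        return np.clip((d - d_collision) / (d_emergency - d_collision), 0.0, 1.0)
    return margin(min_dist_action) - margin(min_dist_ngsim)

def intention_reward(T, a_exp, T_driver, a_driver):
    """Illustrative driver-intention term: penalize deviation from the current
    driver's lane-change time and average acceleration (not Eq. (9))."""
    return -abs(T - T_driver) - abs(a_exp - a_driver)

def consistency_reward(T, a_exp, T_avg=4.5, a_avg=0.2):
    """Illustrative consistency term: penalize deviation from the average human
    driving style; the averages are taken from the NGSIM statistics above."""
    return -abs(T - T_avg) - abs(a_exp - a_avg)

def total_reward(r_s, r_i, r_c):
    """Sum of the three components, cf. Eq. (6)."""
    return r_s + r_i + r_c
```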
The reinforcement learning algorithm KLSPI proposed in reference [16] is adopted, and the learning procedure is summarized as Algorithm 1.
Algorithm 1 KLSPI for lane-change decision-making:
1) Collect the sample set $\{(s_k,\boldsymbol a_k,r_k,s'_k)\}_{k=1}^{n}$ with a random policy and the trajectory planning controller.
2) Sparsification: initialize an empty dictionary Dic = {}.
For k = 1 to n, do
Assuming that the current dictionary has t features, for the feature $m(k)=[s_k^{\mathrm T},\ \boldsymbol a_k^{\mathrm T}]^{\mathrm T}$ calculate
$$\delta_k=k\big(m(k),m(k)\big)-\boldsymbol k_t\big(m(k)\big)^{\mathrm T}\boldsymbol K_t^{-1}\boldsymbol k_t\big(m(k)\big)\tag{11}$$
where: $\boldsymbol K_t$ is the $t\times t$ kernel matrix of the features $m_1,\dots,m_t$ already in the dictionary, and $\boldsymbol k_t\big(m(k)\big)=\big[k(m_1,m(k)),\dots,k(m_t,m(k))\big]^{\mathrm T}$.
If $\delta_k>\mu$, add $m(k)$ to the dictionary Dic,
else continue
end if
end for
3) Policy iteration: randomly initialize the weight vector $\boldsymbol w_0$.
Loop for j = 1 to the maximum iteration
For k = 1 to n
Compute the greedy action of the current policy, $\boldsymbol a'_k=\arg\max_{\boldsymbol a}\,\boldsymbol w_{j-1}^{\mathrm T}\boldsymbol h(s'_k,\boldsymbol a)$
end for
Update the weight vector with the least-squares temporal-difference solution
$$\boldsymbol w_j=\Big[\sum_{k=1}^{n}\boldsymbol h(s_k,\boldsymbol a_k)\big(\boldsymbol h(s_k,\boldsymbol a_k)-\gamma\,\boldsymbol h(s'_k,\boldsymbol a'_k)\big)^{\mathrm T}\Big]^{-1}\sum_{k=1}^{n}\boldsymbol h(s_k,\boldsymbol a_k)\,r_k\tag{12}$$
where: $\boldsymbol h(\cdot,\cdot)$ is the activation vector defined in Eq. (16), and $\gamma$ is the discount factor.
Until $\|\boldsymbol w_j-\boldsymbol w_{j-1}\|\le\varepsilon$ or the maximum iteration is reached
4) Output the learned weight vector and the corresponding greedy policy.
In KLSPI, first, all training samples are collected. Secondly, the sparsification procedure is carried out to obtain the features in the sample set that are not evidently linearly correlated with each other, which form a dictionary for function approximation; the linear correlation is calculated with Eq. (11). Finally, policy iteration is performed with Eq. (12) until the weight vector converges.
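These two steps can be sketched in Python as follows. This is a simplified version under several assumptions: the kernel function is passed in as a callable, the candidate action set is finite, and small regularization terms are added for numerical stability; it follows the structure of Algorithm 1 rather than reproducing the exact implementation of reference [16].

```python
import numpy as np

def ald_sparsify(features_x, kernel, mu=0.06):
    """Step 2 of Algorithm 1: approximate-linear-dependence sparsification,
    Eq. (11).  `features_x` are the combined state-action vectors."""
    dic = []
    for x in features_x:
        if not dic:
            dic.append(x)
            continue
        K = np.array([[kernel(a, b) for b in dic] for a in dic])
        k_vec = np.array([kernel(d, x) for d in dic])
        coeff = np.linalg.solve(K + 1e-8 * np.eye(len(dic)), k_vec)
        delta = kernel(x, x) - k_vec @ coeff
        if delta > mu:           # not yet well represented by the dictionary
            dic.append(x)
    return dic

def activation(dic, kernel, s, a):
    """Activation vector h of Eq. (16) for a state-action pair."""
    x = np.concatenate([np.asarray(s), np.asarray(a)])
    return np.array([kernel(d, x) for d in dic])

def klspi(samples, candidate_actions, dic, kernel, gamma=0.9, n_iter=20, tol=1e-4):
    """Step 3 of Algorithm 1: kernel least-squares policy iteration, Eq. (12).
    `samples` is a list of (s, a, r, s_next) tuples."""
    t = len(dic)
    w = np.zeros(t)
    for _ in range(n_iter):
        A = 1e-6 * np.eye(t)     # small regularization for invertibility
        b = np.zeros(t)
        for s, a, r, s_next in samples:
            h = activation(dic, kernel, s, a)
            # greedy action of the current policy in the successor state
            a_next = max(candidate_actions,
                         key=lambda u: w @ activation(dic, kernel, s_next, u))
            h_next = activation(dic, kernel, s_next, a_next)
            A += np.outer(h, h - gamma * h_next)
            b += r * h
        w_new = np.linalg.solve(A, b)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w   # Q(s, a) is approximated by w @ activation(dic, kernel, s, a)
```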
The radial basis function (RBF) network is used as the function approximator of the state-action value function, which can be expressed as:
$$
Q(s,\boldsymbol a)=\boldsymbol w^{\mathrm T}\boldsymbol h=\sum_{i=1}^{t}w_i\,k(x,x_i)\tag{14}
$$
A customized feature representation combining the state vector and the action vector is used, which can be expressed as:
$$
x=\left[s^{\mathrm T},\ \boldsymbol a^{\mathrm T}\right]^{\mathrm T}\tag{15}
$$
Here $\boldsymbol h$ is the activation vector and can be expressed as:
$$
\boldsymbol h=\left[k(x,x_1),\ k(x,x_2),\ \dots,\ k(x,x_t)\right]^{\mathrm T}\tag{16}
$$
where $x_1,\dots,x_t$ are the dictionary features. The weight vector $\boldsymbol\lambda$ is manually set to normalize the elements of the feature vector, which have different ranges, and to weight the state and the action differently. The kernel function can be described as:
$$
k(x_i,x_j)=\exp\!\left(-\frac{\left\|\boldsymbol\lambda\odot(x_i-x_j)\right\|^{2}}{2\sigma^{2}}\right)\tag{17}
$$
where $\sigma$ is the kernel width and $\odot$ denotes element-wise multiplication.
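A possible form of this weighted kernel is sketched below; the feature dimension and the weight values are placeholders, and the function can be passed as the `kernel` argument to the KLSPI sketch above.

```python
import numpy as np

# Manually set normalization weights: one entry per feature dimension,
# e.g. 8 surrounding positions x 4 features + 3 action parameters = 35
# (placeholder values; larger entries weight a dimension more heavily).
LAMBDA = np.ones(35)
SIGMA = 1.0   # kernel width (placeholder)

def weighted_rbf_kernel(x_i, x_j, lam=LAMBDA, sigma=SIGMA):
    """Weighted Gaussian (RBF) kernel in the spirit of Eq. (17)."""
    d = lam * (np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))
```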
The driving scenarios in NGSIM are obtained and divided into the training set, the test set, and the cross-validation set. Sample sets are generated from the training set. The learning is implemented as illustrated in Sections 2 and 3, and the influential parameters are decided. Finally, simulation results on the test set and the cross-validation set verify the improved performance.
In NGSIM, 254 driving scenarios in which the host vehicle executes a lane-change are selected and randomly divided into the training set, the test set, and the cross-validation set, which contain 214, 30, and 10 driving scenarios, respectively. In each training scenario, the time step at which the host vehicle changes lanes is found, and three decision time steps before and three after this time step, with a time interval of 0.2 s, are also considered. These time steps, rather than the whole time horizon, are used for sampling because the behavior of the other vehicles can be assumed to remain unchanged in such a short interval. Both lane-change and lane-keeping decisions are simulated at these decision points to collect sample sets in the training scenarios. Finally, 10 327 samples are obtained.
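The data preparation described above can be sketched as follows; the scenario container and the time representation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed only for reproducibility of the sketch

def split_scenarios(scenarios):
    """Randomly split the 254 lane-change scenarios into the training set (214),
    the test set (30), and the cross-validation set (10), as described above."""
    idx = rng.permutation(len(scenarios))
    train = [scenarios[i] for i in idx[:214]]
    test = [scenarios[i] for i in idx[214:244]]
    cross_val = [scenarios[i] for i in idx[244:254]]
    return train, test, cross_val

def decision_time_steps(t_lane_change, dt=0.2, n_side=3):
    """Decision points of a training scenario: the lane-change time step plus
    three steps before and three after it, spaced by 0.2 s."""
    return [t_lane_change + k * dt for k in range(-n_side, n_side + 1)]
```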
In the learning process, the threshold parameter $\mu$ is compared with the linear correlation between a candidate feature and the features in the current dictionary to decide whether the feature is added to the dictionary, which is then used to approximate the action-state value function. Thus, the threshold parameter influences the dimension of the dictionary and the function approximation. The dimension of the dictionary is 572, 170, and 91 when $\mu$ = 0.01, 0.06, and 0.11, respectively. The average estimated errors in the training set and the test set are shown in Fig. 6.

Fig.6 Average estimated error in the training set and test set
The error in each sample is calculated using
First, an emergency scenario in the test set, in which the minimal distance between the host vehicle and the surrounding vehicles during the whole lane-change process is only 2.5 m, is simulated. The performance is shown in Fig. 7.

Fig.7 Results for an emergency scenario in test set
In the cross-validation set, a common scenario is simulated, and the performance is shown in Fig. 8.

Fig.8 Results for a common scenario in cross-validation set
Microcosmic behaviors in precise decision-making, such as the lane-change time and the expected acceleration, not only influence the trajectory but also differ among drivers; they appear in NGSIM but are seldom considered in the decision layer. As emergency driving scenarios exist in NGSIM, a learning-based parameter decision-making method for automated vehicles has been investigated to learn precise decision-making that balances safety and human driving willingness. The lane-change time and the expected acceleration are added to the action space. Safety, the current driver's willingness, and the average human driving style are considered in the reward function. After training by KLSPI with driving scenarios in NGSIM, precise decision-making is realized. Safety performance is promoted in an emergency lane-change scenario of the test set, which indicates better-performing behavior while still respecting the human driver's willingness. Safety performance is maintained in the cross-validation set, which verifies the generalization of the learning results.
In the future, other deep reinforcement learning methods will be explored in more complex and changeable driving scenarios.
References
[1] PADEN B, ČÁP M, YONG S Z, et al. A survey of motion planning and control techniques for self-driving urban vehicles[J]. IEEE Transactions on Intelligent Vehicles, 2016, 1(1): 33.
[2] LI X H, SUN Z P, CAO D P, et al. Real-time trajectory planning for autonomous urban driving: framework, algorithms, and verifications[J]. IEEE/ASME Transactions on Mechatronics, 2015, 21(2): 740.
[3] GUO C Z, KIDONO K, TERASHIMA R, et al. Toward human-like behavior generation in urban environment based on Markov decision process with hybrid potential maps[C]// 2018 IEEE Intelligent Vehicles Symposium (IV). Changshu: IEEE, 2018: 2209.
[4] CHU H Q, GUO L L, YAN Y J, et al. Self-learning optimal cruise control based on individual car-following style[J]. IEEE Transactions on Intelligent Transportation Systems, 2020, 99: 1.
[5] GINDELE T, BRECHTEL S, DILLMANN R, et al. Learning driver behavior models from traffic observations for decision making and planning[J]. IEEE Intelligent Transportation Systems Magazine, 2015, 7(1): 69.
[6] MARTINEZ C M, HEUCKE M, WANG F Y, et al. Driving style recognition for intelligent vehicle control and advanced driver assistance: a survey[J]. IEEE Transactions on Intelligent Transportation Systems, 2018, 19(3): 666.
[7] GONZÁLEZ D, PÉREZ J, MILANÉS V, et al. A review of motion planning techniques for automated vehicles[J]. IEEE Transactions on Intelligent Transportation Systems, 2016, 17(4): 1135.
[8] VALLON C, ERCAN Z, CARVALHO A, et al. A machine learning approach for personalized autonomous lane change initiation and control[C]// 2017 IEEE Intelligent Vehicles Symposium (IV). Los Angeles: IEEE, 2017: 1590.
[9] HE G L, LI X, LYU Y, et al. Probabilistic intention prediction and trajectory generation based on dynamic Bayesian networks[C]// 2019 Chinese Automation Congress (CAC). Hangzhou: IEEE, 2019: 2646.
[10] TAN Y V, ELLIOTT M R, FLANNAGAN C A C, et al. Development of a real-time prediction model of driver behavior at intersections using kinematic time series data[J]. Accident Analysis & Prevention, 2017, 106: 428.
[11] YOU C X, LU J B, FILEV D, et al. Highway traffic modeling and decision making for autonomous vehicle using reinforcement learning[C]// 2018 IEEE Intelligent Vehicles Symposium (IV). Changshu: IEEE, 2018: 1227.
[12] SHALEV-SHWARTZ S, SHAMMAH S, SHASHUA A, et al. On a formal model of safe and scalable self-driving cars[DB/OL]. arXiv: 1708.06374, 2017. https://doi.org/10.48550/arXiv.1708.06374.
[13] ZHANG Y X, GAO B Z, GUO L L, et al. Adaptive decision-making for automated vehicles under roundabout scenarios using optimization embedded reinforcement learning[J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 99: 1.
[14] ARIKERE A, YANG D, KLOMP M, et al. Integrated evasive manoeuvre assist for collision mitigation with oncoming vehicles[J]. Vehicle System Dynamics, 2018, 56(10): 1.
[15] ZHANG Y X, GAO B Z, GUO L L, et al. A novel trajectory planning method for automated vehicles under parameter decision framework[J]. IEEE Access, 2019, 7: 88264.
[16] XU X, HU D W, LU X C, et al. Kernel-based least squares policy iteration for reinforcement learning[J]. IEEE Transactions on Neural Networks, 2007, 18(4): 973.