Enhanced Autonomous Navigation of Robots by Deep Reinforcement Learning Algorithm with Multistep Method

In this paper, we propose a new method to improve the autonomous navigation of mobile robots. The new method combines a multistep update method with a double deep Q-network (MS-DDQN) to realize reinforcement learning (RL) to enhance the navigation ability of mobile robots. The proposed MS-DDQN gives two types of rewards for taking actions: terminal and non-terminal rewards. These rewards are subdivided into several different ones, including rewards for arrival and collision (terminal rewards) and rewards for distance, orientation, and danger (non-terminal rewards). With the subdivided reward system, the new method trains mobile robots more effectively to increase their autonomous navigation ability. In the experimental process of this study, a laser range finder was used as the sensor for the mobile robot to perceive the distance to obstacles. Experimental results validated the new method's enhanced ability, showing higher success rates (97% on average) than those of other methods such as the double deep Q-network (DDQN), prioritized deep Q-network (DQN), and prioritized DDQN. The higher success rates originated from the sophisticated reward system, as the total reward of the proposed method was 7–94% higher than those of the other methods in simulations in five different environments. The learning speed was also improved, reducing the learning time, as the new method completed the learning in fewer episodes. The results of the new model suggest that MS-DDQN enables mobile robots to have higher learning efficiency and generalization ability than conventional deep reinforcement learning (DRL)-based methods and allows them to navigate autonomously even in unknown complex environments.


Introduction
Autonomous navigation ability is essential for mobile robots. To improve the navigation ability, many navigation planning methods have been proposed for mobile robots that work effectively in different environments. (1,2) A representative method is the potential field method (PFM), (3) which has been widely used owing to its simplicity, high safety, and fast computation. However, PFM has the following disadvantages: a robot is likely to fall into local minima, oscillate in narrow passages, and wander in the presence of obstacles and between closely spaced obstacles. (4) To overcome these disadvantages, Borenstein and Koren proposed the vector field histogram (VFH) algorithm, which finds the best path of a robot in a locally established polar histogram. (5) VFH* and VFH+ have also been developed as modifications of VFH. (6,7) Simultaneous localization and mapping (SLAM) is another navigation algorithm. SLAM constructs a map with complete information, (8) which enables autonomous navigation through continuous construction and updating of the map. SLAM algorithms mainly use an extended Kalman filter (9) or a particle filter (10) to collect information about the working environment of mobile robots. The information about the working environment can be supplemented using cameras in a SLAM algorithm. (11,12) Although SLAM performs well in various navigation tasks, it still requires a huge amount of memory and computation. Therefore, most SLAM algorithms are applied in static environments.
In many cases, mobile robots are required to work in environments without prior information. Thus, traditional navigation algorithms have limited applicability to mobile robots. Recently, the development of artificial intelligence (AI) has allowed supervised and self-supervised learning methods to be applied to mobile robots, even for navigation planning. For example, Chen et al. used a deep neural network to train mobile robots for autonomous navigation in crowded environments, (13) and Pfeiffer et al. proposed an end-to-end autonomous navigation model based on a deep convolutional neural network (CNN). (14) The CNN-based model used two-dimensional (2D) laser scanner data and the relative positions of targets to execute steering commands. Tai and Liu also used a CNN to train an obstacle avoidance model in corridors for the end-to-end control of mobile robots. In this case, the CNN took raw depth images as input and output discrete commands to control the robots. (15) These deep-learning navigation algorithms learn strategies using raw information in multiple dimensions. However, they also have shortcomings for real applications. For example, the performance of a robot with such an algorithm depends on training data sets that are collected manually.
Thus, reinforcement learning (RL) algorithms are widely used. RL enables mobile robots to learn directly from interaction with the environment. However, it is challenging for mobile robots to learn a good strategy in a large state space. In addition, although RL guides and trains mobile robots to learn navigation strategies in unknown environments, the strategy is effective only for a fixed time period in those environments. This limits the application of RL in a complex environment with multiple dimensions. Thus, many new technologies and architectures of RL have been developed to improve the efficiency and performance in various tasks, examples of which are hierarchical RL (HRL) and deep RL (DRL). (16-19) HRL decomposes a complex main task into several subtasks. Then, through the strategy of 'divide and conquer', the subtasks are completed one by one to fulfill the main task. DRL combines the advantages of deep learning and RL, achieving remarkable successes in many challenging tasks. DRL uses a CNN to process multidimensional raw information and approximate the value function of RL. DRL is divided into temporal-difference (TD) learning and Monte Carlo (MC) learning. In TD, a mobile robot updates the Q table of states every time it interacts with the environment. TD includes the deep Q-network (DQN), (20) double deep Q-network (DDQN), (21) and prioritized DQN. (22) DQN uses a CNN to directly process the multidimensional information of raw images and approximate the best value function. (23) Although TD has a high convergence speed and high learning efficiency, it is less stable than other learning algorithms and sometimes converges to an incorrect solution. In MC, a mobile robot creates the Q learning table of states after it has completed one episode of interaction with the environment. The policy gradient algorithm is an example of MC. (24) As MC needs to first complete an episode of interaction with the environment and then update the Q table, its convergence speed and learning efficiency are relatively low.
Most DRL algorithms use TD to perform the navigation planning of mobile robots. For example, Tai et al. proposed a DRL network trained by an asynchronous DRL algorithm that took 10-dimensional laser-measured distances as input and output control commands. (25) Vinyals et al. designed an A3C network for rescue missions of a mobile robot. (26) Al-Nima et al. used human driving data as the input to a DRL network to train the autopilot ability of a vehicle. (27) TD learns rapidly in general, but it only calculates the reward from the next state, which restricts the ability of the mobile robot to perceive obstacles. MC improves a mobile robot's ability to perceive obstacles, but its learning is still slow.
Therefore, we use a DRL algorithm to solve the autonomous navigation task of a mobile robot in an unknown environment. This method is a machine learning algorithm that combines the advantages of RL and deep learning algorithms. A mobile robot based on this method can directly process the original high-dimensional input information and can also learn by directly interacting with the environment. To overcome the shortcomings of applying traditional DRL directly to the autonomous tasks of a mobile robot and to improve the ability of the mobile robot to quickly perceive obstacles, we propose a DRL algorithm using a multistep update method and a combined reward function. On the basis of the method and the reward function, we use hierarchical DRL, where each neural network in the hierarchy has a clear learning goal. The multistep method is different from TD and MC. Without reducing the training efficiency, the multistep method predicts the impact of multiple future states to obtain the reward in the current state. DRL with the multistep method is expected to enhance the navigation ability of mobile robots by improving their ability to sense and avoid obstacles in advance. The multistep method also ensures that mobile robots have high training and learning efficiencies.
In this study, we simulated the obstacle environment and a laser range finder (LRF), and the mobile robot measured the distance to obstacles through the simulated LRF. Only 36-dimensional distance information in the forward direction of the mobile robot was used as the input of the deep neural network, and the deep neural network then output control instructions to control the mobile robot. To validate the proposed method, we carried out simulations and compared the results with those of other methods. It is expected that an autonomous navigation algorithm based on the results of this study will improve the autonomous navigation ability of mobile robots in unknown and complex environments.

1 RL
The navigation problem of a mobile robot in an unknown environment can be expressed as an RL problem in which the mobile robot interacts with the environment E in fixed time steps. At each time step t, the mobile robot obtains the state information s_t ∈ S of the environment through sensors, where S is the state space. The mobile robot selects an action a_t from the set of all possible actions A according to the acquired state information. After the action is completed, the mobile robot transitions to the next environment state s_{t+1} and receives a reward r_t. In this process, the state changes according to the state transition probability P(s_{t+1} | s_t, a_t), which defines the probability that the robot, after taking action a_t in state s_t, transfers to state s_{t+1}. In the traditional RL algorithm, the return function G_t is defined as the discounted sum of the reward obtained in state s_t and all rewards in future states, G_t = r_t + γr_{t+1} + γ^2 r_{t+2} + ..., where γ ∈ (0, 1) is a discount factor defining the impact of future rewards on the current state. The goal of the mobile robot is to find the best strategy π, a mapping from the state set S to the action set A, that determines the action to take in the current state s_t so as to maximize the future rewards. Under a given strategy π, the state-action value function Q^π(s, a) is the expected return of the robot when it takes action a in state s, with the expectation operator E_π(·):
Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]. Depending on the update method of the state-action value table, the RL algorithm is divided into the MC and TD methods. The MC method updates the state-action value table as Q(s_t, a_t) ← Q(s_t, a_t) + α[(r_t + γr_{t+1} + γ^2 r_{t+2} + ...) − Q(s_t, a_t)], where α ∈ (0, 1) is the learning rate and the polynomial r_t + γr_{t+1} + γ^2 r_{t+2} + ... is equal to the return function G_t. Thus, it can be replaced by G_t, and the update becomes Q(s_t, a_t) ← Q(s_t, a_t) + α[G_t − Q(s_t, a_t)].
The MC method requires the mobile robot to complete an episode of interaction with the environment before updating the state-action value table. If the mobile robot takes a long time to finish an episode of training, the update of the state-action value table becomes slow. Therefore, the training and learning efficiencies of the MC method are relatively low.
Unlike the MC method, the TD method considers only the reward obtained in the next state in addition to the current state's reward. The standard Q-learning algorithm is an example of a TD method. (29,30) The TD method updates the state-action value table as Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where α ∈ (0, 1) is the learning rate.
The return function of the TD method is defined as G_t^{(1)} = r_t + γ max_a Q(s_{t+1}, a), which approximates the return function G_t: the rest of G_t after time step t, γr_{t+1} + γ^2 r_{t+2} + ..., is replaced by γ max_a Q(s_{t+1}, a). The TD method does not need to complete one episode of interaction with the environment before updating the state-action value table; it updates the table at every step of training the robot. Therefore, the TD method has higher training and learning efficiencies. However, as the TD method only considers the return of the next state, it does not predict the future returns, so it is relatively short-sighted.
The multistep method is an RL algorithm that differs from both the TD and MC methods. (1,28) In some tasks, the multistep method improves the performance of the TD algorithm and the learning efficiency of the MC algorithm by considering the rewards of the next n states. The multistep method is very similar to the TD method, and the only difference is the return function. The return function of the multistep method is defined as G_t^{(n)} = r_t + γr_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q(s_{t+n}, a); it truncates the traditional return function G_t after n steps, and the remaining terms are replaced by γ^n max_a Q(s_{t+n}, a). The multistep method enables the mobile robot to overcome the limitation of a single time step and focus on the reward of a longer time interval. Therefore, the multistep method updates the state-action value table as Q(s_t, a_t) ← Q(s_t, a_t) + α[G_t^{(n)} − Q(s_t, a_t)].
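To make the differences among the three return estimates concrete, the short sketch below (ours, not from the paper) computes the MC return, the TD target, and the n-step target for a toy reward sequence; the reward values and the placeholder bootstrap estimates q_next and q_after_n are purely illustrative.

```python
def mc_return(rewards, gamma=0.9):
    """Monte Carlo return: discounted sum of all rewards until the episode ends."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def td_target(rewards, q_next, gamma=0.9):
    """TD(0) target: immediate reward plus a discounted bootstrap estimate of the next state."""
    return rewards[0] + gamma * q_next

def n_step_target(rewards, q_after_n, n, gamma=0.9):
    """Multistep target: n real rewards, then a bootstrap estimate of the state reached after n steps."""
    partial = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return partial + gamma**n * q_after_n

# Toy episode: rewards observed along one trajectory; 0.5 is an illustrative bootstrap Q estimate.
rewards = [0.1, 0.1, -0.2, 0.3, 1.0]
print(mc_return(rewards))                          # uses all five rewards
print(td_target(rewards, q_next=0.5))              # uses only the first reward
print(n_step_target(rewards, q_after_n=0.5, n=3))  # uses the first three rewards
```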

2 DRL
When a mobile robot works in a very complex environment with a large number of states and actions, the traditional RL algorithm encounters a dimensionality problem. A good solution to this problem is to use a neural network to approximate the state-action value function in the algorithm. Therefore, the state-action value function at this time is related not only to state s and action a but also to the weight parameter θ of the neural network. A method combining deep neural networks and RL is called a DRL method. The parameter θ of the network is optimized to approximate the state-action value function, and the loss function of the neural network is defined as L(θ) = E[(y_Q − Q(s, a; θ))^2], where y_Q = r + γ max_{a'} Q(s', a'; θ).
In practical applications, owing to the strong correlation between the collected state data, the RL algorithm does not converge, or even diverges, when it directly uses the neural network to approximate the Q value function. (29) To solve this problem, the DQN algorithm was proposed. (17) This algorithm uses a double neural network structure with an evaluation network and a target network. The parameter θ in the evaluation network is assigned randomly at the beginning, and the parameter θ' in the target network is copied from it periodically. Therefore, y_Q of the DQN algorithm is changed into y_DQN = r + γ max_{a'} Q(s', a'; θ'). To break the strong correlation between the training data, the DQN algorithm replays its experience. By storing the experience data e_t = (s_t, a_t, r_t, s_{t+1}) in the replay memory D, the mobile robot remembers and reuses the experience from the past. In the process of training, two tuples e_{t1} ∈ D and e_{t2} ∈ D are weakly correlated due to random sampling. The random extraction of training data in small batches from D helps to break the strong correlation between training sample data and ensures the stability of the DRL system. Thus, the loss function of the DQN algorithm is defined as L(θ) = E_{(s, a, r, s') ∼ U(D)}[(y_DQN − Q(s, a; θ))^2]. In the DQN algorithm, the target network uses the max operator to estimate the state-action value, which leads to an overestimation problem. The DDQN method solves this problem. The max operator in the target is broken down into two parts: the parameter θ is used to select the action, while θ' is used for the estimation of the state-action value function. Therefore, y_DQN of the DQN algorithm becomes y_DDQN = r + γ Q(s', argmax_{a'} Q(s', a'; θ); θ').
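As a sketch of how the double estimator changes the target, the snippet below contrasts y_DQN and y_DDQN for a small batch of transitions, using random NumPy arrays in place of the evaluation network Q(·; θ) and the target network Q(·; θ'); the values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, batch, n_actions = 0.99, 4, 5

r = rng.uniform(-1, 1, size=batch)                 # sampled rewards
q_eval_next = rng.normal(size=(batch, n_actions))  # Q(s', .; theta), evaluation network
q_targ_next = rng.normal(size=(batch, n_actions))  # Q(s', .; theta'), target network

# DQN target: the target network both selects and evaluates the action (prone to overestimation).
y_dqn = r + gamma * q_targ_next.max(axis=1)

# DDQN target: the evaluation network selects the action, the target network evaluates it.
a_star = q_eval_next.argmax(axis=1)
y_ddqn = r + gamma * q_targ_next[np.arange(batch), a_star]

print(y_dqn, y_ddqn)
```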

3 Proposed algorithm
We applied the multistep update method to DDQN to train a mobile robot for navigation planning. We named this method the multistep double deep Q-network (MS-DDQN). By combining Eqs. (7) and (12), the target value and loss function of MS-DDQN become y_MS-DDQN = r_t + γr_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n Q(s_{t+n}, argmax_a Q(s_{t+n}, a; θ); θ') and L(θ) = E_{(s_τ, a_τ, r_τ, s_{τ+n}) ∼ U(D)}[(y_MS-DDQN − Q(s_τ, a_τ; θ))^2], where U(D) represents a function that randomly extracts experience data from memory D, and the experience data stored in the replay memory become e_t = (s_t, a_t, r_t, s_{t+n}). The MS-DDQN training procedure is as follows:
With probability ε, select a random action a_t; otherwise, select a_t = argmax_a Q(s_t, a; θ).
Execute action a_t in the emulator and observe the reward r_t and the next state s_{t+1}.
Store s_t in S_t, r_t in R_t, a_t in A_t, and s_{t+1} in S_{t+1}.
If s_{t+1} is terminal, then for each τ take s_τ ∈ S_t, a_τ ∈ A_t, and s_{τ+n} ∈ S_{t+n}, and store the transition (s_τ, a_τ, r_τ, s_{τ+n}) in the replay memory D.
Sample a random mini-batch of transitions (s_i, a_i, r_i, s_{i+n}) from D.
Set y_i = r_i + γ^n Q(s_{i+n}, argmax_a Q(s_{i+n}, a; θ); θ').
Perform a gradient descent step on (y_i − Q(s_i, a_i; θ))^2 with respect to the network parameters θ.
Repeat until τ = T − 1.
End
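A minimal sketch of the part that distinguishes MS-DDQN from DDQN follows: buffering the last n steps, accumulating the discounted n-step reward before a transition enters the replay memory, and forming the corresponding target. The buffer class, the assumption that r_i in the target is the accumulated n-step reward, and the defaults n = 5 and γ = 0.9 are our own illustrative choices, not the authors' implementation.

```python
from collections import deque
import numpy as np

class NStepBuffer:
    """Collects the last n steps and emits transitions (s_t, a_t, R_t, s_{t+n}),
    where R_t = sum_k gamma^k r_{t+k} over the n buffered steps."""
    def __init__(self, n=5, gamma=0.9):
        self.n, self.gamma = n, gamma
        self.steps = deque(maxlen=n)

    def push(self, s, a, r, s_next):
        self.steps.append((s, a, r, s_next))
        if len(self.steps) < self.n:
            return None                      # not enough steps yet for an n-step transition
        s0, a0 = self.steps[0][0], self.steps[0][1]
        r_n = sum(self.gamma**k * step[2] for k, step in enumerate(self.steps))
        return (s0, a0, r_n, s_next)         # s_next plays the role of s_{t+n}

def ms_ddqn_target(r_n, q_eval_sn, q_targ_sn, n=5, gamma=0.9):
    """y = R_t + gamma^n * Q(s_{t+n}, argmax_a Q(s_{t+n}, a; theta); theta')."""
    a_star = int(np.argmax(q_eval_sn))
    return r_n + gamma**n * q_targ_sn[a_star]
```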

4 Hierarchical DRL framework
To find the best navigation planning in a complex environment, we adopted a 'divide and conquer' strategy. In this study, we adopted a hierarchical DRL framework and divided the navigation task into two submodules: an avoidance module and a navigation module. The avoidance module guides the mobile robot to avoid obstacles. With the laser scan distance data and the relative position of the mobile robot with respect to the target as input, the module outputs a discrete command to control the robot. The navigation module guides the mobile robot to reach the target position faster. It also outputs a discrete command, with the relative position of the mobile robot as its input. Which of the two modules is used depends on the nearest distance between the mobile robot and the obstacles. The hierarchical DRL framework we propose is shown in Fig. 1. The deep neural network architectures of the avoidance module and the global navigation module are shown in Fig. 2.
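The two modules can be realized as small fully connected Q-networks, one per submodule, each with its own training goal. The sketch below is an assumption-laden illustration: the hidden-layer sizes, the number of target-related input features (here 4: relative x, relative y, distance, and angle), and the use of tf.keras are ours, since the exact architecture is specified only in Fig. 2.

```python
import tensorflow as tf

def build_module(input_dim, n_actions=5, hidden=(64, 64)):
    """Fully connected Q-network; hidden sizes are illustrative, not those of Fig. 2."""
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for units in hidden:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n_actions)(x)   # one Q-value per discrete command
    return tf.keras.Model(inputs, outputs)

# Avoidance module: 36 laser distances plus target information; navigation module: target information only.
avoidance_net = build_module(input_dim=36 + 4)
navigation_net = build_module(input_dim=4)
```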

5 Reward function
The reward function gives the reward value of the mobile robot moving from the current state to the next state and indicates how well the action is taken in the current state. Generally, the reward function of the RL method gives 0 for a failed action and 1 for a completed action. This simple reward function provides sparse rewards, which are not conducive to the convergence of the algorithm. To solve this problem and accelerate the convergence of the algorithm, a new combined reward function is needed.
We divide the rewards into two types, depending on the end of the current training episode. The first reward is a 'terminal reward', which is given when the mobile robot reaches a target position or collides with an obstacle. The second reward is a 'non-terminal reward', which is given when the mobile robot is moving towards a target position. The terminal reward is divided into an arrival reward and a collision reward.
When the mobile robot reaches the target position, i.e., when the Euclidean distance d_{r-t} from the mobile robot to the target position falls below the threshold d_win for reaching the target, an arrival reward r_arr is given. When the mobile robot collides with an obstacle, i.e., when the Euclidean distance d_{r-o} between the mobile robot and the nearest obstacle falls below the collision threshold d_col, a collision reward r_col is given. Thus, the terminal reward is r_arr + r_col. Non-terminal rewards consist of three types: a positive distance reward, a danger reward, and an orientation reward.
When the mobile robot moves towards the target position, it gets a positive reward r_{t-goal}, scaled by a coefficient C_r ∈ (0, 1). This reward guides the mobile robot toward the target position. As the distance d_min of the mobile robot from the nearest obstacle becomes shorter, the danger reward r_dang decreases. In addition, we designed an 'orientation reward' r_ori, given according to the angle a_ori between the vector of the mobile robot's forward direction and the vector from the current coordinate of the mobile robot to the coordinate of the target position: when 18° < a_ori < 72°, the reward is 0.3; in other cases, it is 0. The final non-terminal reward is r_{t-goal} + r_dang + r_ori. This combination of rewards solves the sparsity problem, so the mobile robot gets a corresponding reward at each step of the training process. The combination also enables the mobile robot to learn a strategy allowing it to reach the target position faster along a shorter path.
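The combined reward can be summarized in a single function. The sketch below follows the structure described above; the arrival and collision reward magnitudes, the thresholds, the exact form of the distance (positive) reward, and the danger-reward shaping are illustrative assumptions, whereas the orientation rule and its 0.3 value follow the text.

```python
def combined_reward(d_rt, d_rt_prev, d_ro, d_min, a_ori_deg,
                    d_win=0.5, d_col=0.2, c_r=0.5):
    """Terminal + non-terminal reward; all magnitudes except the orientation reward are illustrative."""
    # Terminal rewards
    if d_rt < d_win:                     # reached the target position
        return 100.0                     # illustrative arrival reward r_arr
    if d_ro < d_col:                     # collided with the nearest obstacle
        return -100.0                    # illustrative collision reward r_col

    # Non-terminal rewards
    r_goal = c_r * (d_rt_prev - d_rt)    # assumed form: proportional to progress toward the target
    r_dang = -max(0.0, 1.0 - d_min)      # assumed shaping: more negative as d_min shrinks
    r_ori = 0.3 if 18.0 < a_ori_deg < 72.0 else 0.0  # orientation rule as stated in the text
    return r_goal + r_dang + r_ori
```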

Simulated environment
To demonstrate and evaluate the proposed method, a 2D environment for simulations was designed as shown in Fig. 3. In Fig. 3, gray ellipses, circles, and polygons represent various types of obstacles. The red circle represents the mobile robot, and the blue lines represent laser beams. When the mobile robot starts a new episode of training, its starting position is randomly placed in the light yellow rectangular area at the bottom of Fig. 3, marked with an orange circle. The target positions are randomly placed in the light cyan rectangle at the top of Fig. 3, marked with a black circle. This layout ensures that the mobile robot must pass through and safely avoid a large number of obstacles before reaching the target position, and it prevents invalid training caused by a starting position too close to the target. The size of the training environment (Env-1) is 500 × 500.
The mobile robot senses its surroundings through a simulated LRF with a field of view (FOV) of 180°. Distance is measured by the mobile robot with an angular resolution of 5°. The output of the simulated LRF is a 36-dimensional vector o_t, which is normalized by the maximum effective range o_max. The first element of o_t always corresponds to the measurement in the horizontal left direction, followed by the measurements proceeding in the clockwise direction. In addition, the mobile robot also obtains the relative 2D coordinate information p_t of its current position relative to the target position, the Euclidean distance Ω_t from the current position to the target position, and the relative angle A_t of the target point with respect to the robot's forward direction. The robot moves at a constant speed. The command set for controlling the mobile robot is composed of five discrete commands: turn left 15°, turn left 30°, forward, turn right 15°, and turn right 30°. In the 2D environment, the mobile robot needs to continuously sense the surrounding environment to avoid obstacles on the way to the target position.
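The sketch below assembles the state vector and the discrete command set as described above; the variable names, the maximum effective range value, and the angle-handling details are illustrative assumptions.

```python
import numpy as np

O_MAX = 5.0                      # assumed maximum effective LRF range used for normalization
COMMANDS = ("turn left 15", "turn left 30", "forward", "turn right 15", "turn right 30")

def build_state(laser_distances, robot_xy, target_xy, heading_rad):
    """36 normalized laser readings plus relative position, distance, and angle to the target."""
    o_t = np.clip(np.asarray(laser_distances, dtype=float) / O_MAX, 0.0, 1.0)  # leftmost beam first
    p_t = np.asarray(target_xy, dtype=float) - np.asarray(robot_xy, dtype=float)
    omega_t = np.linalg.norm(p_t)                                 # Euclidean distance to the target
    a_t = np.arctan2(p_t[1], p_t[0]) - heading_rad                # target angle w.r.t. forward direction
    return np.concatenate([o_t, p_t, [omega_t, a_t]])
```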

Training results
The autonomous navigation capability of the mobile robot was trained with MS-DDQN in the environment Env-1. During the training process, the starting position of the mobile robot and the target position were randomly initialized at the beginning of each training episode. When the mobile robot collided with an obstacle or reached the target position, a new episode of training started. The network models were built with TensorFlow and implemented on a single GIGABYTE 2070 GPU. The simulated environment was run on an Intel i7-6800k CPU. The training parameters are given in Table 1.
To verify the effectiveness of the multistep DRL method, we compared the MS-DDQN algorithm with the DDQN, prioritized DQN, and prioritized DDQN algorithms in terms of training the navigation ability of the mobile robot in Env-1. We used the same network structure and the same software and hardware platforms for the training. The success rate indicates the probability of the mobile robot successfully reaching the target position. Reward curves represent the sum of the rewards obtained by the mobile robot in each episode of the training. We used a sliding average method to process the curves with a sliding window size of 300. The average reward is the mean of the rewards for the mobile robot over 3000 episodes. Figures 4(a) and 4(b) show that the success rate curve of MS-DDQN rises faster than those of the other three methods, indicating that MS-DDQN has the highest learning efficiency. After 3000 episodes of training, MS-DDQN has a much higher success rate than the other algorithms. The success rate with MS-DDQN is 80.133%, while those with DDQN, prioritized DQN, and prioritized DDQN are 61.7, 63.633, and 53.366%, respectively. This indicates that the mobile robot trained by MS-DDQN avoided obstacles far better owing to its improved navigation capabilities. The average reward value of MS-DDQN is 185.072, while those of DDQN, prioritized DQN, and prioritized DDQN are 130.064, 132.067, and 101.650, respectively. This also proves that the mobile robot with MS-DDQN had stronger navigation capabilities. A lower reward value means many negative rewards, implying that the mobile robot had more collisions. Figure 4(b) shows that the reward with MS-DDQN remained above 200 after 500 episodes of training, while those with the other methods showed greater fluctuation. The navigation model learned by MS-DDQN has higher stability than those of the other methods.
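For reference, the curve smoothing described above can be reproduced with a simple moving average; this is our own minimal implementation, not the authors' plotting code.

```python
import numpy as np

def sliding_average(values, window=300):
    """Smooth a per-episode curve with a moving average of the given window size."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```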

Test results
The navigation model was tested 200 times in Env-1. In each test, the starting position and target position of the mobile robot were randomly assigned. The performance of each algorithm was measured on the basis of the success rate and the average reward of the mobile robot reaching the target position in the 200 episodes. A higher success rate and average reward indicate that the navigation model provides a better strategy. The results are shown in Table 2. After 3000 episodes of training, the mobile robot trained by each of the four algorithms in the table had learned how to avoid obstacles and reach the target position in Env-1. MS-DDQN had a success rate of 100% and the highest average reward. The navigation trajectory of the MS-DDQN-trained mobile robot is shown in Fig. 5.
To evaluate the performance of the proposed method, we designed four test environments that differed from the training environment. The sizes of the four test environments were 500 × 500 (Env-2), 600 × 600 (Env-3), 700 × 700 (Env-4), and 800 × 800 (Env-5). In these environments, the starting and target positions of the mobile robot were randomly initialized (light yellow and light cyan shaded areas in Fig. 6, respectively). The navigation models based on MS-DDQN, DDQN, prioritized DQN, and prioritized DDQN were tested 200 times in each of the four test environments. The navigation trajectories of the mobile robot with MS-DDQN in the different environments (Fig. 6) show that the navigation model trained by MS-DDQN had high generalization ability, adapting well to the new unknown environments. As shown in Table 3, the success rates of the navigation model trained by MS-DDQN were 97% (Env-2), 91% (Env-3), 94% (Env-4), and 96% (Env-5). However, the navigation models trained by the other three algorithms did not reach a success rate of 90% in the test environments. The success rate of DDQN in Env-3 was only 46%. This demonstrates that the MS-DDQN-based navigation model successfully solved the navigation problem of the mobile robot in a new environment. The test results also confirm that the navigation strategy with MS-DDQN is more stable than those of the other methods. The generalization ability of prioritized DDQN and prioritized DQN was better than that of DDQN because these two methods performed targeted training and learning on collision training samples during the training process.
The above experiments revealed that the performance of the navigation model trained by MS-DDQN was better than that of the navigation models trained by DDQN, prioritized DQN, and prioritized DDQN. The reason is that MS-DDQN integrated the multistep update method in the RL into DDQN. When training and learning, MS-DDQN calculates the impact of the rewards obtained in the next few steps on the current state. Therefore, the mobile robot navigation model trained by this method has greater ability to sense obstacles, and the mobile robot can devise avoidance strategies in advance.
In MS-DDQN, the number of steps n is a sensitive hyperparameter. We examined how different values of n affect the performance of MS-DDQN. The larger the value of n, the greater the impact of future rewards. We conducted 3000 episodes of training for each n ∈ {1, 2, ..., 8} in Env-1. The training results are shown in Fig. 7. The experimental results show that a larger n gives a higher learning efficiency with MS-DDQN. When n = 1, MS-DDQN degrades to DDQN, and the learning efficiency is the lowest. However, the success rates for n = 3, 4, and 5 are similar, which indicates that the model with MS-DDQN already has the best strategy to avoid obstacles in advance with n = 3. Thus, even though increasing n further increases the success rate, the degree of improvement is not high. For example, when n = 6, 7, and 8, the difference in the success rate is less than 3% from that when n = 5. Moreover, as n further increases, MS-DDQN gradually degenerates into the MC method. Therefore, we set n as 5 in this study with MS-DDQN.

Discussion
There are three novelties in this study as follows:
1. We propose the MS-DDQN algorithm, which combines the multistep update method in RL with the DDQN algorithm. This method was applied for the first time to solve the problem of the autonomous navigation of mobile robots in unknown environments. The method enables a mobile robot to perceive obstacles in advance, thereby improving its autonomous obstacle avoidance and navigation capabilities.
2. We construct two neural network structures: the local obstacle avoidance neural network and the global navigation neural network. The advantage of this DRL architecture is that each deep neural network has a clear training/learning goal. Local obstacle avoidance is mainly used by the mobile robot to avoid obstacles at a short distance, and global navigation is mainly used to quickly reach the target position.
3. We propose a new combined reward function that is divided into terminal rewards and non-terminal rewards. Terminal rewards include rewards for reaching the target position and rewards for collisions; non-terminal rewards include a positive reward, a danger reward, and an orientation reward. Through the combined reward function, not only can the convergence speed of the DRL algorithm be increased, but also the mobile robot can be guided to learn the strategy of reaching the target position along a shorter path.
In Sect. 3.2, we compared the proposed MS-DDQN algorithm with the DDQN, (17) prioritized DQN, (20) and prioritized DDQN algorithms. It was found that MS-DDQN has higher learning efficiency as well as greater ability to avoid obstacles.

Conclusions
To solve the navigation problem of a mobile robot in an unknown environment, we proposed a multistep DRL method (MS-DDQN) that applies the multistep method of RL to the DDQN network. In the process of training and learning, we tested the subdivided reward system comprising terminal rewards (arrival, collision) and non-terminal rewards (distance, orientation, danger). With these rewards, the mobile robot was able to learn and improve its navigation ability. The learning efficiency of MS-DDQN in this study was higher than that of the other methods, with a success rate in the training environment of 100%, while the success rates of DDQN, prioritized DDQN, and prioritized DQN were 88, 94, and 91%, respectively. In four test environments that were completely different from the training environment, the success rate of MS-DDQN was 91-97%, compared with 46-70% for DDQN, 62-77% for prioritized DQN, and 77-87% for prioritized DDQN. The well-established reward system enabled such results, as the total reward of MS-DDQN was 7-94% higher than those of the other methods. The learning speed was also higher, as MS-DDQN achieved the highest reward in the fewest episodes using a multistep horizon of just five steps (n = 5). This proves that MS-DDQN has high generalization ability, stability, and efficiency, and a mobile robot using the method can perform real-time autonomous navigation in an unknown environment with only 36-dimensional laser detection distance information. In future work, we will consider using a deterministic policy gradient algorithm to improve the algorithm in this study and solve more general and realistic navigation problems such as navigation in a continuous motion space and multi-mobile robot navigation control.