Autofocus System and Evaluation Methodologies: A Literature Review

The autofocus (AF) system has been gaining popularity for over 40 years. Embedded in a camera, it can bring the best-focused image to the viewer within a few seconds, which frees users from focusing manually. An AF system usually consists of a motor, a lens, and processing and control units. Over years of development, the AF system has matured in terms of both technology and market, and many fast and accurate AF systems have been introduced and widely installed in compact cameras and digital single lens reflex (DSLR) cameras. Meanwhile, the market for thermal infrared cameras has been growing in recent years owing to their decreasing cost and wide use by both civilians and the military. The AF systems in those cameras adopt similar but more complex mechanisms. This article serves as a literature review of the state of the art of the AF system in both visible light digital cameras and thermal infrared cameras.


Introduction
The first mass-produced autofocus (AF) camera was the Konica C35 AF, which was released in November 1977. (1) Nowadays, the AF system is embedded in almost every compact digital camera and digital single lens reflex (DSLR) camera, and some smartphones and tablets also incorporate an AF system. An AF system in a digital camera is a feedback control system normally composed of three parts: (1) a motor that drives a camera lens along the optical axis iteratively to search for the lens position of the best focus, (2) a group of lenses that converge light rays onto the image sensor, and (3) a processing unit that both carries out computations such as the focus value of each frame and issues control signals to the motor. Focus accuracy and speed are two important indicators for evaluating an AF system. They are affected by the algorithm selected for searching for the best focus position, the noise level, the motor performance, specifications of the optics such as the f-number and the density of the lens, and whether or not the subject is moving. AF can be achieved actively, passively, or by a hybrid of the two. The active method includes ultrasonic, infrared, (2) and time-of-flight (TOF) types. The camera emits a beam of light or sound and receives the reflected beam to estimate the distance between the camera and the subject. For instance, Canon's "sure-shot" is an infrared-type AF. It uses a triangulation technique to estimate the distance to the subject. (3,4) The passive method is categorized into phase detection and contrast detection AF. The former measures the phase difference between two captured images to estimate the focus position. The latter measures the sharpness of each frame to find the best focus position. Each method has pros and cons. For example, active AF works under any illumination condition. However, it cannot "see" through windows and occlusions. Its accuracy is also inferior to that of the passive AF system. (1) Passive AF can detect and focus on subjects behind windows but may fail when the illumination is poor. In comparison with contrast detection, phase detection distinguishes between near and far focus. It is generally faster than contrast detection and is well suited to focusing on moving objects. However, its accuracy is inferior to that of contrast detection. (5) Some cameras employ a hybrid-type AF, which combines active and passive AF, to ensure better performance than either alone, but they are more expensive and bulky.
Among the methods mentioned above, contrast-based AF has received considerable attention because of its good performance and low cost. (6,7) There are numerous papers and patents on contrast-based AF, some of which have already been used in consumer digital cameras or mobile phone cameras. In this paper, we review the literature on contrast-based AF from the early years to the state of the art. In addition, we contribute a section on the AF of thermal infrared cameras [also known as long-wave infrared (LWIR) cameras], because the LWIR camera has seen rapid growth in recent years in both consumer and military markets. Some high-end models incorporate an AF system, whose design is similar to that of digital cameras but more challenging.

Focus measure function (FMF)
In an AF system, the FMF calculates the focus value for each frame. The conventional AF system searches for the peak of the focus values or the highest frequency component as the lens moves from an out-of-focus to the in-focus position. (8,9) For example, Fig. 1 shows some video frames captured during the AF process using a thermal infrared camera. The corresponding curves at the bottom are calculated at each frame using the Gaussian focus measure (GFM). A good FMF should possess the following properties: (1) independence from the scene being captured; (2) speed and accuracy; (3) good reproducibility; (10) and (4) low sensitivity to noise, which otherwise produces local maxima. Many FMFs were proposed in the 1970s-80s for microscopes. (11)(12)(13)(14)(15) Later, researchers found that some of the old FMFs are not suitable for digital cameras. (16) Therefore, many improved algorithms were introduced.
A comprehensive study of FMFs was published in 1985. (10) The authors compared 11 different FMFs and evaluated them on three different types of images. The results show that the squared gradient, Laplacian, and normalized variance outperform the other 8 FMFs for all three images. However, the number of images used for evaluation is small, so it is still difficult to tell which FMF is superior.
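The three best-performing FMFs named above can be sketched in a few lines. The following is a minimal pure-Python illustration, not the formulation used in Ref. 10: the exact definitions (horizontal differences only, a 4-neighbour Laplacian, variance divided by the mean) and the function names are assumptions chosen for brevity, and images are assumed to be lists of rows of grey levels.

```python
def squared_gradient(img):
    """Sum of squared horizontal first differences (assumed definition)."""
    return sum((row[x + 1] - row[x]) ** 2
               for row in img for x in range(len(row) - 1))


def laplacian_energy(img):
    """Sum of squared 4-neighbour Laplacian responses over interior pixels."""
    height, width = len(img), len(img[0])
    total = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x])
            total += lap ** 2
    return total


def normalized_variance(img):
    """Grey-level variance divided by the mean, to reduce brightness dependence."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    if mean == 0:
        return 0.0
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return variance / mean


# Synthetic test frames: a sharp step edge and a defocused (ramp) version of it.
sharp = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
blurred = [[0, 20, 40, 60, 80, 100] for _ in range(6)]

for fmf in (squared_gradient, laplacian_energy, normalized_variance):
    print(fmf.__name__, fmf(sharp), fmf(blurred))
```

All three measures return a larger value for the sharp frame than for the blurred one, which is the property the search algorithm relies on.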
The same team in Ref. 16 conducted further research in 2015. (34) They expanded the image sets to 32 benchmarks for evaluation, a total of 5344 images. Additionally, 11 more sets were captured in a dark room to evaluate FMFs under low light conditions. As far as we know, this is by far the largest image set used for evaluating an AF system. However, they evaluated only the squared gradient (24)(25)(26)(28) and GFM, (18)(35)(36)(37) and found that the latter is significantly better than the former under low light conditions. This is because the GFM smooths the input image before taking the first-order derivatives, which filters out much of the noise so that it does not contribute to the focus measure. Other publications (38,39) also discussed the effectiveness of the GFM under low light conditions. Another FMF that has been reported to be effective in reducing noise under low light conditions is the frequency selective weighted median (FSWM) filter. (40) Choi et al. proposed this filter as an FMF for the AF system. (41) It was based on previous research on weighted median filters. (42) The FSWM filter can extract high-frequency components from an image while also reducing impulsive noise. The authors used 11 indoor and 3 outdoor scenes, a total of 2100 images, to evaluate the proposed FSWM filter and compared it with three other FMFs: one first-order derivative, one Laplacian, and one absolute gradient. In terms of noise reduction, the FSWM outperformed the other three methods by effectively eliminating impulsive noise while keeping useful high-frequency components such as edges and corners unchanged. However, the authors did not compare the FSWM with the GFM, which also demonstrates good noise reduction ability.
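To make the smoothing argument concrete, the sketch below computes a Gaussian-derivative focus measure in pure Python. It is an interpretation of "smoothing before taking the first-order derivatives", not the implementation from the cited papers: the kernel radius of three standard deviations, the valid-mode border handling, and all function names are assumptions.

```python
import math


def gaussian_kernels(sigma):
    """Return a normalised Gaussian kernel and its first derivative."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in offsets]
    total = sum(weights)
    smooth = [w / total for w in weights]
    deriv = [-x / (sigma * sigma) * w for x, w in zip(offsets, smooth)]
    return smooth, deriv


def filter_rows(img, kernel):
    """Valid-mode 1-D correlation along each row (sign is irrelevant once squared)."""
    k = len(kernel)
    return [[sum(kernel[j] * row[i + j] for j in range(k))
             for i in range(len(row) - k + 1)]
            for row in img]


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_focus_measure(img, sigma=1.0):
    """Energy of the Gaussian-smoothed first derivatives in x and y."""
    smooth, deriv = gaussian_kernels(sigma)
    # Smooth vertically, then differentiate horizontally.
    gx = filter_rows(transpose(filter_rows(transpose(img), smooth)), deriv)
    # Smooth horizontally, then differentiate vertically.
    gy = filter_rows(transpose(filter_rows(img, smooth)), deriv)
    return sum(v * v for row in gx + gy for v in row)


# Synthetic 9 x 9 frames: sharp edge, defocused edge, flat, and flat plus one noisy pixel.
sharp = [[0] * 4 + [100] * 5 for _ in range(9)]
blurred = [[0, 0, 0, 25, 50, 75, 100, 100, 100] for _ in range(9)]
flat = [[50] * 9 for _ in range(9)]
impulse = [row[:] for row in flat]
impulse[4][4] = 250

for name, frame in (("sharp", sharp), ("blurred", blurred),
                    ("flat", flat), ("impulse", impulse)):
    print(name, round(gaussian_focus_measure(frame), 1))
```

In this toy case, a single noisy pixel with twice the amplitude of the edge still scores below the sharp edge, which illustrates why smoothing helps under low light. The price is that `sigma` becomes a tuning parameter: too large a value also suppresses genuine fine detail.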
Some other methods have also been proposed to tackle the noise problem under low contrast conditions. Many studies used variance-based methods employing the discrete cosine transform (DCT). Xu et al. (43) cited some traditional DCT-based methods that are vulnerable to noise and also some modified methods that are effective for noisy scenes.
A quantitative evaluation of the noise sensitivity of FMF was documented by Subbarao and Tyan. (35) They used standard deviation and root-mean-square error to evaluate noise sensitivity. Results showed that the best focus measure is dependent on both noise level and image texture.

Search algorithm
The search algorithm is the basis for all contrast-based AF systems. It directly determines AF speed and accuracy. Search algorithms can be divided into hill climbing search (HCS), (9) Fibonacci search, (19,44) curve fitting search, (44,45) binary search, (46) and combinations of the above-mentioned search methods. The conventional hill climbing algorithm without noise reduction can stop at a local maximum instead of the true peak. Its result also depends on the FMF.
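The local-maximum problem of hill climbing can be reproduced with a minimal sketch. The function below is a textbook one-step hill climb, not the algorithm of Ref. 9; `focus_at` stands for a hypothetical callback that moves the lens to a position and returns the focus value, and the two focus curves are synthetic.

```python
def hill_climb(focus_at, start, lo, hi):
    """Move one lens step at a time while the focus value keeps increasing."""
    pos, best = start, focus_at(start)
    for direction in (+1, -1):
        moved = False
        while lo <= pos + direction <= hi:
            value = focus_at(pos + direction)
            if value <= best:
                break  # focus value stopped increasing: treat pos as the peak
            pos, best, moved = pos + direction, value, True
        if moved:
            break  # the first direction was uphill, no need to try the other
    return pos


def clean_curve(p):
    """Synthetic single-peak focus curve with its maximum at lens step 30."""
    return 1000.0 / (1.0 + (p - 30) ** 2)


def noisy_curve(p):
    """Same curve with a spurious noise spike at lens step 10."""
    return clean_curve(p) + (50.0 if p == 10 else 0.0)


print(hill_climb(clean_curve, start=0, lo=0, hi=60))   # reaches the true peak
print(hill_climb(noisy_curve, start=0, lo=0, hi=60))   # stops at the noise spike
```

On the clean curve the search reaches step 30 from either side, whereas on the noisy curve it halts at step 10, far from the in-focus position.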

Global search
The simplest search algorithm is the global search, in which the AF system measures the focus value at every lens step. (46) This is not efficient because it searches the entire lens range for the peak at the same speed. A highly blurred frame captured during the AF process is unlikely to be adjacent to a sharp frame, that is, to the probable in-focus position. At such positions, the lens can move faster by taking larger steps instead of uniform steps.

Rule-based search
Therefore, a rule-based search algorithm was proposed by Kehtarnavaz and Oh. (46) They divided the search range into coarse, middle, and fine regions. The fine region corresponds to the lens positions that probably include the global peak, so the motor stops at every step to calculate the focus value. In the middle and coarse regions, the motor stops every three to four and every seven to ten steps, respectively. The authors compared their rule-based algorithm with global search and binary search algorithms and showed that it was the fastest in terms of both the number of iterations and the number of steps. Although the rule-based search is faster than the other two, it still requires a full sweep of the lens focus range.
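The saving from variable step sizes can be illustrated with a small simulation. This is only a sketch of the idea: the rules in Ref. 46 are not reproduced here, and the use of two absolute focus-value thresholds to pick the region, the step sizes of 8 and 3, and the synthetic focus curve are all assumptions.

```python
def global_search(focus_at, lo, hi):
    """Measure the focus value at every lens step and return the best one."""
    values = {p: focus_at(p) for p in range(lo, hi + 1)}
    best = max(values, key=values.get)
    return best, len(values)


def rule_based_search(focus_at, lo, hi, middle_threshold, fine_threshold,
                      coarse_step=8, middle_step=3):
    """Sweep the full range, choosing the step size from the current focus value."""
    pos, best_pos, best_val, evaluations = lo, lo, float("-inf"), 0
    while pos <= hi:
        value = focus_at(pos)
        evaluations += 1
        if value > best_val:
            best_pos, best_val = pos, value
        if value >= fine_threshold:        # fine region: stop at every step
            pos += 1
        elif value >= middle_threshold:    # middle region
            pos += middle_step
        else:                              # coarse region: heavily blurred
            pos += coarse_step
    return best_pos, evaluations


def focus_curve(p):
    """Synthetic focus curve peaking at lens step 60."""
    return 1000.0 / (1.0 + ((p - 60) / 5.0) ** 2)


print(global_search(focus_curve, 0, 100))
print(rule_based_search(focus_curve, 0, 100,
                        middle_threshold=100.0, fine_threshold=400.0))
```

Both searches find the same peak on this curve, but the rule-based sweep evaluates far fewer frames. The thresholds would need tuning in practice: if they are set too high for a narrow peak, the coarse steps can jump over it.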

Model-based search
A model-based method has also been reported. Chen et al. proposed a method that combines a discrete difference equation prediction model (DDEPM) with a bisection search algorithm to search for the best focus position. (47) The DDEPM can predict the trend of the focus value curve and quickly locate the neighborhood of the in-focus position. Their method achieved real-time AF, taking 384.2 ms on average over 10 evaluations, and showed good accuracy.

Coarse to fine search
The majority of AF methods (34)(45)(46)(47)(48)(49) employ a coarse to fine search scheme. As mentioned above, this scheme can effectively reduce the time for searching for the best focus position. FLIR Systems has proposed a two-step searching algorithm. (45) The first step uses a coarse but fast search method based on the low spatial frequencies of the image, which is followed by a fine but slower search method based on the high frequencies of the image. Chen et al. (34) followed the work of He et al. (48) and Li (49) but employed two additional fine steps at the beginning to predict the direction in which a peak most probably exists.
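A generic two-stage version of this scheme can be sketched as follows. It is not the algorithm of Ref. 45 or Ref. 34: the coarse step of 8, the size of the fine neighbourhood, and the callback names `coarse_focus_at` and `fine_focus_at` are assumptions, and the demo uses the same synthetic curve for both stages although the two callbacks could use different frequency bands.

```python
def coarse_to_fine(coarse_focus_at, fine_focus_at, lo, hi, coarse_step=8):
    """Locate the peak roughly with large steps, then refine with single steps."""
    coarse = {p: coarse_focus_at(p) for p in range(lo, hi + 1, coarse_step)}
    centre = max(coarse, key=coarse.get)
    # Refine only between the two coarse neighbours of the best coarse sample.
    fine_lo = max(lo, centre - coarse_step + 1)
    fine_hi = min(hi, centre + coarse_step - 1)
    fine = {p: fine_focus_at(p) for p in range(fine_lo, fine_hi + 1)}
    best = max(fine, key=fine.get)
    return best, len(coarse) + len(fine)


def focus_curve(p):
    """Synthetic focus curve peaking at lens step 37."""
    return 1000.0 / (1.0 + ((p - 37) / 5.0) ** 2)


position, evaluations = coarse_to_fine(focus_curve, focus_curve, 0, 100)
print(position, evaluations)
```

For a range of 101 lens steps, the sketch evaluates fewer than 30 frames, compared with 101 for a global search, while still returning the same peak on this unimodal curve.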

Curve-fitting-based methods
The curve fitting method fits the focus data to a curve such as a polynomial or Gaussian function, then calculates the maximum of the curve and finds the corresponding lens position. During the AF process, only three or four initial focus values need to be calculated in order to locate the best focus position. Therefore, it reduces AF time. As mentioned earlier, FLIR Systems uses a two-step search algorithm. (45) During the fine search stage, they adopted a curve fitting method to locate the best focus position. Chen et al. used four initial lens positions (47) to predict the best focus position using a DDEPM. However, the curve fitting search algorithm is reported to be highly dependent on the data acquired around the peak. If the noise level is high, such as under low light conditions, local maxima appear. In this case, the algorithm may fail to locate the correct in-focus position.
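As a worked illustration of locating the peak from three focus values, the sketch below uses the standard closed-form vertex of a parabola through three points, and fits a Gaussian by applying the same formula to the logarithm of the focus values. The function names and sample values are hypothetical, and this is not the specific fitting procedure of Ref. 45 or Ref. 47.

```python
import math


def parabola_peak(xs, ys):
    """Vertex of the parabola through three (lens position, focus value) samples."""
    (a, b, c), (fa, fb, fc) = xs, ys
    numerator = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    denominator = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if denominator == 0:
        raise ValueError("samples are collinear; no unique peak")
    return b - 0.5 * numerator / denominator


def gaussian_peak(xs, ys):
    """Peak of a Gaussian through three samples (focus values must be positive)."""
    return parabola_peak(xs, [math.log(y) for y in ys])


# Three samples of a parabola whose true maximum is at lens position 37.5.
positions = (30, 36, 42)
values = [100.0 - (p - 37.5) ** 2 for p in positions]
print(parabola_peak(positions, values))

# Three samples of a Gaussian focus curve centred at lens position 41.
g_positions = (30, 38, 46)
g_values = [500.0 * math.exp(-((p - 41) ** 2) / 128.0) for p in g_positions]
print(gaussian_peak(g_positions, g_values))
```

Because the estimate is computed from only three values, any noise in those values shifts the predicted peak directly, which is consistent with the sensitivity noted above.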

Machine-learning-based methods
Machine learning has also been used for AF. Machine-learning-based methods do not require the lens to search the entire range but jump to the best focus position on the basis of trained results. This reduces AF time significantly. The earliest learning-based searching algorithm was proposed by Park et al. (50) They used the depth from defocus (DFD) method to reduce the computational cost and AF time, because this method finds the best focus position by referring to only two prefixed defocused lens positions, whereas the depth from focus (DFF) method uses multiple positions. A multilayer neural network (MNN) was utilized to classify the distance to the objects. However, the authors did not mention how fast their method is compared with conventional methods such as the rule-based methods.
Chen et al. (51) used a well-trained self-organizing map (SOM) neural network to predict the best focus position in order to reduce the searching time. A frequency domain approach called "Discrete Wavelet Transformation (DWT)" was proposed in their paper to search for the highest frequency from the captured images, which corresponds to the best focus position. The input of the SOM network is three initial focus values. The output of the network is the lens position for the best focus. Then, a backward search is carried out to search for the best focus position more precisely. Compared with conventional full search methodologies, the SOM-based search algorithm increased the AF speed 2.5 times.
Similarly, Han et al. (52) also used a training-based method to reduce the searching time, but they used the focus value incremental ratio as the feature vector, which they said is less vulnerable to texture and illumination changes. This method also needs only three initial focus values to be calculated, so that the AF time is significantly reduced compared with those of conventional approaches, for example, 3.2 times faster than the rule-based approach.
A recent learning-based approach was proposed by Chen and van Beek. They introduced a supervised machine learning approach, (34) in which two decision tree classifiers are defined to decide the state of the focusing process and locate the best focus positions. They used two sets of feature vectors, and each set includes many different features. While their approach is superior to He et al.'s coarse to fine method (48) in AF accuracy, and obtained better accuracy even under low light conditions, it showed some decrease in AF speed. In addition, the authors of Refs. 33 and 52 all mentioned that the focus measure itself is not a good choice for the feature vector because this value is easily affected by texture and illumination changes.

Focus window
The selection of focus window also affects the AF accuracy and speed. The focus window being too large results in redundant data, thereby increasing the computational load. In contrast, a very small window may not contain the subject that needs to be focused on. (53) A focus window can be defined by users or automatically. (34,46,54,55) FLIR's AF system for a thermal infrared camera can choose a focusing window by analyzing input images (45) rather than routinely choosing the central area, which is sometimes not very informative. Lee et al. proposed an AF algorithm by dividing the entire image into multiple windows, each with 40 × 40 pixels. These windows are used in their two-step search scheme. (55) Compared with traditional methods, their method is very effective when multiple objects of different depths exist. Rahman and Kehtarnavaz proposed an AF approach for focusing on the human face. (54) Their method performs better than multiple-window AF.
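One simple way to choose a focus window automatically is to tile the image and keep the tile with the highest focus value. The sketch below illustrates only that idea; it is not the window selection of Ref. 45 or the multi-window scheme of Ref. 55, and the squared-gradient score, the non-overlapping tiling, and the function names are assumptions.

```python
def window_focus(img, top, left, size):
    """Squared horizontal gradient summed inside one square window."""
    total = 0
    for y in range(top, top + size):
        row = img[y]
        for x in range(left, left + size - 1):
            total += (row[x + 1] - row[x]) ** 2
    return total


def select_focus_window(img, size=40):
    """Return (top, left) of the non-overlapping window with the highest score."""
    height, width = len(img), len(img[0])
    windows = [(top, left)
               for top in range(0, height - size + 1, size)
               for left in range(0, width - size + 1, size)]
    return max(windows, key=lambda w: window_focus(img, w[0], w[1], size))


# 8 x 8 toy frame: flat everywhere except a striped texture in the lower right.
frame = [[0] * 8 for _ in range(8)]
for y in range(4, 8):
    for x in range(4, 8):
        frame[y][x] = 100 if x % 2 else 0

print(select_focus_window(frame, size=4))
```

In the toy frame the central area is not where the texture lies, so a fixed central window would be less informative than the selected one. A real system would also need a rule for which subject the user intends, which a sharpness score alone cannot supply.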

Focusing on moving subjects
Focusing on a moving subject is challenging, especially when the subject moves fast. Fortunately, camera makers have already incorporated such a feature in some of their high-end products. Typical examples are the artificial intelligence (AI) servo of Canon's DSLRs, (56,57) the continuous focus (AF-C) of Nikon's DSLRs, (58) trap focus, (59) and a motion detection device by FLIR Systems that follows the movement of a subject and focuses on it simultaneously. (45) Zoom tracking, a method of continuously focusing on a moving subject along the camera's optical axis during zooming, was also described. (60)(61)(62)

AF System for Thermal Infrared Camera
Thermal infrared (long-wave infrared or LWIR) refers to the spectral band of approximately 7-14 µm. Cameras receiving this band of light do not require illumination of the subjects but "sense" the energy emitted directly from them. Therefore, the LWIR camera is very useful for night vision, surveillance, and military purposes.
There are many thermal imagers available on the market. Some high-end types provide an AF feature. Basically, the thermal infrared camera can use FMFs and searching algorithms similar to those of a digital camera to locate the best focus position. A recent article about thermal imager AF by Srivastava et al. proposed a sharpness evaluation algorithm based on a cumulative gradient measure. (63) They evaluated their algorithm under low contrast and noisy conditions and also examined the effect of the focus window on the focusing result. Cakir and Cetin used a cumulative probability of blur detection (CPBD) method to measure the amount of blur in an AF system for thermal infrared cameras. (64) The CPBD serves as an FMF, and the sharpest frame corresponds to the highest focus value.
One of the problems when designing an AF apparatus for the LWIR camera is the low level of energy reaching the sensor. (45) Therefore, the f-number of the LWIR camera is generally smaller than those of visible light cameras, normally between 0.8 and 1.2. The resulting large aperture gives a shallow depth of field, which makes focusing much harder than with visible light cameras. Another issue is the temperature dependence of the optics, which also affects AF performance. (45) As a result, the design of the AF apparatus for the LWIR camera becomes much more difficult. Cakir and Cetin (64) also addressed the difficulty of applying the CPBD to a thermal imager because of its inherent noise problem. Their countermeasure was to modify the CPBD algorithm to increase edge quantity rather than edge quality. Results showed that the modified CPBD outperformed the conventional CPBD for the LWIR imager. In addition, the optical lens of the LWIR camera usually adopts germanium (Ge) as the material, which is highly transmissive to LWIR but has a much higher density (5.323 g/cm³) than a digital camera lens. This makes the lens so heavy that conventional AF motors either cannot drive it or can drive it only at a low speed. Typical solutions are to choose a lens material with a lower density (and thus lower weight), such as silicon (Si, 2.57 g/cm³); to select a motor with high torque, speed, and accuracy; or to move the sensor instead.
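The link between a small f-number and a tight focusing tolerance can be made explicit with the standard first-order approximation for the image-side depth of focus. The symbols below are not from the source: $N$ is the f-number, $c$ is the acceptable blur-circle diameter, and the value $c = 17\,\mu\text{m}$ (roughly one detector pitch) is an assumption used only for illustration.

```latex
% Total image-side depth of focus for a distant subject (first-order approximation)
\delta \approx 2 N c

% Illustrative values with an assumed blur circle c = 17 um
N = 1.0:\quad \delta \approx 2 \times 1.0 \times 17\,\mu\text{m} = 34\,\mu\text{m}
N = 2.8:\quad \delta \approx 2 \times 2.8 \times 17\,\mu\text{m} \approx 95\,\mu\text{m}
```

Under these assumptions, the lens-to-sensor distance must be set almost three times more precisely at $N = 1.0$ than at $N = 2.8$, which is one way to see why the motor accuracy and the step size of the search matter more for an LWIR camera.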
A thermal image is noisier than a visible light image because it has low resolution, typically around 320 × 240 or 640 × 480. Therefore, noise reduction is always required for an LWIR camera.
It should be mentioned that some exclusive properties of the LWIR camera, e.g., thermal difference, can be used for determining the focus window. (45) This cannot be achieved using visual cameras because they do not measure the temperature of subjects.

Evaluation Methodologies
An AF system is evaluated on whether its performance meets the required criteria such as accuracy, speed, and ease of implementation. (17) Many evaluation methodologies have been reported.
Shih mentioned the subjectivity of manually choosing the in-focus position in order to evaluate the accuracy of AF algorithms. (17) He compared the results of seven FMFs with manually selected in-focus images. As the manual selection is a subjective process, the author suggested collecting evaluation results from a group of observers for each FMF.
Chen and van Beek proposed a method of evaluating both accuracy and speed. (34) The speed is defined as the lens steps taken for AF and the accuracy is defined as the percentage where the peak is found.
Yousefi et al. evaluated the performance of four FMFs using 60 different AF sequences. For each FMF, they defined accuracy as the percentage of cases matching the true in-focus position and speed as the time in milliseconds averaged over all 60 sequences. (21) However, they did not mention how the true in-focus position is defined.
Mir et al. introduced precision, recall, and mean absolute error (MAE) (16) for evaluating different FMFs. In their paper, ground truth is determined by letting a person view the captured images on the camera's display screen. This, however, may introduce subjectivity.
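The accuracy, speed, and MAE figures described in this section can be computed from a table of per-sequence results. The sketch below is a minimal illustration under assumed definitions: the tuple layout, the optional `tolerance` in lens steps, and the sample numbers are hypothetical and are not taken from any of the cited papers.

```python
def evaluate_af(results, tolerance=0):
    """Summarise AF runs given (found_position, true_position, lens_steps) tuples."""
    if not results:
        raise ValueError("at least one AF sequence is required")
    n = len(results)
    hits = sum(1 for found, truth, _ in results
               if abs(found - truth) <= tolerance)
    return {
        # Percentage of sequences in which the peak was found.
        "accuracy_percent": 100.0 * hits / n,
        # Speed expressed as the average number of lens steps taken.
        "mean_steps": sum(steps for _, _, steps in results) / n,
        # Mean absolute error between found and true lens positions.
        "mae": sum(abs(found - truth) for found, truth, _ in results) / n,
    }


# Hypothetical results for four AF sequences.
runs = [(60, 60, 28), (37, 37, 27), (10, 30, 12), (52, 50, 30)]
print(evaluate_af(runs))
print(evaluate_af(runs, tolerance=2))
```

The third run shows why speed should not be read in isolation: it takes the fewest steps because it stops at the wrong position. All three figures also inherit whatever subjectivity went into defining the true in-focus position.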

Conclusions
AF is an indispensable feature for DSLRs, compact digital cameras, and smartphone cameras. A good AF system should be fast, accurate, and adaptable to most scenes. Its performance depends not only on the FMF, search algorithm, and selected focus window, but also on the motor performance, properties of the optical lens, and processing capability. It is a feedback control system that requires optimization. In this paper, we mainly reviewed contrast-based algorithms and AF evaluation methodologies from the early years to the state of the art. It is clear that none of the reported algorithms can adapt to all scenes under a variety of illumination conditions, noise levels, textures, and subject motions. Most commercial systems adopt a gradient-based FMF and hill-climbing search with some modifications, which can adapt to most cases. In addition, we also reviewed AF for thermal infrared cameras, which is more challenging than AF for digital cameras. Although current commercial AF cameras have reached a satisfactory level, future AF systems could be faster, more accurate, and more robust with the advent of new algorithms such as learning-based methods.