A Review of Bioinspired Vision Sensors and Their Applications

Bioinspired vision sensors have become very attractive in recent years because of their inherent redundancy suppression, integrated processing, fast sensing capability, and wide dynamic range.


Introduction
Until recently, conventional vision systems, such as those for object tracking, object recognition, 3D reconstruction, simultaneous localization and mapping (SLAM), and navigation, have been realized using charge-coupled device (CCD) (1,2) or complementary metal-oxide-semiconductor (CMOS) (3,4) image sensors together with digital signal processing units that execute computer vision algorithms. These devices obtain visual information in the form of time-quantized "snapshots" recorded at a predetermined frame rate. During the transition from frame to frame, a problem similar to the undersampling phenomenon may arise. (5) This shortcoming may be tolerable for a human observer, but artificial vision systems that require real-time processing, such as those for autonomous robot navigation or high-speed control, may fail as a consequence. Another problem of frame-based visual information acquisition is redundancy. Each recorded frame carries the information from all pixels, regardless of whether this information has changed since the last frame. This approach leads to a high degree of redundancy in the acquired image data. Acquiring and processing these unnecessary data wastes resources and leads to increased channel bandwidth requirements, high transmission power dissipation, and increased memory size.
In contrast, biological sensing systems operate in a different way. In nature, including in human beings, the sensing systems of organisms are driven by asynchronous events, and the information is processed hierarchically and in parallel in a massive neural network. (6) This idea has stimulated numerous studies to understand the relationship between the human sensory and central nervous systems, and to apply the knowledge of computational neuroscience to construct intelligent machines. It also gave rise to a new field called neuromorphic engineering in the late 1980s. (7,8) In the area of neuromorphic vision, various sensors have been developed over the past two decades, including temporal contrast vision sensors, gradient-based sensors, edge-orientation sensitive sensors, and optical flow sensors. (9)(10)(11) However, very few have thus far been used in practical applications. Many lack technical completeness because of circuit complexity, large pixel area, or high noise level, which prevents realistic application. (5) More recently, however, there have been many improvements in vision sensors based on biological principles in terms of performance and practicality. These sensors include dynamic vision sensors (DVSs), (12,13) asynchronous time-based image sensors (ATISs), (14)(15)(16) and very recently developed dynamic and active pixel vision sensors (DAVISs). (17,18) In this paper, we review bioinspired vision sensors and their applications in the computer vision and robotics fields. The paper is organized as follows. In § 2, the biological vision system and bioinspired vision sensors are reviewed. In § 3, various applications and algorithms based on the bioinspired vision sensors are reviewed. In § 4, conclusions are given.

Biological vision system
Prior to reviewing bioinspired vision sensors, it is necessary to look at biological vision systems, especially the human retinal system. The human retinal system is composed of photoreceptors, bipolar cells, and ganglion cells. (19,20) Within the retina, photoreceptors are connected to horizontal cells, which are then connected to bipolar cells. Once the light enters the eye and passes through the eye lens, it is detected by photoreceptors that convert light into electrical pulses. These pulses are relayed through bipolar and ganglion cells into the optic nerve fibers. Vision is realized after the visual cortex recognizes the electrical pulses. A schematic drawing of the retina network is illustrated in Fig. 1.
The translation of light into visual information starts from two types of ganglion cells, X- and Y-cells. X-cells are distributed along the Parvo-cellular pathway, which occupies 80% of the nerve fibers, and are concentrated in the fovea. The Parvo-cellular pathway usually carries color information, spatial details, and patterns. For this reason, the Parvo-cellular system is often referred to as the biological "what" system. On the other hand, Y-cells are distributed along the Magno-cellular pathway, which occupies only 10% of the nerve fibers. The Magno-cellular pathway usually carries visual information related to changes, including the detection of the movement, distance, and speed of an object. Owing to this, the Magno-cellular system is often referred to as the biological "where" system.

DVS
Conventional frame-based image sensors suffer from data redundancy and data-processing delay problems. There have been many attempts to resolve these problems. Several event-based (frame-free) temporal contrast vision sensors have been reported in recent years. (21,22) However, the sensor proposed by Kramer (21) had low contrast sensitivity, whereas that proposed by Zaghloul and Boahen (22) suffered from high fixed-pattern noise (FPN). Lichtsteiner et al. presented the first practical sensor of this kind, the DVS, which mimics the function of Y-cells in order to sense the dynamic information of the scene. (12,13) The sensor is "event-driven" instead of clock-driven, and similarly to its biological model, it responds to "natural" events occurring in the scene it observes. Each pixel autonomously responds to relative changes in intensity at a microsecond temporal resolution. Each pixel asynchronously sends out an ON event if its log-compressed light intensity increases by a fixed amount and an OFF event if it decreases by a fixed amount. In this way, information is continuously transmitted and processed, and communication bandwidth is used only by active pixels. This type of asynchronous event-based data format is called the address event representation (AER) protocol and was introduced by Sivilotti and Mahowald to model the transmission of neural information within biological systems. (23,24) The sensor has a low FPN (2.1%), a low power consumption (24 mW), a small pixel array size (128 × 128), a low latency (15 μs), and a high dynamic range (120 dB). The DVS is the first commercially available product belonging to the neuromorphic sensor class.
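To make the AER idea concrete, the following minimal Python sketch packs one address event from a 128 × 128 array into a single integer word and unpacks it again. The bit layout (7 bits for each coordinate and 1 bit for the ON/OFF polarity) and the names `AddressEvent`, `encode_address`, and `decode_address` are assumptions chosen for illustration; they are not taken from the DVS papers or from any actual chip interface.

```python
from typing import NamedTuple


class AddressEvent(NamedTuple):
    x: int         # pixel column, 0..127
    y: int         # pixel row, 0..127
    polarity: int  # 1 = ON (brighter), 0 = OFF (darker)
    t_us: int      # timestamp in microseconds


ADDR_BITS = 7  # assumed layout: 2**7 = 128 pixels per side


def encode_address(ev: AddressEvent) -> int:
    """Pack (y, x, polarity) into a 15-bit address word (hypothetical layout)."""
    return (ev.y << (ADDR_BITS + 1)) | (ev.x << 1) | ev.polarity


def decode_address(word: int, t_us: int) -> AddressEvent:
    """Inverse of encode_address; the timestamp travels alongside the word."""
    mask = (1 << ADDR_BITS) - 1
    polarity = word & 1
    x = (word >> 1) & mask
    y = (word >> (ADDR_BITS + 1)) & mask
    return AddressEvent(x, y, polarity, t_us)


ev = AddressEvent(x=5, y=100, polarity=1, t_us=1234)
word = encode_address(ev)
restored = decode_address(word, ev.t_us)
```

Only pixels that actually produce an event place a word on the bus, which is why the bandwidth scales with scene activity rather than with the array size.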
The picture and design of the DVS are shown in Fig. 2. (12) The DVS pixel is composed of a fast logarithmic photoreceptor, a differencing circuit, and two comparators. In the logarithmic photoreceptor circuit, the photocurrent of a photodiode is sourced by a saturated N-channel metal-oxide-semiconductor field-effect transistor (MOSFET). The gate of the N-channel MOSFET is connected to the output of an inverting amplifier, which has the structure of a cascaded common source amplifier. Owing to this transimpedance configuration, the photocurrent is logarithmically converted to the voltage V_P. In the differencing circuit, two capacitors are integrated to amplify the signal. In addition, a reset switch is connected between the input and output (V_diff) of the inverting amplifier to remove the DC mismatch. The comparators determine whether an event is ON or OFF from the level of V_diff.
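The behavior of a single pixel (logarithmic conversion, differencing against the level stored at the last event, and comparison against ON and OFF thresholds) can be approximated in software. The sketch below is a simplified discrete-time model, not a circuit simulation; the function name `dvs_pixel_events` and the contrast threshold of 0.15 are assumptions made for illustration.

```python
import math


def dvs_pixel_events(intensities, threshold=0.15):
    """Simplified model of one DVS pixel.

    intensities: positive brightness samples of one pixel over time.
    Returns a list of (sample_index, polarity), polarity +1 = ON, -1 = OFF.
    """
    events = []
    ref = math.log(intensities[0])  # log level memorized at the last event
    for i, value in enumerate(intensities[1:], start=1):
        diff = math.log(value) - ref
        # A large step between two samples yields several events; in the
        # real pixel these would be spread out in continuous time.
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1
            events.append((i, polarity))
            ref += polarity * threshold  # models the reset of the differencer
            diff = math.log(value) - ref
    return events


static_scene = dvs_pixel_events([100, 100, 100, 100])
changing_scene = dvs_pixel_events([100, 100, 200, 200, 50])
```

In this model a static input produces no events at all, while doubling the intensity produces the same number of ON events regardless of the absolute brightness, which reflects the response to relative changes described above.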

ATIS
In addition to the invention of the DVS, consideration has been given to creating a new type of sensor that combines the "where" and "what" systems. The so-called ATIS brings these systems together using numerous bioinspired approaches such as event-based imaging. (14)(15)(16) This sensor is based on an array of fully autonomous pixels, each containing an event-based change detector (CD) and a pulse width modulation (PWM)-based exposure measurement (EM) unit. The EM unit of a pixel is initiated locally when the CD of that pixel detects a change in brightness in its field of view. The sensor outputs the temporal contrast event data and the absolute intensity of each event. The sensor has a very low FPN (0.25%), a low power consumption (< 175 mW), a reasonable pixel array size (304 × 240), a low latency (4 μs), and a high dynamic range (125 dB). The ATIS has not yet been commercialized.
The composition and operation of an ATIS pixel are illustrated in Fig. 3. The structure of the CD is similar to that of the DVS pixel. The EM circuit is composed of a photodiode, a capacitor, a P-channel MOSFET switch, and a comparator. The reset signal, generated by the change detector during ON events, is applied to the gate of the P-channel MOSFET switch. Therefore, the capacitor is charged when an ON event occurs. After the reset, the capacitor voltage (V_int) gradually decreases because the capacitor is discharged by the photocurrent generated in the photodiode. When V_int falls below the reference level of the comparator (V_ref), the comparator output C switches from low to high. The high level of the comparator output C is maintained until the next ON event occurs. Because the rate at which V_int decreases is proportional to the intensity of the light that illuminates the photodiode, the light intensity determines the time at which the comparator output C switches, and hence its pulse width.
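The time-based intensity encoding described above can be illustrated with a small calculation. Assuming an ideal capacitor discharged by a constant photocurrent, the capacitor voltage reaches the comparator reference after a time t = C (V_reset - V_ref) / I_ph, so the measured time is inversely proportional to the light intensity. The component values and function names in this Python sketch are hypothetical and are not taken from the ATIS papers.

```python
def integration_time(photocurrent_a, capacitance_f=10e-15,
                     v_reset=2.0, v_ref=1.0):
    """Time (s) for the capacitor voltage to fall from v_reset to v_ref.

    Assumes an ideal capacitor and a constant photocurrent
    (illustrative values only).
    """
    if photocurrent_a <= 0:
        raise ValueError("photocurrent must be positive")
    return capacitance_f * (v_reset - v_ref) / photocurrent_a


def relative_intensity(t_int_s, t_reference_s):
    """Intensity relative to a reference pixel: shorter time = brighter."""
    return t_reference_s / t_int_s


t_dim = integration_time(1e-12)     # 1 pA photocurrent
t_bright = integration_time(4e-12)  # 4 pA photocurrent
ratio = relative_intensity(t_bright, t_dim)
```

With these assumed values, a pixel receiving four times the photocurrent crosses the reference level in a quarter of the time, so a timing measurement alone is sufficient to recover the relative brightness.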

DAVIS
Very recently, the DAVIS has been proposed as another version of a combined biological "where" and "what" system. (17,18) The DAVIS combines the advantages of DVSs and active pixel sensors (APSs) at the pixel level. It outputs image frames through the synchronous APS pathway and simultaneously outputs events through the asynchronous DVS pathway. The shared photodiode (PD) and the small size of the APS circuit lead to a DAVIS pixel area that is 60% smaller than the ATIS pixel area. The sensor has a very low FPN (0.5%), a very low power consumption (< 14 mW), a reasonable pixel array size (240 × 180), a low latency (3 μs), and a reasonable dynamic range (130 dB for the DVS and 51 dB for the frame image sensor). The DAVIS is also a commercially available product.
A schematic of the DAVIS pixel is shown in Fig. 4. A four-transistor CMOS APS, which is used to detect the light intensity, is integrated with the conventional DVS circuit. Using the photocurrent generated from the photodiode in the DVS circuit, the APS can obtain the readout light intensity. The detected information on the light intensity is stored in the parasitic gate capacitance of MN2. MN5, connected between the DVS and APS circuits, is used to reduce the reset transient of the source voltage of MN1.
Table 1 Specifications of bioinspired vision sensors: DVS (12,13), ATIS (14)(15)(16), and DAVIS (17,18).

Summary of bioinspired vision sensors
We reviewed the biological vision system and three bioinspired vision sensors. Table 1 shows the specifications of the previously reviewed bioinspired DVS, ATIS, and DAVIS. The DVS is a bioinspired "where" system that responds to relative changes. The ATIS is a combination of bioinspired "where" and "what" systems that contains event-based CD and PWM-based EM units. Both the DVS and the ATIS are based on an asynchronous event-driven method, and the single pixel handles its own visual information individually and autonomously. The DAVIS is a combination of an asynchronous "where" system and a synchronous "what" system. It outputs image frames through the synchronous APS pathway and simultaneously outputs events through the asynchronous DVS pathway.
With respect to the commercialization of bioinspired vision sensors, the DVS and DAVIS are already commercialized, while the ATIS has not yet been commercialized. While the ATIS and DAVIS are both biological "where" and "what" systems, the ATIS has several disadvantages compared with the DAVIS. The power consumption and pixel size of the ATIS are larger than those of the DAVIS. Furthermore, the ATIS only provides intensity measurements of event-detected pixels, while the DAVIS provides intensity measurements of all pixels. These disadvantages may make the ATIS difficult to commercialize.
The reviewed bioinspired vision technologies have the potential to overcome the problems experienced in conventional vision-based systems, such as high power consumption and high computational load. Bioinspired vision sensors have already been applied to various computer vision and robotics applications. Detailed reviews will be presented in § 3. In the future, much more progress in bioinspired vision sensors is expected.

Applications of Bioinspired Vision Sensors
In this section, various applications of bioinspired vision sensors are reviewed. Since the emergence of bioinspired vision sensors, various applications using these sensors have been proposed for computer vision and robotics. To the best of our knowledge, there have not yet been any applications proposed based on the DAVIS camera. There are a few applications based on the ATIS camera, but these applications do not utilize the absolute intensity information of each event. In the following, we will regard these cases as DVS camera-based applications. The reviewed applications can be classified into six categories: visual tracking, detection and recognition, SLAM, visual reconstruction, stereo matching, and control.

Visual tracking
Litzenberger et al. proposed an object tracking algorithm using a single DVS. (26) The proposed algorithm was inspired by a conventional frame-based mean-shift approach and implements continuous clustering of address events and tracking of clusters. (27) Figure 5 presents the people-tracking result obtained using the proposed method.
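The idea of continuously clustering address events can be sketched as follows. Each incoming event is assigned to the nearest existing cluster within a fixed radius, and that cluster's center is shifted slightly toward the event; an event that is too far from every cluster seeds a new one. This is only a minimal illustration in the spirit of the description above, not the algorithm of ref. 26; the class and parameter names (`Cluster`, `radius`, `mix`) are assumptions.

```python
import math


class Cluster:
    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)
        self.n_events = 1


def update_clusters(clusters, event_xy, radius=10.0, mix=0.1):
    """Assign one event to a cluster and update that cluster in place.

    clusters: list of Cluster objects (modified in place).
    event_xy: (x, y) pixel address of the event.
    Returns the cluster that received the event.
    """
    ex, ey = event_xy
    best, best_dist = None, radius
    for cluster in clusters:
        dist = math.hypot(ex - cluster.x, ey - cluster.y)
        if dist <= best_dist:
            best, best_dist = cluster, dist
    if best is None:
        best = Cluster(ex, ey)          # no cluster nearby: seed a new one
        clusters.append(best)
    else:
        best.x += mix * (ex - best.x)   # move the center toward the event
        best.y += mix * (ey - best.y)
        best.n_events += 1
    return best


clusters = []
for xy in [(20, 20), (22, 20), (100, 100)]:
    update_clusters(clusters, xy)
```

Because the update is performed per event, the cluster positions are refreshed at the event rate of the sensor rather than at a frame rate.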
Litzenberger et al. presented a vehicle speed estimation algorithm for traffic monitoring based on a single DVS. (28) The proposed method is able to measure the velocities of vehicles in the range of 20 to 300 km/h on up to four lanes simultaneously. Bauer et al. further improved this system and proposed three different algorithms for vehicle speed estimation based on DVS output data stream processing. (29)
Benosman et al. proposed an optical flow algorithm for visual tracking using a single DVS. (30) The optical flow estimate is obtained by adapting the differential flow brightness consistency constraint to an event-based domain. (31) They further improved the optical flow estimation algorithm based on a local differential approach on the surface defined by events. (32)
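One way to picture a local differential approach on the surface defined by events is to fit a plane t = a·x + b·y + c to the timestamps of events in a small neighborhood. For an edge moving at constant velocity, the gradient (a, b) of this plane points along the motion and has magnitude 1/speed, so the normal flow is (a, b) / (a² + b²). The following pure-Python least-squares sketch rests on that assumption; it is not the implementation of ref. 32, and the function name is hypothetical.

```python
def fit_normal_flow(events):
    """Estimate normal flow from a local plane fit to event timestamps.

    events: list of (x, y, t) tuples from a small neighborhood.
    Returns (vx, vy) in pixels per time unit, or None if the fit is
    degenerate (too few events, collinear pixels, or a flat surface).
    """
    n = len(events)
    if n < 3:
        return None
    mx = sum(e[0] for e in events) / n
    my = sum(e[1] for e in events) / n
    mt = sum(e[2] for e in events) / n
    sxx = syy = sxy = sxt = syt = 0.0
    for x, y, t in events:
        dx, dy, dt = x - mx, y - my, t - mt
        sxx += dx * dx
        syy += dy * dy
        sxy += dx * dy
        sxt += dx * dt
        syt += dy * dt
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None
    # Solve the 2x2 normal equations for the plane slopes a = dt/dx, b = dt/dy.
    a = (sxt * syy - syt * sxy) / det
    b = (syt * sxx - sxt * sxy) / det
    grad_sq = a * a + b * b
    if grad_sq < 1e-12:
        return None
    return a / grad_sq, b / grad_sq


# Synthetic vertical edge sweeping in +x at 2 pixels per time unit.
edge_events = [(x, y, x / 2.0) for x in range(5) for y in range(5)]
flow = fit_normal_flow(edge_events)
```

A plane fit of this kind recovers only the component of motion perpendicular to the edge, which is the usual aperture limitation of local flow estimates.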

Detection and recognition
Humenberger et al. proposed a stereo DVS-based fall detection application using a neural network for elderly people in a home environment. (33) Using the stereo DVS camera system proposed in their previous work, (34) a meaningful feature vector is calculated. The detailed explanation of feature extraction for fall detection can be found in another previous work of theirs. (35) The neural network is used to classify the actual event as fall or non-fall. Figure 6 shows an example of a possible fall scenario.
Pérez-Carrasco et al. proposed a texture recognition hardware application based on convolutional neural networks. (36) A monocular DVS camera and a 2D convolution chip combined with a host PC were used as the hardware setup. The proposed texture recognition method adapted the conventional frame-based Manjunath's method to a frameless event-based sensing system. (37) The experiments showed that the proposed system can complete recognition before an equivalent conventional frame-based system could capture and transmit the video, while maintaining a similar recognition rate.

SLAM
Weikersdorfer et al. proposed an upward-looking DVS-based 2D SLAM algorithm in an indoor environment. (38) The previously proposed event-based feature tracking algorithm (39) was used for landmark tracking. The experiments showed the feasibility of the proposed method in a small indoor environment. Figure 7 shows an example of a SLAM result obtained using the algorithm.
Mueggler et al. proposed an onboard quadrotor 6-degree-of-freedom (DOF) pose estimation system using a DVS camera that is able to track high-speed maneuvers such as flips. (40) Their system starts by integrating events until a known artificial template is detected, and it then tracks the borders of the template by updating both line segments and the pose of the flying robot on an event-by-event basis. They demonstrated robust motion tracking during quadrotor flips with angular speeds up to 1200 °/s.
Because a DVS camera does not provide absolute brightness values, a few attempts have been made to combine an event camera with an extra full frame camera. (41,42) Weikersdorfer et al. developed an event-based 3D SLAM system combining a DVS with an RGB-D camera. (41) The experiments showed that the proposed event-based 3D SLAM algorithm was twenty times faster than the conventional KinectFusion-based 3D SLAM. (43) Similarly, Censi and Scaramuzza presented a low-latency event-based visual odometry system combining a DVS with a normal CMOS camera. (42) The two sensors were automatically spatiotemporally calibrated on the basis of the computation of similarity statistics. Experiments showed that the rotation can be estimated with surprising accuracy, while the translation can be estimated only very noisily because it produces few events owing to a very small apparent motion.
Kim et al. showed that an event stream with no additional sensing can be used to build a persistent and high-quality mosaic of a scene while a hand-held DVS camera is in rotational motion. (44) Their method relies on two parallel probabilistic filters to jointly track the global rotational motion of a camera and estimate the gradients of the scene around it; the gradient map is then upgraded to a full image-like gray level mosaic with super-resolution and high dynamic range properties. Figure 8 shows a photograph of the experimental environment, DVS camera output, estimated gradient map, and reconstructed image-like mosaic of the scene.

Visual reconstruction
Carneiro et al. proposed an algorithm for the 3D reconstruction of a moving object using a setup of more than two fixed DVS cameras. (45) A camera calibration method proposed by Benosman et al. is used for multiple DVS calibration. (46) After the calibration, geometrical and time constraints for matching events and Bayesian inference-based matching selection are used for the 3D reconstruction. Figure 9 shows examples of wireframe cube, human hand, and human face reconstruction results obtained using a six-DVS camera setup.
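The combination of geometrical and time constraints for matching events can be sketched for the simplest case of two rectified cameras, where the epipolar constraint reduces to "same image row". In the Python sketch below, an event from the first camera is matched to the unused event from the second camera that is closest in time, provided that both share polarity, lie on nearly the same row, and occur within a short time window. A greedy nearest-in-time choice is swapped in here for the Bayesian matching selection of ref. 45, and all names and thresholds are assumptions.

```python
def match_events(left, right, max_dt_us=500, max_row_diff=1):
    """Greedy event matching between two rectified event cameras.

    left, right: lists of (t_us, x, y, polarity) tuples.
    Returns a list of (left_event, right_event) pairs.
    """
    matches = []
    used = set()  # indices of right events that are already matched
    for lev in left:
        best_j, best_dt = None, max_dt_us + 1
        for j, rev in enumerate(right):
            if j in used:
                continue
            dt = abs(lev[0] - rev[0])
            if dt > max_dt_us:                       # time constraint
                continue
            if lev[3] != rev[3]:                     # same polarity
                continue
            if abs(lev[2] - rev[2]) > max_row_diff:  # epipolar (row) constraint
                continue
            if dt < best_dt:
                best_j, best_dt = j, dt
        if best_j is not None:
            used.add(best_j)
            matches.append((lev, right[best_j]))
    return matches


left_events = [(1000, 40, 10, 1), (5000, 60, 20, 0)]
right_events = [(1100, 32, 10, 1), (1050, 30, 50, 1), (9000, 55, 20, 0)]
pairs = match_events(left_events, right_events)
disparities = [l[1] - r[1] for l, r in pairs]
```

Each accepted pair yields a disparity that, with calibrated cameras, can be triangulated into a 3D point; extending the sketch to more than two cameras or to non-rectified geometry would require the full epipolar relations obtained from calibration.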

Stereo matching
Various stereo matching algorithms for stereo DVS camera setups have been proposed. (47)(48)(49)(50) Sulzbachner et al. proposed the address event frame, a collection of events over a defined time period, for use in correlation-based stereo matching algorithms. (47) Rogister et al. proposed an asynchronous event-based binocular stereo matching algorithm combining epipolar geometry and timing information. (48) Taking advantage of the high temporal resolution and the epipolar geometry constraint, they provided a truly event-based approach for real-time stereo matching. Kogler et al. proposed area-based and feature-based stereo matching algorithms for stereo DVS. (49) They also proposed an event time-based stereo matching algorithm and showed that the time-based algorithm outperforms the previously proposed area-based and feature-based methods. (50)

Control system
Delbruck and Lichtsteiner proposed a soccer goalie robot as an example of the application of a hybrid neuromorphic-procedural system consisting of a monocular DVS camera, a computer, and a servo motor controller. (51) Moving balls that approach the goal are tracked by an event-driven cluster tracker algorithm that was proposed by Litzenberger et al. (26) The ball position and velocity are used to control the servo motor. The goalie robot can block balls even when they are low-contrast white-on-gray objects and there are many background distractors.
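How a tracked ball position and velocity could be turned into a servo command can be illustrated with a linear extrapolation to the goal line. The geometry (goal line at y = 0 in pixel coordinates, ball approaching from larger y), the normalization to a 128-pixel-wide field of view, and both function names are assumptions made for this sketch; the cited system may compute its command differently.

```python
def predict_goal_crossing(x, y, vx, vy, goal_y=0.0):
    """Extrapolate the ball linearly to the goal line.

    Returns the x coordinate at which the ball crosses y = goal_y,
    or None if the ball is not approaching the goal.
    """
    if vy >= 0 or y <= goal_y:
        return None
    time_to_goal = (goal_y - y) / vy
    return x + vx * time_to_goal


def servo_command(target_x, field_width=128.0):
    """Map a target x position to a normalized servo command in [0, 1]."""
    clamped = min(max(target_x, 0.0), field_width)
    return clamped / field_width


crossing = predict_goal_crossing(x=64.0, y=100.0, vx=10.0, vy=-50.0)
command = servo_command(crossing)
```

Since the tracker updates its estimate with every event, such a prediction can be refreshed far more often than with a frame-based camera, which is what makes low-latency blocking feasible.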
Conradt et al. proposed a pencil balancing control system in which a pair of spike-based silicon retina DVSs provide fast visual feedback. (52) This application requires very fast feedback control and thus demonstrates the markedly high measurement rate and low latency of the event camera. Each DVS updates its estimate of the pencil location, and a linear proportional-derivative (PD) controller is used to keep the pencil balanced upright. Figure 10 shows the proposed pencil balancing control system hardware.

Table 2 Summary of the bioinspired vision sensor-based applications reviewed in this paper.
| Category | Application | Sensor setup |
| --- | --- | --- |
| Visual tracking | Object tracking (26) | Monocular DVS |
| | Vehicle speed estimation (28,29) | Monocular DVS |
| | Optical flow estimation (30,32) | Monocular DVS |
| Detection and recognition | Fall detection for elderly people (33) | Binocular DVS |
| | Texture recognition (36) | Monocular DVS |
| SLAM | 2D SLAM (38) | Monocular DVS |
| | 6-DOF pose estimation (40) | Monocular DVS |
| | 3D SLAM (41) | Monocular DVS + RGB-D camera |
| | Visual odometry (42) | Monocular DVS + CMOS camera |
| Visual reconstruction | Image reconstruction (44) | Monocular DVS |
| | 3D reconstruction (45) | N-ocular DVS |
| Stereo matching | Stereo matching algorithm (47)(48)(49)(50) | Binocular DVS |
| Control system | Soccer goalie robot (51) | Monocular DVS |
| | Pencil balancer (52) | Binocular DVS |

Table 2 shows a summary of the bioinspired vision sensor-based applications reviewed in this paper. DVS cameras have several clear advantages over conventional frame-based sensors: high temporal resolution, a low computational load owing to inherent data compression, low power consumption, and a wide dynamic range. In the fields of visual tracking and control using fast visual feedback, a DVS camera can be a good alternative to the conventional frame-based camera considering the speed and computational load. However, the DVS camera is limited in that it does not provide the absolute pixel intensity of the scene and produces no output data when the scene is static and the camera is fixed. In recognition-related fields, this can be a severe drawback because the feature extraction for classification can be strictly limited. Additionally, in SLAM and 3D reconstruction-related fields, feature matching and loop closure detection can be difficult with only event data. The situation is the same for ATIS cameras because a descriptor of a visually salient feature usually requires neighboring pixel information. To solve this problem, several attempts have been made to combine an event camera with an extra full frame camera. Although these are certainly possible practical solutions, they require an external camera system alongside the event camera, which increases the cost. In our view, the DAVIS camera can be a solution for this problem. The combined static and dynamic output of the DAVIS makes it promising for various applications. The DVS output can be used for tracking or segmenting moving objects, while the image frames can be used for recognition, feature extraction, and classification. Progress in computer vision and real-time robotics using these bioinspired vision sensors is expected.

Conclusions
In this paper, we reviewed bioinspired vision sensors and their applications in the fields of computer vision and robotics. The reviewed bioinspired vision sensors have several advantages over conventional vision sensors, including inherent redundancy suppression, efficient in-sensor processing, fast sensing capability, wide dynamic range, and low power consumption. Until now, most of the applications and algorithms have been based on the DVS camera. The visual tracking and control-related algorithms and applications are successful cases. However, in SLAM, 3D reconstruction, and recognition-related fields, the DVS camera-based system has limitations because it does not provide any pixel intensity information. The DAVIS camera can be a good solution for this problem. In the future, much more progress in bioinspired vision sensors and their various applications in many different fields is expected.