Drug Verification System Based on Deep Learning Multiscale Rotating Rectangle Detector and Feature Embedding

This research aims to develop an automatic drug image verification function. Our verification system is composed of two stages. The first stage is an arbitrarily oriented object detector, which is mainly based on a deep residual network and a feature pyramid network (FPN). The detector predicts rotation bounding boxes for drugs using the multiscale feature maps generated by the FPN. The rotation bounding boxes are then axis-aligned, and the drug image is cropped according to the axis-aligned bounding box. The second stage is a feature matcher based on a feature embedding network. The embedding feature extracted by the feature embedding network is combined with the geometric feature obtained by the detector to determine whether the drug in the image belongs to the category specified by the user. The database used in this research was created by imaging drugs provided by domestic local medical centers. Our verification system achieved a false positive rate (FPR) of 0.047% in verification tasks covering drugs of 21 categories.


Introduction
The Institute of Medicine has reported that medication errors are the single most common type of error in health care, representing 19% of all adverse events and accounting for over 7000 deaths annually. (1) Intensive care units (ICUs) in hospitals face great challenges resulting from medication errors, (2) which account for 78% of serious medical errors. (3) Even minor medication errors can cause severe disability. Moreover, patients in ICUs are often in a fragile condition and are less able to tolerate medication errors. The drugs used in ICUs are more diverse and complex than those in general wards, and the potential risks that medication errors pose to patients' lives are greater, making the prevention of medication errors a very important task.
The fatigue and stress of nurses undertaking long shifts in ICUs are potential contributors to medication errors, since confusion and misunderstanding can arise at every stage of the drug use process. Possible causes of medication errors include mislabeling due to a similar appearance, similar drug names, the same drug being composed of different formulas, multiple abbreviations of the same concept, confusing abbreviations and symbols, different doses of the same drug, and confusing labeling. (4) Even experts in drug administration may misidentify drugs due to fatigue. Therefore, additional mechanisms to assist medical personnel in the administration of drugs will be a future trend; combined with the rapid development of AI technology and related hardware, they will make fast and accurate drug identification possible with AI assistance. Previous research on drug identification often required special mechanical equipment or additional measures, and there was no uniform, effective solution for identifying multiple drug categories. Because the problems of different types of packaging, surfaces, placement positions, and angles of drugs must be solved simultaneously, previous approaches faced certain difficulties in practical use. Thus, we propose a system with an easy-to-use, efficient, and accurate drug verification function, thereby avoiding the potential risks of medication errors and the irreparable harm they may cause.

Traditional methods
Some related studies on drug identification have been reported. In 2015, Yang et al. (5) used the barcode attached to the bottle body to identify the category of a drug. In their system, a scanner was used to scan the barcode, which had to face the scanner. Che et al. (6) calculated a global threshold to binarize a gray-scale image as a feature vector, and then compared its similarity with image feature vectors in a database through template matching; however, their method ignored the problem of the placement angle. Zhang et al. (7) proposed a method of extracting the drug region of interest (ROI) based on edge detection, obtained the diameter, height, and neck position of the bottle by applying predefined rules, and then used a trained Bayesian classifier to identify the drug category. In 2016, Gong et al. (8) proposed continuously capturing images of label surfaces on a conveyor belt, combining the images to correct boundary distortion, and then using optical character recognition (OCR) and the scale-invariant feature transform (SIFT) to match the template image and the input image. However, this method required that the label of the drug on the conveyor belt face the camera and that the bottle be placed upright when the conveyor belt started. In 2017, Xu et al. (9) proposed registering a complete and clear image of a drug in advance, which was computationally unrolled into a flat template image, and segmenting the label image with a clustering algorithm. Their algorithm extracted the feature matrix of a gray-scale image and compared it with the database image features by cosine similarity.
In the above methods, the shape, text, color, or texture on the label was used to identify drugs, but these methods usually had certain limitations; for example, the number of recognizable categories could not be too large, or the accuracy was sensitive to the light source and rotation angle.

Deep learning methods
In 2018, Lee et al. (10) proposed the use of VGG-19 deep learning technology for drug recognition. The input images were obtained by photographing handheld drugs with mobile phones. Eleven categories of pill boxes were successfully identified, all of which were flat pill boxes, so there was no need to handle multiple, differently placed feature surfaces. In 2020, Ting et al. (11) proposed the use of YOLOv2 deep learning technology to identify blister packs of drugs. They developed a single-stage detection system that included positioning and recognition, training a model for each side of the blister pack; the model trained with the text and logo features on the back of the blister pack performed better. These applications of deep learning to drug recognition currently handle only specific types of packaging or containers, relying directly on the deep network to learn robustness to changes in the rotation angle.

Proposed System
The proposed system architecture is shown in Fig. 1. The system consists of two stages: a multiscale rotating rectangular drug detection system based on a deep residual network followed by a drug matching system based on feature embedding. In the drug detection system, the input image is passed through a deep residual network to obtain feature maps of different scales, and a feature pyramid network (FPN) is used to merge features to build a multi-semantic feature map. For each feature map, a confidence module and regression module are used to obtain the confidence and spatial location of all rotation bounding boxes, and then the results of all scale feature maps are input to a non-maximum suppression module to obtain a rotation bounding box with high confidence. After obtaining all the rotation bounding boxes, axis alignment is performed to obtain the image information contained in the axis-aligned bounding boxes.
In the drug matching system, a trained fully convolutional feature embedding network is used to embed both the image output of the drug detection system and the image in the database. The geometric features and the embedded features are then compared separately, and finally the results and the expected category are input to the decision module to determine whether the input drug image passes verification.

Feature extraction backbone
An object detection system using deep learning needs a backbone with multiscale feature maps as its basic feature extractor, and to provide multiple scales, the network must be sufficiently deep. The backbones used are ResNet-50 and ResNet-101, (12) both of which are built from two basic shortcut connection blocks, whose architectures are shown in Fig. 2.
The architectures of ResNet-50 and ResNet-101 are shown in Table 1. The shortcut connection block structure allows the deep residual network to reach the required depth while mitigating the vanishing gradient problem that may occur in deep networks.
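The effect of a shortcut connection can be illustrated with a toy sketch (a minimal NumPy stand-in for a real convolutional residual block, not the paper's implementation): the identity path is added to the transformed branch, so information and gradients pass through the block even when the transformed branch contributes little.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, weight):
    """Toy residual block: a linear transform plus an identity shortcut.

    The addition lets the input (and its gradient) bypass the branch,
    which is what mitigates vanishing gradients in very deep networks.
    """
    fx = relu(x @ weight)      # the residual branch F(x)
    return relu(fx + x)        # shortcut connection: F(x) + x

x = np.ones((1, 4))
w = np.zeros((4, 4))           # even with a zero branch, the input passes through
y = residual_block(x, w)
print(y)
```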

Feature pyramid network
The architecture of the FPN (13) in the proposed system is shown in Fig. 3. The FPN is used as a feature fusion method after the backbone to generate multi-semantic feature maps. The FPN contains a bottom-up pathway, lateral connections, and a top-down pathway. The bottom-up pathway is composed of the backbone, and the final feature map of each stage is used as the lateral transfer feature map.
The lateral connection uses a 1 × 1 convolution layer, the main purpose of which is to provide the number of channels required to combine with deeper features, and at the same time combine the information of each channel to obtain new features.
In the top-down pathway, the deeper feature map is upsampled by a factor of 2 through bilinear interpolation, and element-wise addition is then performed with the shallower lateral connection result. In addition, to detect larger objects, it is necessary to obtain feature maps with a smaller spatial size, because their receptive field is relatively large; a stride-2 convolution is used for this downsampling. The final feature maps obtained from the backbone are C3, C4, and C5, which are used to generate P3, P4, and P5; P6 and P7 are then generated through stride-2 convolutions. Feature map C2 is not used because its large spatial size would make the computational cost enormous.
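The upsample-and-add merge step can be sketched in NumPy on single-channel toy feature maps. Nearest-neighbor upsampling replaces the paper's bilinear interpolation, and a scalar scale stands in for the 1 × 1 lateral convolution, to keep the example self-contained.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling (the paper uses bilinear
    interpolation; nearest-neighbor keeps this sketch short)."""
    return np.kron(f, np.ones((2, 2)))

def lateral(c, proj):
    """Stand-in for the 1x1 lateral convolution, collapsed to a scalar
    scale for a single-channel toy feature map."""
    return c * proj

# Toy backbone outputs: C4 has half the spatial size of C3.
c3 = np.arange(16, dtype=float).reshape(4, 4)
c4 = np.arange(4, dtype=float).reshape(2, 2)

p4 = lateral(c4, 1.0)
p3 = lateral(c3, 1.0) + upsample2x(p4)   # top-down pathway: upsample, then add
print(p3.shape)  # (4, 4)
```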

Anchor box
At each spatial position of the output feature maps, anchor boxes with different scales, aspect ratios, and angles are generated to cover objects of different shapes and placement angles. The areas of the anchor boxes are defined as {32², 64², 128², 256², 512²}, and these sizes are assigned to the feature maps P3 to P7, respectively.
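Anchor enumeration can be sketched as the Cartesian product of areas, aspect ratios, and angles. The ratio and angle sets below are illustrative assumptions for the sketch; the paper specifies only the five anchor areas.

```python
from itertools import product
from math import sqrt

def make_rotated_anchors(areas, ratios, angles):
    """Enumerate (w, h, theta) rotated anchor templates.

    For each (area, ratio) pair we solve w*h = area and w/h = ratio,
    then pair the result with every rotation angle.
    """
    anchors = []
    for area, ratio, theta in product(areas, ratios, angles):
        w = sqrt(area * ratio)
        h = sqrt(area / ratio)
        anchors.append((w, h, theta))
    return anchors

areas = [32**2, 64**2, 128**2, 256**2, 512**2]
ratios = [0.5, 1.0, 2.0]            # assumed for illustration
angles = [-60, -30, 0, 30, 60, 90]  # assumed for illustration
A = len(make_rotated_anchors(areas, ratios, angles))
print(A)  # 5 areas * 3 ratios * 6 angles = 90 templates
```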

Regression module and confidence module
To the output feature map of each layer of the feature pyramid (P3, P4, P5, P6, P7), we attach a regression module and a confidence module. The architectures of these two modules are shown in Fig. 4. The regression module generates a bounding box spatial offset for each of the A anchors at every position on the feature map. Thus, the number of output channels is 5A, the total length of the rotation bounding box vectors (x, y, w, h, θ) corresponding to all the anchor boxes. Similarly, the confidence module generates the confidence of each category for all A anchors at every position of the feature maps of different spatial sizes. The only difference between the confidence module and the regression module is that the number of channels of its last 3 × 3 convolution layer is KA, where K is the total number of categories and KA is the total length of the confidence vector for the categories of all the rotation bounding boxes.

Post-processing
The spatial positions and confidences of the prediction bounding boxes generated by the regression and confidence modules at each layer of the feature pyramid must be post-processed to obtain the final output. First, the box predicted by the model is an offset relative to its anchor box, so it must be converted to an absolute value. With the model output t = (t_x, t_y, t_w, t_h, t_θ), the anchor box a = (a_x, a_y, a_w, a_h, a_θ), and the actual value b = (b_x, b_y, b_w, b_h, b_θ), the conversion is given by the following equations.
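The offset-to-absolute conversion can be sketched as follows, assuming the standard R-CNN-style box parameterization extended with an additive angle term, as in rotated detectors such as RRPN; the paper's exact equations may differ in details such as angle normalization.

```python
from math import exp

def decode(t, a):
    """Convert a predicted offset t = (tx, ty, tw, th, t_theta) into an
    absolute rotated box using its anchor a = (ax, ay, aw, ah, a_theta).

    Center offsets are scaled by the anchor size, width/height offsets
    are exponentiated, and the angle offset is additive (an assumption
    consistent with common rotated-box detectors).
    """
    tx, ty, tw, th, tth = t
    ax, ay, aw, ah, ath = a
    bx = aw * tx + ax
    by = ah * ty + ay
    bw = aw * exp(tw)
    bh = ah * exp(th)
    bth = ath + tth
    return (bx, by, bw, bh, bth)

anchor = (100.0, 100.0, 32.0, 32.0, 0.0)
offset = (0.0, 0.0, 0.0, 0.0, 30.0)
print(decode(offset, anchor))  # zero offsets recover the anchor; angle shifted by 30
```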
In actual situations, there may be multiple predicted bounding boxes containing the same object in the image, but only one is needed, so non-maximum suppression is used to delete the redundant boxes. The algorithm is shown in Fig. 5.
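Greedy non-maximum suppression can be sketched as below. For brevity the overlap test uses axis-aligned IoU; the actual system intersects rotated rectangles, which requires polygon clipping and is omitted here.

```python
def iou(b1, b2):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2)."""
    x1 = max(b1[0], b2[0]); y1 = max(b1[1], b2[1])
    x2 = min(b1[2], b2[2]); y2 = min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every
    remaining box that overlaps it by more than `thresh`; repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 and is suppressed
```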

Multitask loss function
Because there are multiple tasks to be performed, the multitask loss function is defined as

L(t, g, p, y) = L_reg(t, g) + L_con(p, y),

where t is the rotation bounding box predicted by the regression module, g is the label of the regression task, p is the confidence of each rotation bounding box predicted by the confidence module, and y is the label of the confidence task. The regression loss L_reg is defined as Eq. (7), which follows the IoU-smooth L1 loss. (14) The regression task involves maximizing the area of intersection between the bounding box and the ground truth.
Here, v is the label of the regression task expressed as an offset relative to the anchor box, and b is the prediction box converted into absolute values from the offset t.
The confidence loss is defined as Eq. (8), the focal loss, (15)

FL(p_t) = −α_t (1 − p_t)^γ log(p_t),

which reduces the loss of samples that are easy to classify (such as the background) and raises the relative weight of samples that are more difficult to classify (objects of interest), thereby addressing the problem of category imbalance.
Here, p t is the probability of correctly predicting that a drug belongs to the ground-truth class. γ is the modulation factor, and α t is the weighting factor.
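The down-weighting behavior of the focal loss can be checked numerically. The α_t and γ values below are the defaults from the focal loss paper; the values used in this system are not stated here.

```python
from math import log

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    As p_t -> 1 (an easy, well-classified sample) the modulating factor
    (1 - p_t)**gamma drives the loss toward zero; hard samples with
    small p_t keep a large loss.
    """
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)

easy = focal_loss(0.99)   # well-classified background: loss is tiny
hard = focal_loss(0.10)   # hard positive: loss stays large
print(easy < hard)  # True
```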

Feature embedding network
A very important part of the matching system is to obtain a global feature description of the image, which may be its color or texture. In our system, the feature description is obtained through a fully convolutional encoder network. The architecture of the encoder is shown in the left of Fig. 6, and feature extraction is performed through four identical and continuous CONV blocks. The architecture of a CONV block is shown in the right of Fig. 6.
After feature embedding, the features of the images of each category fall within a certain distance in the feature space, and the feature distribution of each category has a center point, called the prototype. The training algorithm of the feature embedding network is shown in Fig. 7. During training, two encoders that share weights are used. One encoder receives the support set, and the average of its outputs is used as the prototype. The other encoder receives the query set. The distances between the query-set features and the prototypes are input to the loss function, which adjusts them according to the labels. Through this training method, the loss function reduces the distance between image features of the same category and increases the distance between features of different categories.
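The prototype computation and training objective can be sketched in prototypical-network style, as suggested by the support/query/prototype setup above. The encoder is replaced by precomputed embeddings here; the real encoder is the CONV-block network of Fig. 6.

```python
import numpy as np

def prototypes(support_embeddings):
    """Per-class prototype = mean of the support-set embeddings.

    support_embeddings: dict mapping class id -> (n_shot, dim) array.
    """
    return {k: v.mean(axis=0) for k, v in support_embeddings.items()}

def proto_loss(query_emb, query_label, protos):
    """Negative log-probability of the true class under a softmax over
    negative squared Euclidean distances to the prototypes."""
    classes = sorted(protos)
    d = np.array([np.sum((query_emb - protos[k]) ** 2) for k in classes])
    log_p = -d - np.log(np.sum(np.exp(-d)))
    return -log_p[classes.index(query_label)]

support = {0: np.zeros((5, 8)), 1: np.ones((5, 8))}
protos = prototypes(support)
loss_near = proto_loss(np.zeros(8), 0, protos)  # query sits on its own prototype
loss_far = proto_loss(np.ones(8), 0, protos)    # query sits on the wrong prototype
print(loss_near < loss_far)  # True
```

Minimizing this loss pulls same-category features toward their prototype and pushes them away from the other prototypes, matching the behavior described above.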

Feature distance measurement
In the inference stage, we compare the embedded features and geometric features of the input image with those of the images in our database, so a definition of similarity or distance is required. The following equation is used to calculate the similarity of the embedded features:

S_emb(k) = exp(−d(f_φ(x), c_k)) / Σ_{n=1..N} exp(−d(f_φ(x), c_n)),

where k is the expected category, N is the total number of categories, d(x, y) is the function used to calculate the distance between features (we use the Euclidean distance), c_k is the prototype, and f_φ(x) is the embedded feature of the input image obtained by the encoder. The following equations are used to calculate the similarity of the geometric features:

S_long = min(l, l^k) / max(l, l^k),  S_short = min(s, s^k) / max(s, s^k),

where superscript k represents the feature of the prototype of the expected category, and l and s are the long and short sides of the detected rotation bounding box. The geometric features are compared long side with long side and short side with short side; each comparison produces its own similarity, and when the numerator is small and the denominator is large, the geometric similarity is low.
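The two similarity measures can be sketched as follows. The softmax-over-distances form for embedded features and the min/max ratio for side lengths are assumptions consistent with the description above (prototypical-network-style matching and "low similarity when the numerator is small and the denominator is large").

```python
import numpy as np

def embedded_similarity(f_x, protos, k):
    """Softmax over negative Euclidean distances to all N prototypes,
    evaluated at the expected category k."""
    d = np.array([np.linalg.norm(f_x - c) for c in protos])
    p = np.exp(-d) / np.sum(np.exp(-d))
    return p[k]

def geometric_similarity(side, side_k):
    """Ratio of matched sides (long with long, short with short):
    1.0 for a perfect match, approaching 0 as the sides diverge."""
    return min(side, side_k) / max(side, side_k)

protos = [np.zeros(4), 3.0 * np.ones(4)]     # toy prototypes for 2 categories
s_emb = embedded_similarity(np.zeros(4), protos, 0)
s_geo = geometric_similarity(40.0, 50.0)
print(s_emb > 0.99, s_geo)  # the query matches prototype 0; side ratio 0.8
```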

Verification decision
In deciding whether the verification passes, a threshold is set for each similarity to give a certain fault tolerance. All of the decision rules must be satisfied in this system for the identification of the drug to be considered verified, that is, both the embedded features and the geometric features must reach a certain level of similarity.
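The all-rules-must-hold decision can be sketched as a conjunction of threshold tests. The threshold values below are placeholders for illustration; the actual thresholds are not stated in this section.

```python
def verify(sim_embed, sim_long, sim_short, t_embed=0.9, t_geom=0.8):
    """Pass verification only if every similarity clears its threshold.

    t_embed and t_geom are hypothetical values; in practice they would
    be tuned on the verification dataset.
    """
    return sim_embed >= t_embed and sim_long >= t_geom and sim_short >= t_geom

print(verify(0.95, 0.90, 0.85))  # all rules satisfied -> pass
print(verify(0.95, 0.90, 0.50))  # short-side similarity too low -> reject
```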

Experiments
The drug image verification dataset was collected in cooperation with Kaohsiung Veterans General Hospital (KVGH). To maintain a certain image quality, we used a Logitech C920 Pro webcam to collect images at a resolution of 1920 × 1080 pixels. The database includes images taken with various placement positions and placement angles. The drug image verification dataset contains 21 categories of drugs from KVGH. Since our system is composed of two subsystems that are trained separately, our dataset was also divided into two parts, namely, the detection dataset and the matching dataset. A sample image from each dataset is shown in Fig. 8. The matching dataset was cropped by our trained detection system, and we evaluated our drug verification system by applying this matching dataset to the matching system.
In the experimental comparison, the Euclidean distance was used to compare the false positive rate (FPR) under different training parameters. The experimental results are shown in Table 2, where n-way represents the number of categories in each training episode and n-shot represents the number of images of a single category in each training support set. The results show that when n-way is relatively high but n-shot is relatively low, the FPR is high. A higher n-shot reduces the FPR, and among our experimental settings, 10-way 20-shot training gave the lowest FPR (0.047%).
The results of a comparison with other methods are shown in Table 3. The proposed method is evaluated with two feature distance calculations, the cosine distance and the Euclidean distance. The experimental results show that the Siamese network and triplet network, which respectively use sample pairs and sample triplets in a single training step, have higher FPR values than our method. That is, feature embedding that considers multiple category features during training helps reduce the FPR on our dataset.
The results with the lowest FPR in Table 3 are shown as a confusion matrix in Table 4. We evaluate the results with the common metrics FPR and false rejection rate (FRR). The FPR for our method with the Euclidean distance is 0.047%, which means that the incorrect acceptance of a drug of the wrong category is effectively avoided. The FRR is 1.68%, which means that most drugs of the correct category pass verification, giving the system high efficiency.

Table 3. Comparison with other image feature embedding methods.

Training method | FPR (%)
Siamese network (16) | 0.151
Triplet network (17) | 0.169
Ours (cosine distance) | 0.124
Ours (Euclidean distance) | 0.047

We analyzed the incorrectly identified samples in the experiments. Figure 9 is a detailed confusion matrix for all 21 classes; note that the table cannot display the number of TNs. Figure 10 shows some FN samples, that is, positive samples that the system judged should not be verified. The FN for the image on the left may be due to a detection angle error or shooting exposure. The detection error for the image on the right may have been due to the label being out of focus and close to the image boundary.
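The FPR and FRR metrics used above can be computed from confusion-matrix counts as sketched below. The counts are purely illustrative and are not taken from the paper's confusion matrix.

```python
def fpr(fp, tn):
    """False positive rate: fraction of wrong-category drugs that
    incorrectly pass verification."""
    return fp / (fp + tn)

def frr(fn, tp):
    """False rejection rate: fraction of correct-category drugs that
    incorrectly fail verification."""
    return fn / (fn + tp)

# Illustrative counts only.
print(round(100 * fpr(1, 2099), 3), "%")   # FPR as a percentage
print(round(100 * frr(2, 117), 2), "%")    # FRR as a percentage
```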
The pair of images in Fig. 11(a) have the same bottle body and the labels have the same background color. The pair of images in Fig. 11(b) show bottles with no features on the surface,

Conclusions
In this study, we proposed a drug image verification system that includes an automatic drug detection system and a matching system. The dataset used for the performance evaluation of verification tasks contains drugs of 21 categories; it is an image database created by imaging drugs provided by domestic local medical centers. Our verification system achieved an FPR of 0.047% in these verification tasks. Furthermore, the proposed system can be integrated into a mobile phone so that users can easily identify a drug at any time. In the future, we will attempt to achieve a higher recognition rate for the proposed drug image verification system by collecting more data, and we will make the system more practical.