Hardware Architecture Design for High-performance H.264/AVC Deblocking Filter

save transposed memory and reduce the hardware size. By coupling the proposed architecture with parallel processing units, the processing speed will be increased


Introduction
With the booming of information technology, various multimedia technologies are often used via Internet applications, such as video conferencing, video on demand, and video surveillance. Network technology has led to the introduction of broadband Internet access; however, with the increasing number of users and the higher demand for picture quality, the network bandwidth is expected to be saturated soon. Therefore, multimedia video compression technology will be heavily relied on to compress data, so as to store more multimedia data in existing storage spaces, which in turn can reduce the time required for data transmission via the Internet.
The H.264/AVC standard for video compression has recently been jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).(1) The picture quality, compression efficiency, and error tolerance of H.264/AVC significantly outperform those of previous video compression standards. Compared with other video compression standards,(2) H.264/AVC can provide higher picture quality and compression efficiency for low-bit-rate video. However, the significant increase in computational complexity results in many obstacles for the use of H.264/AVC in real-time systems. Similarly to other existing video compression systems, the block-based discrete cosine transform (BDCT) and block-based quantization are commonly used, whereby the signal is processed by dividing the entire picture into non-overlapping blocks. In the BDCT, each block is converted from the spatial domain to the frequency domain. Then, the obtained coefficients are divided by a quantization matrix based on a quantization parameter. As a result of this processing, the high-frequency signals invisible to the human eye are removed, thereby achieving data compression. However, when a relatively large quantization parameter is used, many BDCT coefficients are quantized to zero; the amount of data is reduced but the picture is degraded, indicating that the quantization parameter directly affects the compression quality of a video system.
A deblocking filter is included in the next-generation video compression system based on H.264/AVC as a so-called in-loop filter. Traditionally, a deblocking filter is integrated into a video codec as a postfilter; in the in-loop arrangement, the built-in deblocking filter processes the picture before it is used as a reference image.(3) In contrast, an H.264/AVC deblocking filter is equipped with a mechanism that is highly adaptable to different picture sources, which produces higher video quality.(4) In the H.264/AVC standard, the deblocking filter is an essential mechanism for ensuring video quality. Given that the prediction, transform, quantization, and motion compensation blocks have minimum sizes of 4 × 4, to remove the blocking effect in the picture, the deblocking filter must filter every 4 × 4 block boundary in the picture; that is, almost every pixel must be processed by the deblocking filter. To this end, it is necessary to carry out complicated reading and access of the memory storing the picture signals. In addition, H.264/AVC utilizes various thresholds and conditions to determine and select modes, so as to adapt to a wide range of picture characteristics. Although the deblocking algorithm of H.264/AVC has been optimized, it still contributes one-third of the total computational complexity of the decoder.(5) The hardware reported in Refs. 6-9 applies a traditional processing flow, i.e., firstly, all the pixel signals necessary to process a macroblock (MB) are loaded into the design, and then they are transferred back to the picture memory. Huang et al.(6) proposed a typical architecture, wherein horizontal filtering can be processed directly, while in the vertical direction, transposed memory must be introduced to avoid the possibility of a memory crash; this process takes nearly twice the computing time in Ref. 7.
However, that design only considered the calculation of the luma part of the MB, while the chroma part was not considered, which is a limitation of that study. Our newly proposed memory architecture is based on the concept of two-dimensional memory processing introduced by Li et al.,(8) wherein the pixel data originally stored in the same memory module are interleaved, so as to avoid the abovementioned memory crash in Ref. 6. A certain hardware cost is indeed required, but such processing reduces the use of transposed memory. Venkatraman et al.(9) suggested that two sets of deblocking filter modules can be applied at the same time, which enables the parallel operation of horizontal and vertical filtering, thus reducing the operation time. This, however, results in a higher memory and hardware circuit cost in the system. The hardware designs proposed in Refs. 10-15 adopt a process integrating reading, write-back, and processing, increasing the hardware usage while reducing the waiting time of the circuit. Cheng et al.(10) applied shift registers to preserve the temporary data that require continuous operation, so as to reduce the time required for data access. In the study of Chang et al.,(11) the pixels that must be filtered are first determined, and the pixels that do not need to be filtered are not transferred into the hardware, resulting in a variable operation time. By using two-dimensional continuous operations, Sheng et al.(12) were able to retain more temporary data and reduce the time required for reading and loading; the disadvantage of their approach is the high memory requirement. In Refs. 13-15, slice memory was introduced to store the pixel data of horizontal picture rows, thereby reducing the time spent reading and loading data, which, however, increases the demand for memory. Zheng et al.(13) attempted to use more internal buffers to reduce the time required for data reading and loading. Liu et al.(14) proposed to unify external memory and internal transposed memory to reduce the computation time, and Shih et al.(15) proposed a fifth-order pipeline filter with parallel computation of the boundary strength (BS). Effective hardware architectures(6)(7)(8)(9)(10)(11)(12)(13)(14)(15) have generally aimed to reduce the circuit size, memory cost, and operation time. However, these architectural designs failed to take into account the memory cost and processing speed simultaneously. In contrast to the previous designs in Refs. 6-15, in this paper, we propose a more effective hardware architecture that enables the H.264/AVC deblocking filter to achieve a lower memory capacity, reduced access requirements, and a reduced number of operation cycles.
In Sect. 2, we outline the deblocking filter algorithm in accordance with H.264/AVC. In Sect. 3, we thoroughly explain our newly proposed effective hardware architecture design. In Sect. 4, we compare the efficiency of the proposed architecture with that of previous architectures. We conclude this study in Sect. 5.

Deblocking Filter Algorithm for H.264/AVC
A deblocking filter is used to eliminate the blocking effect in pictures,(5) thereby generating a smoother picture. In the block-based H.264/AVC standard, the blocking effect originates from the use of 4 × 4 transforms and block motion compensation. Therefore, a deblocking filter is considered an effective tool for removing the blocking effect. In theory, a deblocking filter can be separated from the system as a postfilter,(5) which is only used to filter displayed pictures. By introducing a deblocking filter in the loop at the encoding end, higher visual quality is expected to be achieved. This is because the reference picture used for motion compensation has been filtered and reconstructed.(18)(19)(20) The deblocking filter used in the H.264/AVC standard has high adaptability. A large number of thresholds can be used to adjust the strength of the filter in accordance with the picture and image characteristics. Also, the thresholds can be fine-tuned in accordance with quantization parameters (QPs), because the generation of the blocking effect is directly related to QPs.(4)

Filtering order
The deblocking filter applied in H.264/AVC carries out processing with the MB as a unit. There are many 4 × 4 blocks within an MB. The deblocking filter first processes the vertical edges of adjacent blocks in the horizontal direction before processing the horizontal edges of adjacent blocks in the vertical direction. After deblocking filtering in both directions of an MB is completed, the next MB is processed. As shown in Fig. 1, deblocking filtering is carried out on all the MBs in the picture in a raster-scan manner until the MBs of the entire picture are filtered. Deblocking filtering of an edge involves a maximum of eight pixel values. Depending on the BS and QP, filtering may not be performed, or the values of up to six pixels may be adjusted, i.e., up to three pixels (p2, p1, p0, q0, q1, q2) on each side of the boundary can be changed, as shown in Fig. 2.
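The per-MB edge-processing order described above can be sketched as follows (an illustrative Python enumeration; the tuple labels are ours, not part of the standard):

```python
def mb_edge_order(mb_size=16, blk=4):
    """Illustrative edge order for one macroblock: all vertical edges
    (filtered in the horizontal direction, left to right), then all
    horizontal edges (filtered in the vertical direction, top to bottom)."""
    edges = []
    for x in range(0, mb_size, blk):      # vertical edges at columns 0, 4, 8, 12
        edges.append(("V-edge", x))
    for y in range(0, mb_size, blk):      # horizontal edges at rows 0, 4, 8, 12
        edges.append(("H-edge", y))
    return edges
```

For a 16 × 16 luma MB this yields eight edge groups, four per direction, matching the order in which the filter visits them before moving to the next MB in raster-scan order.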

BS
For two adjacent 4 × 4 luma component blocks, a parameter ranging between 0 and 4 called the BS is specified in the standard. For two adjacent blocks, the BS is determined from the selection of the intrablock/interblock prediction mode, the motion vector difference, and whether residual values are encoded. If at least one of two adjacent 4 × 4 blocks is subjected to intrablock coding and their shared edge overlaps the boundary of the MB, the strongest filtering mode is used and the BS is set to 4. If at least one of the two adjacent blocks is subjected to intrablock coding without the edge overlapping the boundary of the MB, the BS is set to 3. If the above conditions have not been satisfied, the determination process is continued. If neither block is intracoded but at least one block has encoded residual values, a medium-intensity filtering mode is applied and the BS is set to 2. If the motion compensation of the two blocks refers to different pictures, or the difference between the motion vectors of the two blocks is at least one luma sample, a relatively weak filtering mode is chosen and the BS is set to 1. When none of the above-mentioned conditions is satisfied, the boundary remains unfiltered and the BS is set to 0, as shown in Fig. 3. Also, the BS of the chroma component is not recalculated; instead, the BS of the corresponding luma component is directly copied to the boundary of the chroma component, as shown in Fig. 4.
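The cascade of BS decisions above can be sketched as follows (an illustrative Python sketch; the parameter names are ours, and we follow the H.264/AVC rule that the intra conditions apply when either of the two blocks is intracoded):

```python
def boundary_strength(p_intra, q_intra, on_mb_edge,
                      p_coded, q_coded,
                      same_ref, mv_diff_ge_one_sample):
    """Return the BS (4..0) by testing the conditions in priority order,
    as in the decision flow of Fig. 3."""
    if (p_intra or q_intra) and on_mb_edge:
        return 4                              # strongest filtering mode
    if p_intra or q_intra:
        return 3                              # intra, but not on the MB boundary
    if p_coded or q_coded:
        return 2                              # medium-intensity filtering mode
    if (not same_ref) or mv_diff_ge_one_sample:
        return 1                              # relatively weak filtering mode
    return 0                                  # boundary remains unfiltered
```

Because the conditions are tested in order, each boundary receives exactly one BS value; chroma boundaries would simply reuse the luma result, as stated above.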

Definition of thresholds α and β
The set of sample values to be filtered was mentioned in Sect. 2.2. If the decision of whether or not to carry out filtering were made simply according to the BS, it could result in a blurred picture. Therefore, only the edges with a blocking effect are deblocked, and the other edges remain unchanged so as to preserve the original sharpness of the picture. The decision process is shown in Fig. 5, which indicates that the sample values are regarded as exhibiting a blocking effect only when the conditions in Eqs. (1)-(4) are met, in which case deblocking filtering is carried out on the samples. With increasing QP of blocks Q and P, the thresholds α and β increase, where α and β are regarded as the criteria for judging whether the changes in the original image are sufficiently large. If the QP shrinks, any changes at the block edge are regarded as original features of the picture, not false edges brought about by the blocking effect. On this basis, α and β are set to small values, to keep the original picture as unchanged as possible. Given that the distortion driven by the blocking effect becomes rather pronounced when the QP is increased, the corresponding increases in α and β lead to more sampling points in the picture being processed by the deblocking filter.
At the slice level, H.264/AVC defines two parameters used to adjust offsets at the encoding end, i.e., Offset A and Offset B. These offsets can be fine-tuned at the compression end, thereby causing the filter to apply different thresholds α and β for the same QP.
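The edge-activity test governed by α and β can be sketched as follows (an illustrative Python sketch following the standard form of the filtering conditions; the function name and argument order are ours):

```python
def should_filter(p1, p0, q0, q1, alpha, beta, bs):
    """An edge is treated as a blocking artifact (and thus filtered)
    only if BS > 0 and the local pixel differences fall below the
    QP-dependent thresholds alpha and beta; larger differences are
    taken to be real picture edges and left untouched."""
    return (bs > 0
            and abs(p0 - q0) < alpha   # step across the block boundary
            and abs(p1 - p0) < beta    # gradient on the P side
            and abs(q1 - q0) < beta)   # gradient on the Q side
```

A small step across the boundary with flat pixels on both sides passes the test; a large step, which likely belongs to the original picture content, does not, preserving sharpness as described above.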

Filter mode utilized when BS = 1-3
A filter with a basic strength is applied when BS = 1-3. The boundary pixels p3, p2, p1, p0, q0, q1, q2, and q3 are input to obtain P1, P0, Q0, and Q1 via the algorithm. The filtered P0 and Q0 are output, replacing the original p0 and q0, only when Eqs. (1)-(4) are satisfied. The filtered P1 is output, replacing the original p1, only when Eq. (9) is satisfied. Similarly, Q1 is output and replaces the original q1 if and only if Eq. (10) is satisfied. The equations used to calculate P1, Q1, P0, and Q0 are Eqs. (11)-(14), respectively, where c1 is a clipping coefficient determined by the BS and Index A. In the luma component, c0 is c1 plus the number of true conditions among Eqs. (9) and (10). In the chroma component, c0 is fixed to c1 + 1.
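Since Eqs. (9)-(14) are not reproduced here, the basic-strength luma filter can be sketched in the form given by the H.264/AVC standard (an illustrative Python sketch; the names ap/aq stand for the conditions of Eqs. (9) and (10), and the exact expressions should be checked against the paper's equations):

```python
def clip3(lo, hi, x):
    """Clamp x into the closed range [lo, hi]."""
    return max(lo, min(hi, x))

def filter_edge_bs1to3(p2, p1, p0, q0, q1, q2, c1, beta):
    """Basic-strength (BS = 1-3) luma edge filter, standard form.
    c1 is the BS/Index-A-dependent clipping coefficient; c0 is c1 plus
    the number of true side conditions, as described in the text."""
    ap = abs(p2 - p0) < beta                  # Eq. (9)-style condition
    aq = abs(q2 - q0) < beta                  # Eq. (10)-style condition
    c0 = c1 + int(ap) + int(aq)               # luma clipping bound
    delta = clip3(-c0, c0, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3)
    P0 = clip3(0, 255, p0 + delta)            # replaces p0
    Q0 = clip3(0, 255, q0 - delta)            # replaces q0
    # P1/Q1 are only updated when their side condition holds:
    P1 = p1 + clip3(-c1, c1, (p2 + ((p0 + q0 + 1) >> 1) - 2 * p1) >> 1) if ap else p1
    Q1 = q1 + clip3(-c1, c1, (q2 + ((p0 + q0 + 1) >> 1) - 2 * q1) >> 1) if aq else q1
    return P1, P0, Q0, Q1
```

For example, a flat 60|70 boundary with c1 = 2 and β = 3 is smoothed into a 62, 64, 66, 68 ramp, which is exactly the kind of gradual transition the filter is meant to produce across a false edge.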

Hardware Architecture Design
In Fig. 6, the proposed architecture for the deblocking filter system is shown. In the memory, interleaved methods are applied to store pixels, thereby addressing the problem of transposed memory in previous designs. In addition, two modularized memory parts with different functions are proposed, equipping both memory modules with the capability of two-dimensional storage. One of the memory modules is centered on a dual-port static random access memory (SRAM) core, storing the block data under real-time processing, and is referred to as the RAM-0 module. The other memory module is a two-port SRAM, which stores the block data on the far right of the previous MB and is named the RAM-1 module. A novel feature of this architecture is that a fourth-order pipeline filter is placed in the deblocking filter module, which, coupled with the proposed recursive control, can reduce the frequency of memory access. The control unit, in addition to controlling the addresses of the basic memory components, the data stream selection, and the data input and output, stores the parameters required by the deblocking filter module in a system on chip (SOC). These parameters include Index A, Index B, and the BS. In the system architecture diagram shown in Fig. 6, except for the control signal lines, the width of the entire internal data bus is 32 bits. The external bus input and output pixel data are asynchronous. Therefore, they can share one 32-bit bidirectional channel, making this architecture ideal for an SOC.
Four 4 × 4 luma component blocks and four 4 × 4 chroma component blocks from an MB are stored in the RAM-1 module, which is composed of eight groups of 16 word × 8 bit dual-port memory, for which the required storage space is 32 × 32 bits, equivalent to 1024 bits. The other memory module, RAM-0, temporarily stores the pixel data of the MB while it is being processed. For the deblocking filter processing flow proposed in this paper, the maximum temporary storage is sixteen 4 × 4 blocks; the RAM-0 module consists of eight groups of 32 word × 8 bit dual-port memory, for which the storage space required is 64 × 32 bits, equivalent to 2048 bits.
For the deblocking filter architecture proposed in this paper, the processing of an MB takes only 279 computation cycles, among which five cycles are required to load the calculated BS and the parameters Index A and Index B, and the other 274 cycles are required for loading pixels and restoring data. When the processing of an MB has reached the last MB in the picture, an additional 32 cycles are required to store the pixel data remaining in RAM-1 in the memory of the reconstructed picture, giving a total of 301 computation cycles.

Data processing flow
The deblocking filter signal processing flow proposed in this paper is shown in Figs. 7 and 8, in which B0-B39 are 4 × 4 pixel blocks. The circles numbered 1 to 48 each require four computation cycles; ellipse H indicates horizontal filtering of the vertical edges and ellipse V indicates vertical filtering of the horizontal edges. In the first stage, the data in block B5 is received from the outside, and the data in block B4 is extracted from the RAM-1 module and sent to the deblocking filter module. In the second stage, the deblocking filter processes B4 and B5 obtained from the first stage in a synchronized manner. At this point, because B5 must be input to the deblocking filter again, it is not necessary to write B5 to the memory. Instead, B5 and B6 received from the outside are sent to the deblocking filter module. At the same time, B4 is temporarily saved in the RAM-0 module. In the fifth stage, the data in block B0 is received from the outside, and the data in block B5 stored in the RAM-0 module in the second stage is extracted and sent to the deblocking filter module. Moreover, B7 and B8 are stored together in the RAM-0 module. After stage eight has been completed, B0-B3, which are not used again in the current MB processing flow, are extracted and sent to the outside of the deblocking filter module. After completing stages 16, 24, 32, 45, 46, 47, and 48, blocks B8, B13, B18, B23, B26, B29, B34, and B37 are respectively stored in the RAM-1 module, because they will soon be used again when the next MB is processed; thereby, the time spent in the read and load stages is reduced, and the repeated loading and writing of the same blocks is eliminated.

Internal memory planning
As shown in Fig. 9, during vertical access of the memory, a memory crash occurs, which hinders the access from being completed in one execution cycle. This means that transposed memory must be used, which can result in reduced efficiency. To solve this problem, we employ the two-dimensional memory access design proposed in Ref. 8, which applies interleaved methods for data placement in different memory modules, eliminating the origin of the memory crash in the previous design. It also makes it possible to read and write in both the horizontal and vertical directions. In a two-dimensional memory module, considering that vertically adjacent pixel data are placed in different modules, memory crashes can be entirely avoided, allowing both horizontal and vertical operations to be accelerated in two-dimensional memory without the occurrence of a memory crash.
To achieve two-dimensional memory access, compatible components are required in the circuit to conduct data segmentation, address generation, and data combination, as shown in Fig. 10. The address generator produces the address corresponding to the read or write of the memory row or column as required. Also, the module in which data segmentation occurs is responsible for shifting the input pixels and determining the amount of the data shift for different input addresses, so as to enable vertical pixel signals to be stored in different memory modules. The data combination module receives the shifted data and the output from the memory unit, and according to the output address, the data combination module reverses the shift by restoring the pixel data from the memory module into the pixel data that had not been previously shifted. Thereby, the correct sequence can be obtained, facilitating the subsequent processing.
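One common way to realize the interleaving described above is a skewed bank mapping (an illustrative Python sketch; the exact mapping used in Ref. 8 and in our design may differ, and the function name is ours):

```python
def bank_of(row, col, n_banks=4):
    """Skewed placement: pixel (row, col) goes to bank (row + col) mod
    n_banks, so the 4 pixels of any one row OR any one column land in 4
    different banks and can be accessed in a single cycle without a
    memory crash."""
    return (row + col) % n_banks

# Any row and any column each touch all four banks exactly once:
col0_banks = [bank_of(r, 0) for r in range(4)]
row0_banks = [bank_of(0, c) for c in range(4)]
```

The shift applied by the data segmentation module corresponds to this rotation on write, and the data combination module applies the inverse rotation on read to restore the original pixel order.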

Pipeline-based deblocking filter module
To optimize the processing performance of the deblocking filter, a parallel and pipelined circuit design is used, as shown in Fig. 11. The parallel processing data includes eight pixel inputs, eight pixel outputs, and a set of recursive inputs, and the latency of the critical path is largely reduced by employing a fourth-order pipeline design. Different signals are selected as the input source to read pixels. The purpose of Stage 1 is to look up the values of α, β, and Clip in line with the values of Index A, Index B, and BS, respectively, so as to conduct initial processing on the input pixels. In Stage 2, the output from Filter Stage 1 is processed and the decision flags required by the final signal selector are precalculated. Stage 3 performs the calculation of the last stage using the output result obtained from the part of Filter Stage 2 with BS = 4, while Filter Clip applies the clipping operation to the output result obtained from the part with the BS ranging from 1 to 3. In the Filter Out stage, the final result is selected and output on the basis of the previously calculated decision flags.

Flag operation unit
On the basis of the real-time BS and the absolute differences between pixel values, a flag operation unit determines whether a boundary in the picture is a false boundary generated by the blocking effect or a real one originating from the actual picture, from which FLAG1 to FLAG6 and FLAG chroma are obtained. For different BSs, the flags are used for decision making. If a false boundary resulting from the blocking effect is detected [Eqs. (24)-(26) are all satisfied], the filtered pixels are output to remove the effect. If the boundary is a real picture boundary [one of Eqs. (24)-(26) is not satisfied], or the BS is 0, the original pixels are left unfiltered. The output selection lists are detailed in Tables 1 and 2. "True", "False", and "X" in these tables represent true, false, and unaffected (don't care), respectively. Taking P1 as an example, when FLAG1, FLAG2, FLAG3, and FLAG4 are all true and FLAG chroma is false, the output of P1 is the value (bs1p1) processed by the filter; otherwise, it remains unchanged.
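The P1 row of the output selection table can be sketched directly from the example above (an illustrative Python sketch; the argument names mirror the flag names in the text):

```python
def select_p1(flag1, flag2, flag3, flag4, flag_chroma, bs1p1, p1):
    """Output selection for P1 as described in the text: the filtered
    value bs1p1 is chosen only when FLAG1-FLAG4 are all true and
    FLAG chroma is false; otherwise the original pixel passes through."""
    if flag1 and flag2 and flag3 and flag4 and not flag_chroma:
        return bs1p1
    return p1
```

The other output pixels (P0, Q0, Q1, and the BS = 4 cases of Table 2) follow the same pattern with their own flag combinations.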

Simulation Results and Comparison
The hardware circuit design of the proposed architecture was based on Verilog HDL and synthesized using Synopsys Design Compiler under a TSMC 0.18 μm CMOS process with the operating frequency set to 100 MHz. The number of synthesized logic gates was 19.4 K. By comparing the proposed hardware architecture with those in the recent literature,(6)(7)(8)(9)(10)(11)(12)(13)(14)(15) we show that it avoids the disadvantages of previous deblocking filters in terms of the memory cost and processing speed, as shown in Table 3.
We hereby briefly explain the evaluated fields. Cycles/MB is defined as the number of execution cycles required to fully filter an MB. Filter cycles/MB is the number of operation cycles actually needed by the deblocking filter core. SRAM for pixels is the memory applied in the design, and SRAM in bits is that memory directly converted into the capacity, in bits, that can be stored; the space involved in this part is not included in the gate count. 4 × 4 registers comprise the internal memory of the circuit, the first-in, first-out (FIFO) buffers, and the transposed memory; registers in bits are this storage space converted to the corresponding number of bits. # of edge filters is the number of deblocking filter cores utilized. Frames per second (FPS) is the number of frames of 1280 × 720 size that can be processed per second at the same frequency of 100 MHz. Taking into account the design process, the cost of the circuit after synthesis, and the figure of merit derived from the performance measurement P (performance = cycles × memory), the obtained reference factors are in line with the standard. We carried out a comparison of the proposed hardware architecture with those in the recent literature, as summarized in Table 3.
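The FPS field can be derived from Cycles/MB by simple arithmetic (a sketch using the numbers stated in this paper; the constant names are ours):

```python
# A 1280 x 720 frame contains (1280/16) * (720/16) macroblocks.
CLOCK_HZ = 100_000_000            # operating frequency: 100 MHz
CYCLES_PER_MB = 279               # cycles to process one MB (this design)
MBS_PER_FRAME = (1280 // 16) * (720 // 16)    # 80 * 45 = 3600 MBs

# Frames processed per second at this clock rate:
fps = CLOCK_HZ / (CYCLES_PER_MB * MBS_PER_FRAME)
```

At 279 cycles/MB this gives roughly 99.6 frames per second for 1280 × 720 video, which is how the FPS column of Table 3 can be reproduced for each compared design.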

Conclusion
In this paper, a new hardware architecture design employing a deblocking filter with low cost and high efficiency was proposed. The distinguishing features of the proposed hardware architecture are that it is based on static random access memory modules, its application of a fourth-order pipeline deblocking filter architecture, and its improved read/load capability, reducing the computation time needed and the frequency of memory access. In terms of integrated circuits, the architecture design was carried out with Verilog HDL and the circuit synthesis tool Synopsys Design Compiler under a TSMC 0.18 μm CMOS process.
In contrast to the architectures proposed in the recent literature, the deblocking filter proposed in this study applies two-dimensional access memory for the interleaving of data, which frees the design from transposed memory. The approach is characterized by a low memory requirement (3072 bits) and as few as 279 operation cycles to process an MB. The experimental results suggest that our newly designed hardware architecture is compatible with the H.264/AVC video compression system and can satisfy the requirements for real-time computation.

Fig. 9. (Color online) Diagram of memory access. (a) Traditional memory storage and (b) shift transposition storage method.

Table 2
Output selection table when BS = 4.

Table 3
Comparison of hardware circuit designs.