# Flexible and Low Power Binary-Tree Current Mode Min/Max Nonlinear Filters Realized in CMOS Technology

R. Długosz<sup>1,2,\*</sup>, T. Talaśka<sup>3</sup>

 <sup>1</sup> University of Neuchâtel, Institute of Microtechnology, Rue A.-L. Breguet 2, CH-2000, Neuchâtel, Switzerland
<sup>2</sup> University of Alberta, Department of Electrical and Computer Engineering 114 St – 89 Ave, Edmonton Alberta, T6G 2V4, Canada, <u>rdlugosz@ualberta.ca</u>
<sup>\*</sup> fellow of the Marie Curie European Union Outgoing International Fellowship <sup>3</sup> University of Technology and Life Science, Faculty of Telecommunication and Electrical Engineering, ul. Kaliskiego 7, 85-791 Bydgoszcz, Poland, talaska@utp.edu.pl

# ABSTRACT

In this paper we present current mode, programmable, binary tree MIN/MAX filters designed for nonlinear data processing. Such circuits are often used in the image filtration, in such operations like erosion or dilatation, where enable easy noise reduction or correction of objects in the image. Two kinds of filters are proposed in the paper. The first one can be used in the 1-dimensional (1-D) signal processing. Samples of the input signal are stored in the circular analog delay line, where each sample remains in one delay element as long as is overwritten by the new data after several clock phases. As a result, only one analog delay element is updated with each new sample, what minimizes both the errors and the power dissipation. The 2-D filters are the natural extension of proposed 1-D filters. In this paper we propose programmable 2-D filters, which can be easily programmed to perform various nonlinear operations. The experimental image processor with 64 inputs, organized into 8x8 matrix, has been designed and tested in the HSPICE simulations. The circuit enables processing of the images with 64 pixels (8x8) with the rate equal to 500 kHz, dissipating power of 20  $\mu$ W. The effective data rate is equal to about 32 MSPS and energy/pixel is about 1 pJ.

Keywords: nonlinear filters, image processing, erosion, dilatation, winner takes all circuits

# **1. INTRODUCTION**

The MIN/MAX circuits are often used in different applications like artificial intelligence systems (e.g. neural networks), multimedia and telecommunication systems. There are significant similarities between the nonlinear MAX/MIN filters and the other circuits often called winner takes all (WTA). Both types of circuits have a similar task, which relies on searching for the maximum or minimum signal between many signals provided to the circuit's inputs. In fact, the only difference relies on the type of signals delivered to the inputs. The MIN/MAX filters operate with one signal, which is sampled in time domain, and particular inputs get samples stored in the delay line as illustrated in Fig. 1 (a). In WTA circuits, all input signals are generally independent one with respect to each other, what is shown in Fig 1 (b). The core circuit, which is directly used to select the maximal or minimal signal, can be exactly the same in both cases. As a result, in this paper both classes of circuits are treated as one, and proposed circuits are simple called MIN/MAX circuits.

Various types of MAX/MIN circuits have been described in the literature [1 - 8], but they can be classified generally into two groups. In one type, all input signals are compared directly in the one stage, selecting the largest of *n* competing voltages or currents [3, 8]. These circuits often have a simple structure, but are limited in terms of *n*, since increasing *n* degrades the resolution of the circuit. The second group is based on the conception of the binary tree, where all input signals are grouped into pairs and only one signal of each pair goes to the next stage (layer) of the circuit. After passing through all layers, finally we obtain only one winning signal at the circuit's output. Both synchronous and asynchronous solutions of this type of architectures are described in the literature. In the synchronous case [6], time allocated for competition on each layer is imposed by an external controlling generator. As a result, a *k*-phase clock circuit is required, where *k* is the number of layers in the tree. Each additional layer allows doubling of number of the possible input currents. Asynchronous solutions are, in general, simpler than their synchronous counterparts, as the controlling clock generator is now eliminated. As a result, the winning signal is selected in different instants [4, 5], depending mostly on values of the input signals. The additional advantage of asynchronous solutions is lower noise level, which is due to the lack of the clock generator. If asynchronous MIN/MAX circuit is used as a component in the bigger system, which is sampled, the additional synchronization is required, but only at the input and at the output to enable data exchange.

The advantage of the binary-tree MIN/MAX circuits is that they always provide an unambiguous solution. This is because exactly one winning signal is selected in each pair in the tree. Typically, the paths between the particular inputs and the filter's output consist of the *k* single competitive pairs. As a result, speed of such circuits depends linearly on the number of the layers in the tree (*k*). Since number of layers *k* and the number of the circuit inputs, *n*, are related by  $k = \log_2 n$ , speed of binary tree MIN/MAX circuit degrades relatively slowly with an increase of *n*.



Figure 1. Two applications of the MIN/MAX circuit: (a) the nonlinear filter (b) the WTA circuit used for example in the Kohonen's neural network

Examples of the asynchronous binary-tree current-mode MIN/MAX circuits are presented in [1, 2]. In those circuits comparing and copying operations of the signals require two different and independent blocks. The winning current in each pair is calculated on the basis of the input currents. As opposed to those solutions, in the circuit proposed in this paper the comparing and copying blocks cooperate more closely, but the main advantage is that the winning current is not calculated, but simply copied to the circuit's output, what increases both the overall speed of the circuit and its precision. A path between the given filter's input and the filter's output contains only several switches and current mirrors. Switches are controlled by the logical signals coming from comparators' outputs. Unlike in [1, 2], in our circuit no more than two transistors are placed between the supply VDD and VSS voltages. This allows to operate with the supply voltage as low as 0.5 V - 1 V for a CMOS 0.18 µm technology.

In [8], an example current-mode MIN/MAX circuit has been described, where all input currents are compared at only one stage (input layer) simultaneously. Due to the applied tree architecture, our circuit is considerably faster compared to that of [8]. The circuit proposed in this paper can be easily programmed to detect the maximum or the minimum input current, what is important from different applications point of view. This circuit has been designed on the basis of the solution presented by authors earlier in [9], but significant modifications make them in fact the quite different structure. The current solution has a very important advantage over the previous structure. In the former approach input currents were copied many times from one layer to the other. Each MIN/MAX pair contained two or three current mirrors between the input and the output, resulting in relatively large number of current mirrors between the bottom and the top of the tree. Each copying operation introduced a small error thus distorting the output signal. As a result, errors introduced by each layer cumulated at the top of the tree. The similar problem is rather typical in all binary tree solutions. In general, small errors do not make the problem in such applications like neural networks, as the adaptive mechanisms in these circuits are able to compensate such errors, but if the MIN/MAX circuit is going to be used in applications that require a higher precision, such as for example filters, these errors can became the significant problem. In the solution presented in this paper this problem has been strongly reduced, as propagation of currents from layer to layer has been eliminated. Now, separate copies of the input currents are delivered directly to particular MIN/MAX pairs in the tree. Paths between the circuit's inputs and each layer have been simplified, and always contain only one or maximum two current mirrors independently on the number of layers in the tree. This approach strongly reduces the copying errors, makes the error level independent on the number of layers in the tree, and additionally speeds up the circuit.

The paper is organized as follows. In the next section we present the general idea of the proposed MIN/MAX circuit in context of the 1-D signal processing. Then in section 3 the implementation of the circuit as a 2-D nonlinear image processor is described. Finally the conclusions are formulated in section 4.

# 2. PROPOSED MIN/MAX 1-DIMENSIONAL BINARY-TREE PROGRAMMABLE FILTER

The general structure of the proposed asynchronous binary tree MIN/MAX circuit is shown in Fig. 2 for example case of 8 inputs and 3 layers. The main building block in this structure is the MIMA2 (MIN/MAX) circuit shown in Fig. 3. This circuit compares two currents, generating the logical signals at the output that control connections in the tree. The binary tree structure when using as a nonlinear filter contains additionally the delay line with N+1 delay elements, where N is the filter order. In the filter mode the input signals IN1 – IN8 are delayed samples of one signal (IN). In case of filter described in this paper delay line is built from the standard sample-and-hold current mode elements and has the circular structure, what means that samples ones written to the memory cells remain on their positions as long as they are replaced by the new samples after N clock phases. This approach has various important advantages. Several advantages like e.g. reduction of errors and power dissipation have been indicated in the last section. The other additional very important advantage should be mentioned. In a given clock phase clk<sub>n</sub> the new sample is stored in a given delay element. As the other samples remain on their positions, only one path in the tree must be updated in each clock phase. For the tree with k layers this means that maximum only k MIMA2 circuits must be potentially switched over in the one clock phase. As each switching over requires an additional power expense, such approach is therefore very power efficient.



Figure 2. The binary tree MIN/MAX structure

The MIMA2 blocks have two analog current inputs and two digital outputs (one is a negation of the other). The MIMA2 elements compare input currents, generating the logical signals, which are used to establish connections between the filter's inputs with the winning currents and the next layers in the tree. Block diagram of the MIMA2 circuit, which is used in proposed binary tree of Fig. 2, is shown in Fig. 3 (a) and its full electrical scheme in Fig. 3 (b). Current mirrors are denoted by CM, while transresistance mode comparator as TMC. The TMC comparator is built as a cascade connection of digital inverters (NOT gates). In the scheme of Fig. 3 (b), parasitic gate-to-source capacitances of the PMOS and the NMOS transistors in the inverter serve as integrating elements. These capacitances are recharged depending on which current (negative or positive) has a bigger absolute value. At the inverter output, we obtain logical 1 or 0. In proposed circuit two options are possible. In the first option, the TMC includes four inverters, while in the second mode three inverters in the output path, what is programmed using one bit, as illustrated in Fig 3 (b). This allows us to detect both a maximum or minimum input current, depending on application. This feature is important in the 2-D image processor described in the next section, as enables various filtering modes.

The controlling digital circuitry in proposed solution is shown also in Fig. 2. It consists of AND gates, which control switches that connect chosen filter's inputs to given MIMA2 blocks in particular layers in the tree. Connections between the circuit's inputs and the MIMA2 blocks in the first layer are permanent and do not require any switches. Connections between inputs and the second layer require switches controlled directly by single  $s_{1n}$  signals coming from MIMA2 blocks in the first layer. Connections to all next layers in the tree require switches controlled by AND gates, where one gate's input is the enable (EN<sub>k-1,n</sub>) signal from the previous layer, while the second input is the new  $s_{kn}$  signal from a given layer. In multimedia applications filters are rather of low order what makes this circuit very simple.

The proposed MIN/MAX binary tree circuit works in an asynchronous mode, what is an advantage, as no clock generator is required inside the core circuit. Moreover, it exhibits a self-stabilization ability. The comparator integrates the difference between both input currents. As a result, the smallest is this difference the longer lasts the stabilization process. However, even a very small difference between these currents leads to a proper result if we assume enough time. One of consequences of using such comparators is that at each MIMA2 output, proper results appear after different time periods. For this reason, an incorrect result can appear at the tree output, but only temporarily. After a sufficiently long time period, the system goes towards a proper solution.



Figure 3. Block diagram (a) and CMOS circuit together with the C<sub>P</sub> gate-to-source parasitic capacitance (b) of the current-mode MIN/MAX (MIMA2) circuit used to build the binary tree of Fig. 2

2.1. Verification of the 1-D nonlinear filter in HSPICE simulations

The proposed binary tree filter with 8 inputs and 3 layers has been examined using HSPICE in the CMOS 0.18  $\mu$ m technology. Power consumption, detection speed and accuracy were tested. Supply voltage was VDD = 0.65 V, but the circuit is robust for this parameter adjusted in a wide range. The circuit has been successfully tested in a wide temperature range (-40, +125 C) for various transistor modes. The circuit worked correctly, although the speed of the circuit was different for different process-voltage-temperature (PVT) parameters. The example simulation results are illustrated in Fig. 4. The first panel illustrates operation of the delay line for the sinusoidal input signal. Particular curves in this figure are the delay line's output signals applied then to the binary tree. Depending on the MIN or MAX mode the binary tree produces at the output signals shown in the second panel. As the path between the input and the output is very short the resulting output signal is a good copy of a given minimal or maximal value. The enable signals from the third layer are illustrated in the third panel. This is worth noting that only one ENkn signal is equal to logical 1 in every time constant, what enables unambiguous selection of one input current. The circuit is very power efficient. Being sampled with the clock frequency of 1 MHz the average power dissipation is equal to about 4  $\mu$ W for the average values of the input currents equal to 100 nA. The IDD current is illustrated in the fourth panel of Fig 4.



Figure 4. Simulation results of the MIN/MAX binary-tree circuit shown in Fig. 2: (top panel) operation of the delay line in the filter mode; (second panel) the MIN/MAX filter's output in both the MIN and the MAX modes; (third panel) enable (EN) signals in the third layer in case of MIN mode; (bottom panel) current consumption in the circuit for the clock frequency equal to 1 MHz and for the average level of the input currents equal to 100 nA

#### 3. PROPOSED 2-D BINARY-TREE NONLINEAR IMAGE PROCESSOR

The proposed MIN/MAX circuit can be used in the very efficient image processing for such operations like erosion, dilatation, or combinations of them like opening and closing operations. These operations are useful in the binary image processing, but are not limited to this class of signals. The image processor proposed here is a fully parallel asynchronous structure. It consists of two separate sections connected serially. Each of these sections is programmed separately by one bit. The general idea of such a processor is shown in Fig. 5. The input signal is the image A. Particular samples of this image (black circles) are compared in horizontal lines using current mode comparators (white squares), producing logical signals s that control connections to the next layer i.e. comparators working in vertical lines (black squares). Finally we get the output image B, which is applied to the next section producing finally the output image C.



Figure 5. The general idea of proposed parallel nonlinear image processor. The first layer generated samples B(x,y) on the basis of input samples A(x,y), while the second layer generates the output samples  $C(\underline{x},y)$  on the basis of samples B(x,y). The white and black boxes and the white and black hexagons represent comparators in particular layers

There are two possible approaches to realization of such an image processor. In the first approach there are two separate sections connected in series, each section having only two calculation layers. This solution, in principle, differs a little bit from the second approach, described in the previous section. Here samples A(x, y) first are copied to the mid-image B, and then samples from this image are copied to the output image C. Both sections work independent, what means that the logical signals in each of them are completely independent one with respect to each other. In this approach we have two copy operations between the processor's inputs and the processor's outputs.

In the second approach there is only one block with four layers. This approach is a direct extension of solution presented in the previous section of this paper. In this case there is only one copy operation between the inputs and the outputs. In the second approach the accuracy is therefore a little bit better, but on the expense of more complicated digital circuitry.

Detailed diagram of proposed single 2-D section is shown in Fig. 6. Each input node produces maximum 10 copies of each sample A(x, y), using the circuit shown in Fig. 7. In this circuit particular copies are independent. Two copies (1<sup>st</sup> and 6<sup>st</sup>) are delivered to comparators in the first layer. Four copies (2<sup>nd</sup>, 5<sup>th</sup>, 7<sup>th</sup>, 10<sup>th</sup>) are delivered to comparators in the second layer, and four copies (3<sup>rd</sup>, 4<sup>th</sup>, 8<sup>th</sup>, 9<sup>th</sup>) are delivered to the output image *B*. The  $s_{1n}$  signals coming from comparators in the first layer are used in selection of winning samples from the image *A*, which are then delivered to the comparators in the second layer via switches  $k_1$ ,  $k_2$ ,  $k_5$ ,  $k_6$ . The comparator on the second layer generates logical signals  $s_{2m}$  that control the connection paths between the given samples A(x, y) and the output samples B(x, y) in such a way, that particular signals  $s_{1n}$  and  $s_{2m}$  are used to generate the enable signals (EN) that control switches  $k_3$ ,  $k_4$ ,  $k_7$ ,  $k_8$ .

The samples from signal B, e.g. B(1, 1), B(1, 2), B(2, 1), which are on the border of this image are calculated using simplified circuits with reduced number of switches and comparators, as illustrated in Fig. 6.



Figure 6. One layer of the 2-dimmensional parallel image processor for nonlinear image filtration



Figure 7. The single node used in proposed image processor to produce copies of particular samples A(x, y). The similar structure is used in the second layer of the processor to produce copies of the samples B(x, y)

### 3.1. Verification of the image processor in HSPICE simulations

The experimental 8x8 image processor has been implemented in CMOS 0.18  $\mu$ m technology and verified in the HSPICE simulations. This processor consists of two separate blocks, where each of them enables 2x2 nonlinear MIN/MAX filtration. Particular blocks are programmed using single bits. As a result, using two bits we get four different filtration modes collected in Table 1.

| $b_1$ | $b_2$ | Operation performed | Operation performed | Resultant operation     | Illustration |
|-------|-------|---------------------|---------------------|-------------------------|--------------|
|       |       | in the first stage  | in the second stage | obtained in both stages |              |
| 1     | 1     | MAX: dilatation     | MAX: dilatation     | MAX: dilatation         | Fig. 11      |
|       |       | 2 x 2 mask          | 2 x 2 mask          | 3 x 3 mask              |              |
| 1     | 0     | MAX: dilatation     | MIN: erosion        | MAX – MIN: closing      | Fig. 9       |
|       |       | 2 x 2 mask          | 2 x 2 mask          | 2 x 3 mask              |              |
| 0     | 1     | MIN: erosion        | MAX: dilatation     | MIN – MAX : opening     | Fig. 10      |
|       |       | 2 x 2 mask          | 2 x 2 mask          | 2 x 2 mask              |              |
| 0     | 0     | MIN: erosion        | MIN: erosion        | MIN: erosion            | Fig. 12      |
|       |       | 2 x 2 mask          | 2 x 2 mask          | 3 x 3 mask              |              |

Table 1. Operations realized in proposed two-stage image processor for different values of controlling bits  $b_1$  and  $b_2$ 

The proposed image processor has been tested by applying to its inputs signals chosen in the special way. In each experiment a different object has been defined in the input images, surrounded by the background pixels. In the experiments presented in this paper the images with three brightness levels have been used. To verify the dynamic parameters of the processor given pixels in these objects are sit up to be the pulse signals oscillating between the levels of 15 nA and 35 nA with the frequency of 66.67 kHz. The objects are surrounded by the background pixels sit up to the constant current level of 25 nA. Such signals force switching over the circuit every 7.5 µs. The example time diagrams of chosen pixels are shown in Fig. 8 for the case of the closing operation. The object in this case is the letter "A", where some errors have been introduced, as illustrated in Fig. 9 (a). Pixels (4, 3) and (6, 7) are wrong. The first section in the processor, programmed to perform the dilatation operation, produces the image B shown in Fig. 9 (b). The first and the second panel in Fig. 8 illustrate calculation of the first and the third horizontal lines of this image. The third and the fourth panel in Fig. 8 illustrate operation of the second section in the processor programmed to perform the erosion operation. The third panel in Fig. 8 illustrates calculation of the first line of the image C, while the fourth panel illustrates calculation of the third line of the image C. The resulting output image C is shown in Fig. 9 (c). As we can see, the processor made correction of both wrong pixels. There is some period denoted in the first panel of Fig 8 as "tx:. During a short time several output signals have value of 15 nA although the surrounding background is on the level of 25 nA. When the pixels in the object in the previous period had values 35 nA, the appropriate EN signals were sit up in the way that enabled signals with the highest values to be connected to the processor's output. After switching over values of these signals became small, resulting in temporarily wrong output signals, as long as comparators switched over, setting up the new values of the EN signals.

Switching over the circuit starts in time moments denoted in Fig. 8 as "p1" and "p2". There is some time required to get the stable output signal after the convergence process. This time results, in general, from the reaction time of the comparators in particular layers of the given section in the processor. We see that the first section in processor requires about 2  $\mu$ s to produce the stable signal *B* (time tr1), while the signal *C* becomes stable after about 4  $\mu$ s (time tr2). This means that processor is able to calculate images with the rate equal to about 250 kSps (two sections), but if only one section is used the data rate is doubled. The power dissipation is about 20  $\mu$ W.

Other experiments are illustrated in Figs. 10-12. In the second experiment illustrated in Fig. 10 the opening operation has been tested. The first section in the processor has been programmed to perform the erosion operation, while the second section to perform the dilatation operation. The object in this case is the letter "L", where some pixels in the surrounding background are noise. The opening operation helped to remove this noise.

In the third experiment both blocks in the processor has been programmed to perform the dilatation operations, resulting in 3x3 filtration (double 2x2 dilatation). In the fourth experiment illustrated in Fig. 12, both blocks in the processor has been programmed to the erosion operation, resulting in 3x3 filtration.



Figure 8. Experimental simulation results of the closing operation. First image B is calculated using the dilatation operation, then image C is calculated using the erosion operation







Figure 10. The opening operation: first erosion then dilatation - removing of the noise from the image



Figure 11. Double 2x2 dilatation operation resulting in 3x3 dilatation



Figure 12. Double 2x2 erosion operation resulting in 3x3 erosion

#### **5. SUMMARY**

In this paper, a novel analog asynchronous MIN/MAX binary-tree circuits are presented. The main advantage of these circuits is that input currents do not propagate from one layer to the other like in typical solutions described in literature. In proposed solutions input currents are copied directly to each later in the tree as well as to the filter output using separate paths with low number of elements. The particular competitive nodes (MIMA2) compare two input currents, producing only the output logical signals that control connections between the circuit's inputs and given layers, closing or opening the configuration switches in particular paths. This feature enables building of even very large binary tree structures, where accuracy does not depend on the number of layers in the tree. The paths between the inputs and the given layer or the output always contain the same number of elements independently on number of layers in the tree.

The proposed circuit has been used to perform the nonlinear filtration. There are two nonlinear filters proposed in the paper. The first one using the circular delay line is used in filtering of 1-dimensional signals. This filter works with input current signal with average value equal to 100 nA. The sampling frequency is 1 MHz and power dissipation is 4  $\mu$ W. In the second case all inputs are independent image pixels, and the circuit is used to perform 2-dimensional filtration. The experimental image processor has been designed that enable processing of 8x8 input data. The processor can be easily rebuilt to enable processing of images with larger dimensions.

The circuit is characterized by very low power dissipation that can be additionally easily calibrated suitably to a required speed by changing the level of the input signals. The presented simulation results concern a circuit designed for a  $0.18 \mu m$  TMSC CMOS process and are in a good agreement with theoretical predictions.

## REFERENCES

- 1. K. Wawryn, B. Strzeszewski, "Current mode AB class WTA circuit", *The 8th IEEE International Conference on Electronics, Circuits and Systems ICECS*, 2001, Volume 1, pp.293 296, 2001.
- K. Wawryn, B. Strzeszewski, "Prototype low power WTA circuits for programmable neural networks", *IEEE International Symposium on Circuits and Systems, ISCAS 2000*, Geneva, Switzerland, Volume 5, pp. 753-756, 2000.
- 3. A. Demosthenous, S. Smedley, J. Taylor, "A CMOS analog Winner-Takes-All network for large-scale applications", *IEEE Transactions on circuits and systems-I: Fundamental theory and Applications*, Volume 45, No. 3, 1998.
- 4. A. S. Sedra, C. W. Roberts, and F. Gohh, "The current-conveyors: History, progress and new results," *Proc. Inst. Elect. Eng., pt. G*, Volume 137, No. 2, 1990.
- 5. M. Sasaki, T. Inoue, Y. Shirai, F. Ueno, "Fuzzy multiple-input maximum and minimum circuits in current mode and their analyzes using bounded-difference equations", *IEEE Transactions Comput.*, Volume 39, pp. 768–774, 1990.
- 6. G. T. Tuttle, S. Fallahi, A. Abidi, "An 8-b CMOS vector A/D converter", *Proc. IEEE International Solid-State Conference (ISSCC)*, pp. 38–39, San Francisco, CA, 1993.
- 7. Y.-C. Hung, B.-D. Liu, "High-reliability programmable WTA/LTA circuit of O/N complexity using a single comparator", IEE Proc.-Circuits Devices Syst., Vol.151, No. 6, December 2004
- 8. W.W. Moses, E. Beuville, M.H.Ho, A "Winner-Take-All" IC for determining the crystal of interaction in PET detectors, *IEEE Transactions on Nuclear Science*, Volume 43, Issue 3, Part 2, pp. 1615-1618, 1996
- 9. R. Długosz, T. Talaśka, R. Wojtyna., New Binary-Tree-Based Winner-Takes-All Circuit for Learning on Silicon Kohonen's Networks, *International Conference on Circuits and Electronic Systems (ICSES)*, Łódź, Poland 2006.