Justified granulation aided noninvasive liver fibrosis classification system

The method was built on a data set containing 33 different blood attributes
collected from 290 chronic viral hepatitis patients of the Dept. of Gastroenterology
and Hepatology of the Prof. Kornel Gibiński Central Clinical Hospital of the Silesian
Medical University in Katowice, as well as from 75 patients with other hepatitis etiologies.
Because of the high share of missing values (over 66 %) in some attributes, eight of them
were eliminated from the set, leaving 25 (see Table 1) for further processing. The eliminated attributes did not have values represented
for all stages of liver fibrosis.
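The missing-value criterion can be illustrated with a minimal sketch. It assumes that each patient is stored as a dictionary with `None` marking a missing result; the function names, the attribute names "A" and "B" and the toy values are hypothetical, and only the 66 % threshold is taken from the text. The second criterion (values represented for all fibrosis stages) is not modeled here.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def missing_fraction(records: List[Record], attribute: str) -> float:
    """Share of patients for whom `attribute` has no recorded value."""
    missing = sum(1 for r in records if r.get(attribute) is None)
    return missing / len(records)


def drop_sparse_attributes(records: List[Record], attributes: List[str],
                           max_missing: float = 0.66) -> List[str]:
    """Keep only attributes whose missing share does not exceed `max_missing`."""
    return [a for a in attributes if missing_fraction(records, a) <= max_missing]


# Toy data (hypothetical values): "B" is missing for 3 of 4 patients.
patients = [
    {"A": 4.1, "B": None},
    {"A": 3.9, "B": None},
    {"A": None, "B": 1.2},
    {"A": 4.4, "B": None},
]
kept = drop_sparse_attributes(patients, ["A", "B"])
```

In this toy example only attribute "A" survives, because "B" is missing for 75 % of the patients.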

Table 1. Data set characteristics

A set of patients’ blood attributes ($K$) has been determined as shown in Table 1. Based on previous research [10], the patient’s age has been included as an additional attribute, so the set $K$ contains 26 attributes. The variable $k = 1, \ldots, 26$ denotes the number of an attribute.

For every patient, the biopsy result has been collected as a reference value. Liver
biopsy examination was performed according to the METAVIR scoring system [31]. The fibrosis level was staged on a scale of 0–4 in steps of 1: F0 – no fibrosis, F1 –
portal fibrosis without septa, F2 – few septa, F3 – numerous septa without cirrhosis,
and F4 – cirrhosis.

Many authors point out that some biopsy fibrosis stages are difficult to diagnose,
even for experienced doctors [10]. For this reason, after medical consultations, a new classification has been introduced.
Instead of the F0 and F1 fibrosis stages, the low-level class $S_1$ ($n = 1$) was applied. Instead of F2 and F3, the medium-level class $S_2$ ($n = 2$) was used, while instead of F4 (METAVIR cirrhosis), the class $S_3$ ($n = 3$) was applied. This means that instead of the five-stage METAVIR scoring scale,
three fibrosis classes $S_n$, $n = 1, \ldots, 3$, are taken into consideration. The new classification
scores were introduced into the proposed medical support system.
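The regrouping of the five METAVIR stages into three classes amounts to a lookup table. The sketch below is only an illustration; the names `METAVIR_TO_CLASS` and `fibrosis_class` are hypothetical.

```python
# Mapping of METAVIR stages to the class index n of S_n, as described in the text.
METAVIR_TO_CLASS = {"F0": 1, "F1": 1, "F2": 2, "F3": 2, "F4": 3}


def fibrosis_class(metavir_stage: str) -> int:
    """Return n (1, 2 or 3) for a METAVIR stage label such as 'F2'."""
    return METAVIR_TO_CLASS[metavir_stage]


biopsy_results = ["F0", "F3", "F4", "F1"]
classes = [fibrosis_class(stage) for stage in biopsy_results]
```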

Blood and age data were grouped according to the biopsy result and assigned to the
sets $X_{k,n}$, where $k = 1, \ldots, 26$ is a given attribute and $n = 1, \ldots, 3$ denotes one of the new fibrosis classes $S_n$. The data are processed separately for every $k$-th attribute (blood or age), so every set $X_{k,n}$
represents the $n$-th fibrosis class of the $k$-th attribute. Each set $X_{k,n}$
includes up to $P$ values, where $P$ is the number of examined patients. For example, the set $X_{10,3}$
comprises the values $x_i$, $i = 1, \ldots, P'$, $P' \le P$, of the blood attribute $k = 10$ (ASPT) of all patients who were diagnosed with fibrosis class $n = 3$ ($S_3$). This means that theoretically $k \times n = 26 \times 3 = 78$ different $X_{k,n}$
sets can be created. An exemplary set $X_{k,n}$
is presented in Fig. 1.

Fig. 1. Illustration of the set $X_{k,n}$ for a given $k$ and $n$ in the value domain

Because of missing values in the patients’ data, the cardinality of the $X_{k,n}$
sets varies. Therefore, the granulation process focuses on the distribution of values
within a set and not on the cardinality itself.
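The grouping of patient data into per-attribute, per-class sets can be sketched as follows. The sketch assumes a row-per-patient layout with `None` for missing results; the function name `build_sets` and the toy values are hypothetical. Missing values are simply skipped, which is why the resulting sets differ in cardinality.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_sets(values: List[List[Optional[float]]],
               classes: List[int]) -> Dict[Tuple[int, int], List[float]]:
    """Group attribute values into sets indexed by (k, n).

    values[p][k - 1] is attribute k of patient p (None if missing);
    classes[p] is the fibrosis class n of patient p.
    """
    sets = defaultdict(list)
    for row, n in zip(values, classes):
        for k, x in enumerate(row, start=1):
            if x is not None:  # missing results do not enter any set
                sets[(k, n)].append(x)
    return dict(sets)


# Toy data (hypothetical): two attributes, three patients.
values = [[50.0, 4.5], [61.0, None], [55.0, 4.0]]
classes = [1, 3, 3]
sets = build_sets(values, classes)
```

Here the set for attribute 1 and class 3 holds two values, while the set for attribute 2 and class 3 holds only one because of the missing result.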

In the proposed method, medical data are processed to acquire useful information.
The processing is carried out in three functional blocks. The first
block uses the $X_{k,n}$ sets to create intervals based on the justified granulation paradigm. This transition
is shown in the left part of Fig. 2, where the clouds of black points are described as series of intervals. In the middle
block, the intervals are generalized to fuzzy sets using a fuzzification procedure. Finally,
an intuitive classification algorithm merges the obtained results using
a voting procedure, which is a common approach in advanced biometric systems [26]. The system’s functional diagram in Fig. 2 presents the information flow and the changes in the representation of the medical data.

Fig. 2. Data representation during the classification process: clouds of black points (elements
of the $X_{k,n}$ sets), series of intervals, and fuzzy sets. The intuitive classification algorithm
merges the obtained results using a voting procedure

Information mining using justified granulation method

As mentioned before, because of missing data the cardinality of the $X_{k,n}$
sets varies. Therefore, the granulation process focuses on the distribution of values
within a set and not on the cardinality itself. Direct analysis of the raw blood and age
attributes for fibrosis stage evaluation could be troublesome, so the justified
granularity paradigm [27] was adopted for this task. This data mining technique creates an interval granule
over a set $X_{k,n}$. To find an interval representation of an $X_{k,n}$
set, its left and right boundaries are determined using the family of information functions $V$, defined in Eq. 1:

(1)

where:

$X_{k,n}$ – the set containing the values of the $k$-th blood attribute of patients with the $n$-th
fibrosis class,

– median of the set $X_{k,n}$,

$\alpha$ – specificity parameter, $\alpha \in [0, \alpha_{max}]$.

The intuitive character of the $V$ function family is illustrated in Fig. 3.

Fig. 3. Illustration of the set $X_{k,n}$ for a given $\alpha$ in the value domain. The functions $V_l$
and $V_r$ assume maximal values in the proximity of local concentrations of groups of elements of
the set

If the elements of the set are uniformly distributed, the maximum of the $V$ function is directly affected by the value of $\alpha$. The functions $V_l$
and $V_r$ range from favoring the boundary values of a set for $\alpha = 0$ to favoring values close to the median for $\alpha = \alpha_{max}$. In practice, for $\alpha \in (0, \alpha_{max})$, the $V$ function family assumes maximal values in the proximity of local concentrations of groups
of elements of the set. These values can be treated as a characteristic representation
of the set for a given $\alpha$. The balance between the cardinality of the set and the concentration of values inside
it can be tuned using the $\alpha$ parameter. For a defined $\alpha$ value, the representation of the information granule $g_{k,n,\alpha} = G(X_{k,n}, \alpha)$ as an interval $[a_{1,k,n,\alpha}, a_{2,k,n,\alpha}]$ can be determined by finding the values of both the $v_l$
and $v_r$ functions according to Eq. 2.

(2)

Before going to the next stage, the value of the specificity parameter $\alpha$ should be normalized from $[0, \alpha_{max}]$ to $[0, 1]$. The normalization procedure is described thoroughly in [27], [32].

Eq. 2 provides a balance between the specificity of a granule and its size. An advantage
of the proposed method over other solutions is that only one parameter needs to be tuned. Moreover, the
$\alpha$ value influences the area in which the group of values is searched for. The interval representing
a set is found through its left and right boundaries according to Eq. 2. An example of the $V_l$
function family, with values normalized to $[0, 1]$, is shown in Fig. 4.

Fig. 4. Example of the normalized $V_l$ function for various $\alpha$ values. The maximal value of the $V_l$
function for a given $\alpha$ makes it possible to find a local concentration of elements within the $X_{k,n}$
set

The $v_l$ function values correspond to the maximal values of the $V_l$
function for a given $\alpha$ and thus make it possible to find a local concentration of elements within the $X_{k,n}$
set. The granulation algorithm ($G$), which finds an interval representation for a given $\alpha$ and $X_{k,n}$
set, is defined as follows:

1. Calculate the value $a_{1,k,n,\alpha} = v_l(X_{k,n}, \alpha)$.

2. Calculate the value $a_{2,k,n,\alpha} = v_r(X_{k,n}, \alpha)$.

3. Construct the information granule $g_{k,n,\alpha} = [a_{1,k,n,\alpha}, a_{2,k,n,\alpha}]$.
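The three steps can be sketched in code under an explicit assumption about the $V$ functions. The form used below, the number of elements between the median and a candidate boundary multiplied by an exponential penalty on the distance from the median, is the form commonly used for the principle of justifiable granularity and is only assumed here; it is not taken from Eq. 1. It does reproduce the behavior described in the text: for $\alpha = 0$ the boundaries of the set are favored, and for large $\alpha$ the values close to the median. The function name `granule` and the toy data are hypothetical, and $\alpha$ is used without the normalization to $[0, 1]$.

```python
import math
from statistics import median
from typing import List, Tuple


def granule(x: List[float], alpha: float) -> Tuple[float, float]:
    """Interval granule [a1, a2] over the values x for a given alpha.

    Assumed form of the V functions (not taken from Eq. 1):
    V(b) = (number of elements between the median and b) * exp(-alpha * |b - median|)
    """
    med = median(x)

    def v_left(a: float) -> float:
        covered = sum(1 for v in x if a <= v <= med)
        return covered * math.exp(-alpha * (med - a))

    def v_right(b: float) -> float:
        covered = sum(1 for v in x if med <= v <= b)
        return covered * math.exp(-alpha * (b - med))

    # Candidate boundaries are the observed values on each side of the median.
    a1 = max((v for v in x if v <= med), key=v_left)   # step 1
    a2 = max((v for v in x if v >= med), key=v_right)  # step 2
    return a1, a2                                      # step 3


values = [1.0, 4.0, 5.0, 6.0, 9.0]  # toy data, median 5.0
wide = granule(values, 0.0)
medium = granule(values, 0.5)
narrow = granule(values, 10.0)
```

For the toy data, the granule shrinks from the full range of the set at $\alpha = 0$ toward the median as $\alpha$ grows.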

The granulation algorithm processes the elements of a set to find its representation.
Therefore, performing $z \in \mathbb{N}$ $\alpha$-cuts equally distributed within the range $[0, 1]$ makes it possible to find characteristic
values for each set. To find the pattern across all ranges of the specificity level, as
illustrated in Fig. 4, a series of $z$ interval granules is built for $\alpha_i$, $i \in \{0, \ldots, z - 1\}$, where the values of $\alpha_i$
are equally distributed within $[0, 1]$:

$$\alpha_i = \frac{i}{z - 1}, \quad i \in \{0, \ldots, z - 1\} \qquad (3)$$

The result for the parameter $z = 3$ is a sequence of three granules generated for $\alpha_i \in \{0, 0.5, 1\}$. An example for the blood RBC attribute, in comparison with the histogram,
is presented in Fig. 5 for each class.
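The equally distributed cut levels can be generated as in the following minimal sketch. It assumes $z \ge 2$, so that both ends of $[0, 1]$ are included; the function name `alpha_cuts` is hypothetical.

```python
from typing import List


def alpha_cuts(z: int) -> List[float]:
    """Return z values equally distributed within [0, 1], both ends included."""
    if z < 2:
        raise ValueError("at least two cuts are needed to span [0, 1]")
    return [i / (z - 1) for i in range(z)]


cuts = alpha_cuts(3)
```

For $z = 3$ this reproduces the levels 0, 0.5 and 1 used in the example above.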

Fig. 5. The RBC attribute interval representation for class 0, 1 and 2 respectively: (a) histogram, (b) generated interval granules for $z = 3$

The proposed approach does not favor the class with the higher number of samples. Moreover,
the obtained intervals illustrate the changes within a set. The intervals are crisp and do
not take into consideration values lying in close proximity to their boundaries.
Thus, fuzzification is proposed in the next step to account for this feature.

Interval fuzzification procedure

The process of changing the interval representation of an information granule into a fuzzy set
adds an uncertainty level. A fuzzy set granule is built from two interval information granules constructed
over the set $X_{k,n}$ ($g_{k,n,\alpha_0}$ and $g_{k,n,\alpha_j}$, where $j \in \{0, \ldots, z - 1\}$). In the proposed method, $\alpha_0$
is a constant equal to 0 and represents all values of the $X_{k,n}$
set. The granulation function, which constructs the fuzzy granule and its membership function, was defined as follows:

(4)

where:

$k$ – number of an attribute,

$n$ – liver fibrosis class,

$d$ – generalization parameter, $d \in [0, 1]$,

$\alpha_j$ – $j$-th $\alpha$-cut evaluated using Eq. 3, $j \in \{0, \ldots, z - 1\}$,

$\alpha_0$ – constant equal to 0 in the equation,

$z$ – number of cuts,

inf/sup – lower/upper boundary of an interval ($g$ granule).

The proposed granule describes the $n$-th
class of liver fibrosis for a given $k$-th
attribute and $j$-th $\alpha$-cut. Its boundary values are represented by the constant $\alpha_0 = 0$, therefore $b_1$
and $b_4$ always assume the minimal and maximal value of the $X_{k,n}$
set, respectively. The trapezoidal fuzzy membership function $\mu(x)$ was selected as an intuitive fuzzy representation of two intervals. Moreover, this
set representation simplifies the calculations and has been successfully applied in many
medical works [28], [29]. Finally, initial experiments with other fuzzy representations (e.g. Gaussian,
triangular, bell-shaped) showed no impact on model accuracy. The generalization parameter ($d$) introduced in Eq. 4 and illustrated in Fig. 6 makes it possible to take into consideration values that lie in close proximity
to a granule but are not part of it.
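A generic trapezoidal membership function built from two nested intervals can be sketched as below. The sketch assumes that the outer interval $[b_1, b_4]$ is the granule for $\alpha_0$ and that the inner interval $[b_2, b_3]$ comes from the granule for $\alpha_j$; the exact role of the generalization parameter $d$ in Eq. 4 is not modeled, and the function name and toy values are hypothetical.

```python
def trapezoid(x: float, b1: float, b2: float, b3: float, b4: float) -> float:
    """Trapezoidal membership with support [b1, b4] and core [b2, b3].

    Assumes b1 <= b2 <= b3 <= b4. Degenerate slopes (b1 == b2 or b3 == b4)
    are handled by the core check, so no division by zero can occur.
    """
    if x < b1 or x > b4:
        return 0.0
    if b2 <= x <= b3:
        return 1.0
    if x < b2:
        return (x - b1) / (b2 - b1)  # rising edge
    return (b4 - x) / (b4 - b3)      # falling edge


# Toy example: outer granule [1, 9] and inner granule [4, 6].
inside = trapezoid(5.0, 1.0, 4.0, 6.0, 9.0)
rising = trapezoid(2.5, 1.0, 4.0, 6.0, 9.0)
falling = trapezoid(7.5, 1.0, 4.0, 6.0, 9.0)
outside = trapezoid(0.0, 1.0, 4.0, 6.0, 9.0)
```

Values inside the inner granule receive full membership, values between the two granules receive partial membership, and values outside the outer granule receive none.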

Fig. 6. Membership function for a given $\alpha_j$ and changing $d$ parameter: (a) example, (b) real estimated data for the Age attribute

Figure 6b shows the changes within the age attribute. The granules created for $\alpha = 0$ overlap
significantly between the classes $n = 1, \ldots, 3$, so they provide little useful information for classification purposes.
Moreover, the medians calculated for the 2nd and 3rd class ($n$) lie almost in the same place, around 55 years. Nevertheless, the shapes of
the fuzzy sets for $\alpha \in (0, 1)$ differ significantly and can therefore be used as information to find the
correct fibrosis class. The fuzzy sets for $\alpha_j = 0.5$ and $\alpha_j = 1.0$ (Fig. 6b) provide more information, as the sets overlap less. To improve
classification accuracy, the attributes whose sets overlap significantly [33] are removed from further classification. Only the selected set of attributes $K' \subset K$ with the smallest overlap of fuzzy intervals between all fibrosis classes is
processed further.

Classification process

The created granular model is used to evaluate the classes of liver fibrosis for test
patients. The data of a given patient are compared with the model using the membership
function evaluated at $y_k$, where $y_k$
is the patient’s medical examination result for a given $k$-th
attribute. The membership function indicates whether or not the patient
has the $n$-th
liver fibrosis class according to the $k$-th
attribute. The aggregation (averaging) over all attributes ($k$) is performed separately for each $\alpha_j$, $j = 0, \ldots, z - 1$, and each $n$-th
class. As a result, the average value for each class is evaluated. Finally,
the voting is performed based on the values obtained for each of the $z$ $\alpha$-cuts. The voting is motivated by the physician’s analysis scheme, in which
similarities of symptoms are analyzed. In this case, the classification procedure
weighs how many fuzzy granules favor a given class. The aggregation and classification
process is illustrated in Fig. 7 for two attributes: age ($k = 1$) and RBC ($k = 3$).

Fig. 7. Classification example illustrated on two attributes: Age (a) and RBC (b). The attribute values of the classified patient are compared against the fuzzy representation
of the classes. Then, using averaging (c) and weighted voting (d), the patient’s fibrosis class is found

The black line in Fig. 7 marks the attribute values of the classified patient. If an attribute value is missing
for the patient, that attribute is not taken into consideration. Using only
the results for $\alpha_j = 0$, the classification is inconclusive (equal value of the membership function for all
three classes). In the second and third columns ($\alpha_j = 0.5$ and $\alpha_j = 1$), the age attribute favors class $n = 1$ (marked by the blue curve), while the RBC attribute (Fig. 7b) favors
class $n = 3$ (marked by the green curve). To find the correct diagnosis, the mean
value of the membership function is calculated for each $\alpha_j$
(Fig. 7c). As presented in the example, the average values for $\alpha_j = 0$ carry no information, whereas the average values for $\alpha_j$
equal to 0.5 and 1 favor class $n = 3$. Finally, the weighted voting is performed between
the results acquired for $\alpha_j \in \{0, 0.5, 1\}$. The class $n = 3$, which has the highest value ($w$), is selected.

Formally, the patient’s classification is based on the patient’s medical data set $Y = \{y_k, k \in K\}$, where $y_k$
is the value of the $k$-th
attribute. The weight of the $n$-th
fibrosis class can be calculated using the following equation:

(5)

where: $n$ is the number of the fibrosis class, $K'$ is the set of selected attributes, card is the cardinality of a set, and $z$ and $d$ are the method parameters.

Finally, the class for which the $w$ function returns the maximal value is treated as the patient’s fibrosis class:

(6)

where:

$Y$ – the set of laboratory blood test results of the patient,

$n$ – the liver fibrosis class, $n = 1, \ldots, 3$,

$z$, $d$, $K'$ – parameters of the proposed method.

The quality of recognition depends on the differences between the values of the $w_{z,d,K'}$
function. A significant difference between the weights calculated for the various classes
indicates that the quality of the diagnosis is high.
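The averaging and voting scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the formula of Eq. 5: the class weight is taken here as the sum over the cuts of the mean membership over the available attributes, and the trapezoids are given directly as hypothetical tuples $(b_1, b_2, b_3, b_4)$ indexed by attribute, class and cut. All names and numbers are illustrative only.

```python
from typing import Dict, Optional, Tuple

Trapezoid = Tuple[float, float, float, float]


def trapezoid(x: float, b1: float, b2: float, b3: float, b4: float) -> float:
    """Trapezoidal membership with support [b1, b4] and core [b2, b3]."""
    if x < b1 or x > b4:
        return 0.0
    if b2 <= x <= b3:
        return 1.0
    if x < b2:
        return (x - b1) / (b2 - b1)
    return (b4 - x) / (b4 - b3)


def class_weights(patient: Dict[int, Optional[float]],
                  model: Dict[Tuple[int, int, int], Trapezoid],
                  classes=(1, 2, 3), z: int = 2) -> Dict[int, float]:
    """Assumed weight: sum over cuts j of the mean membership over attributes k."""
    weights = {}
    for n in classes:
        total = 0.0
        for j in range(z):
            scores = [trapezoid(y, *model[(k, n, j)])
                      for k, y in patient.items()
                      if y is not None and (k, n, j) in model]  # skip missing results
            if scores:
                total += sum(scores) / len(scores)
        weights[n] = total
    return weights


# Hypothetical model: attribute 1 (age) and attribute 3 (RBC), two cuts.
model: Dict[Tuple[int, int, int], Trapezoid] = {}
for n in (1, 2, 3):
    model[(1, n, 0)] = (20, 20, 85, 85)      # cut 0: identical for all classes
    model[(3, n, 0)] = (2.5, 2.5, 5.5, 5.5)
model[(1, 1, 1)] = (20, 30, 40, 60)
model[(1, 2, 1)] = (30, 50, 60, 80)
model[(1, 3, 1)] = (40, 55, 65, 85)
model[(3, 1, 1)] = (3.0, 4.5, 5.0, 5.5)
model[(3, 2, 1)] = (3.0, 4.0, 4.5, 5.5)
model[(3, 3, 1)] = (2.5, 3.0, 3.5, 5.0)

patient = {1: 58.0, 2: None, 3: 3.2}  # attribute 2 is missing
weights = class_weights(patient, model)
best = max(weights, key=weights.get)
ranked = sorted(weights.values(), reverse=True)
margin = ranked[0] - ranked[1]  # larger margin suggests a more reliable decision
```

In this toy example the first cut is inconclusive, since all classes receive the same membership, and the second cut decides in favor of class 3. The margin between the two highest weights plays the role of the quality indicator described above.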