A comprehensive review on privacy preserving data mining

The relevance of privacy-preserving data mining techniques is thoroughly analyzed and discussed by Matwin (2013). Specific methods are shown to be capable of preventing the discriminatory use of data mining, some of which ensure that a stigmatized group is not targeted through data generalization more than the general population. Vatsalan et al. (2013) reviewed the technique called 'Privacy-Preserving Record Linkage' (PPRL), which allows organizations to link their databases while protecting privacy; a taxonomy of PPRL methods is proposed to analyse them along 15 dimensions. Qi and Zong (2012) overviewed the available data mining techniques for privacy protection according to data distribution, data distortion, mining algorithms, and data or rule hiding. Regarding data distribution, only a few algorithms are currently used for privacy-protecting data mining on centralized and distributed data. Raju et al. (2009) acknowledged the need for homomorphic encryption based protocols for secure addition and multiplication, combined with the existing digital envelope technique, to achieve collaborative data mining while keeping the private data of the participating parties intact. The proposed technique exhibited considerable influence on different applications.

Malina and Hajny (2013) and Sachan et al. (2013) analysed the current privacy-preserving solutions for cloud services and outlined a solution based on advanced cryptographic components. The solution offers anonymous access, unlinkability and confidentiality of transmitted data. Finally, the solution is implemented, experimental results are obtained and its performance is compared. Mukkamala and Ashok (2011) compared a set of fuzzy-based mapping methods with respect to their privacy-preserving characteristics and their capability to maintain the same relationships with other fields. The comparison covers: (1) four modifications of the fuzzy membership function definition, (2) seven ways of combining the membership values of a particular data item into a single released value, (3) the use of several similarity metrics to compare the original and the mapped data, and (4) an evaluation of the influence of the mapping on the derived association rules.
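As a concrete illustration of such fuzzy-based mapping, the following minimal Python sketch maps a numeric attribute through triangular membership functions and merges the grades into a single released value. The membership sets and the averaging merge rule are illustrative assumptions, not the exact functions compared by Mukkamala and Ashok.

```python
# Minimal sketch of fuzzy-based data mapping for privacy. The fuzzy sets and the
# merge rule are assumptions for illustration, not the authors' exact scheme.

def tri(x, a, b, c):
    """Triangular fuzzy membership of x in the fuzzy set (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_map(value, fuzzy_sets):
    """Replace a raw value by a single number derived from its membership grades."""
    grades = [tri(value, *s) for s in fuzzy_sets]
    return sum(grades) / len(grades)   # one of several possible merge rules

AGE_SETS = [(0, 20, 40), (25, 45, 65), (50, 70, 90)]  # young / middle / old
original = [23, 37, 58, 71]
mapped = [round(fuzzy_map(v, AGE_SETS), 3) for v in original]
print(mapped)  # these values are released instead of the raw ages
```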

Data distortion dependent PPDM

Kamakshi (2012) proposed a novel idea to dynamically identify the sensitive attributes in PPDM. Identification of these attributes depends on a threshold on the sensitivity of each attribute. The data owner modifies the values of the identified sensitive attributes using a swapping technique to protect the privacy of sensitive information. The data are modified in such a manner that the original properties of the data remain unchanged. Despite its novelty, the approach remains time expensive. Subsequently, Zhang et al. (2012a) introduced an enhanced historical-probability-based noise generation strategy called HPNGS. The simulation results confirmed that HPNGS can reduce the number of required noise injections by as much as 90 % compared with its random counterpart. Later, they focused on privacy protection and noise obfuscation in cloud computing (Zhang et al. 2012b) and developed a novel association-probability-based noise generation strategy (APNGS). The analysis confirmed that APNGS significantly improves privacy protection through noise obfuscation involving association probabilities, at a reasonable extra cost compared with standard representative strategies.
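A minimal sketch of the swapping style of data distortion described above: the sensitive column is randomly permuted, which breaks the record-level linkage while keeping the column's overall distribution (and hence many mining properties) unchanged. The toy schema is an assumption.

```python
# Minimal sketch of distortion by attribute-value swapping: a random permutation
# of the sensitive column preserves its histogram but unlinks it from identifiers.
import random

records = [
    {"zip": "60601", "salary": 52000},
    {"zip": "60602", "salary": 71000},
    {"zip": "60603", "salary": 48000},
    {"zip": "60604", "salary": 95000},
]

random.seed(7)
salaries = [r["salary"] for r in records]
random.shuffle(salaries)               # column distribution is unchanged
for r, s in zip(records, salaries):
    r["salary"] = s                    # link between zip and salary is broken

print(records)
```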

Li et al. (2009a) presented a low-cost and low-risk anonymous perturbation technique based on homomorphic encryption and anonymous exchange. The technique displayed robustness for optimized parameters, but it is complex and loses some data utility. Kamakshi and Babu (2010) introduced a model with three parties, namely clients, data centres, and the database at every site. The data centre is completely passive, so the roles of the clients and the site databases appear interchangeable. Islam and Brankovic (2011) proposed an architecture involving several novel techniques that affect all the attributes in the database. Experimental findings showed that the architecture is very efficient in preserving the original patterns in a perturbed dataset. Wang and Lee (2008) introduced a technique to prevent forward-inference attacks, in which the original data can be inferred from the sanitized data produced by the sanitization process.

Association rule based PPDM

An improved distortion technique for privacy-preserving frequent item-set mining is proposed by Shrivastava et al. (2011), where two probability parameters (fp and nfp) are employed. By tuning these two parameters, better accuracy is achieved at the price of a minor reduction in privacy. Furthermore, the algorithm produces optimal results when the fraction of frequent items among all available items is small. PPDM is used in various fields for its enhanced efficiency and security, but it presently faces a rule mining challenge. Vijayarani et al. (2010a) explained the techniques of the statistical disclosure control community, the database community, and the cryptography community; preserving data utility, however, comes at a high cost. Aggarwal and Yu (2008) emphasized the two significant factors of association rule mining, namely support and confidence. For an association rule X ⇒ Y, the support is the percentage of transactions in the dataset that contain X ∪ Y. The confidence (also called strength) of X ⇒ Y is the ratio of the number of transactions containing X ∪ Y to the number of transactions containing X. Furthermore, Belwal et al. (2013) reduced the support and confidence of sensitive rules without directly modifying the given database. The alteration is instead performed indirectly through newly introduced parameters associated with database transactions and association rules: modified support (M-support), modified confidence (M-confidence) and a hiding counter. The algorithm builds on the definitions of support and confidence and thus hides the required sensitive association rules without side effects. However, it can hide only rules with a single sensitive item on the LHS.
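The following short Python example works through these two definitions on a toy transaction set; the rule {bread} ⇒ {butter} and the transactions are illustrative.

```python
# Worked example of support and confidence for a rule X => Y,
# following the definitions above (X = {bread}, Y = {butter}).

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

X, Y = {"bread"}, {"butter"}
n_xy = sum(1 for t in transactions if X | Y <= t)   # transactions containing X U Y
n_x = sum(1 for t in transactions if X <= t)        # transactions containing X

support = n_xy / len(transactions)   # 2/4 = 0.5
confidence = n_xy / n_x              # 2/3 = 0.666...
print(support, confidence)
```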

Jain et al. (2011) developed a new algorithm that increases or decreases the support of the LHS and RHS items of a rule to hide or secure association rules. The algorithm is advantageous in that it makes minimal modifications to the data entries to hide a set of rules, with less CPU time than previous work, but it is limited to association rules only. Naeem et al. (2010) proposed an architecture that screens restricted association rules with complete removal of the known side effects, such as the generation of unwanted, non-genuine association rules, while yielding no hiding failures. In this architecture, standard statistical measures, in particular a weighing procedure based on central tendency, are used instead of the conventional support-confidence framework to create association rules. Li and Liu (2009) introduced a privacy-preserving association rule mining algorithm known as DDIL, based on query restriction and data perturbation. The original data can be hidden or perturbed using DDIL to improve privacy efficiently, and this is an effective technique for generating frequent items from transformed data. Experimental results showed that the technique achieves an acceptable privacy balance with a suitable selection of random parameters.
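A minimal sketch of the support/confidence-reduction idea common to such rule-hiding algorithms: delete the consequent from supporting transactions until the sensitive rule's confidence drops below the mining threshold. The naive victim selection shown here is illustrative, not the heuristics of the cited papers.

```python
# Minimal sketch of hiding a sensitive rule X => Y by removing the consequent Y
# from supporting transactions until confidence(X => Y) < min_conf.

def confidence(transactions, X, Y):
    n_x = sum(1 for t in transactions if X <= t)
    n_xy = sum(1 for t in transactions if X | Y <= t)
    return n_xy / n_x if n_x else 0.0

def hide_rule(transactions, X, Y, min_conf):
    for t in transactions:
        if confidence(transactions, X, Y) < min_conf:
            break                      # rule is hidden, stop distorting
        if X | Y <= t:
            t.difference_update(Y)     # naive victim choice: first supporting transaction
    return transactions

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "d"}]
print(hide_rule(db, X={"a"}, Y={"b"}, min_conf=0.5))
```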

Hide association rule based PPDM

The Fast Hiding Sensitive Association Rules (FHSAR) algorithm was introduced by Weng et al. (2008). It secures sensitive association rules (SAR) with fewer side effects, and a strategy is established to avoid hiding failures. Besides, two heuristic techniques are developed to improve the efficiency of the system. The heuristic function is used to assign a prior weight to each transaction so that the order in which transactions are modified can be decided efficiently. The connection between the sensitive association rules and each transaction in the original database is then analysed to choose the most suitable item for modification. Efficient sanitization of sensitive information in updated databases still needs to be studied. Dehkordi et al. (2009) presented a new multi-objective technique to hide sensitive association rules and enhance database security, while maintaining the utility and certainty of the mined rules at an acceptable level. The algorithm is based on the genetic algorithm (GA) concept, whereby both the privacy and the accuracy of the dataset are enhanced. Gkoulalas-Divanis and Verykios (2009) developed an exact border-based technique to obtain an optimal solution for hiding sensitive frequent item sets with a minimum extension of the original database by a synthetically generated database extension. This is accomplished as follows: (1) by formulating the generation of the database extension as a constraint satisfaction problem, (2) by mapping the constraint satisfaction problem to an equivalent binary integer programming problem, (3) by manipulating underutilized synthetic transactions to increase the support of non-sensitive item sets, (4) by minimally relaxing the constraint satisfaction problem to offer an approximate solution close to the optimal one when an ideal solution does not exist, and (5) by partitioning the universe of items to enhance the efficiency of the hiding algorithm.

Li et al. (2009b) proposed a new, item-set-oriented algorithm to sanitize a transactional database: the support of large item-sets is reduced considerably below a client-defined threshold, so that no rules can be derived from those item-sets. A new technique is also introduced to select the items that require removal from the dataset to avoid the detection of a set of rules. The main limitation concerns selecting victim items without affecting the non-sensitive patterns once the sanitization of the third and fourth sensitive transactions is defined. Kasthuri and Meyyappan (2013) presented a new technique to identify the sensitive items for hiding the susceptible association rules. The technique locates the frequent item sets and produces the association rules, and the concept of representative association rules is employed to detect the sensitive items. Hiding the sensitive association rules using the selected sensitive items is worth investigating. Quoc et al. (2013) developed a heuristic algorithm based on the intersection lattice of frequent item-sets to secure a set of sensitive association rules using a distortion method. To reduce side effects, the heuristic for confidence and support reduction based on intersection lattice (HCSRIL) algorithm is used; it specifies the victim item and the minimal set of transactions to modify so as to cause the least impact on the item-sets in Gen(FI). In addition, Domadiya and Rao (2013) introduced a heuristic algorithm called Modified Decrease Support of RHS item of Rule Clusters (MDSRRC) to secure sensitive association rules with multiple items in the consequent (RHS) and antecedent (LHS). The algorithm successfully addresses the drawbacks of the existing DSRRC rule-hiding algorithm. Experimental findings revealed the efficiency and capability of the algorithm in maintaining database quality, and its efficiency can be further enhanced, with reduced side effects, by minimizing the modifications to the database.

Classification based PPDM

Xiong et al. (2006) proposed a nearest-neighbour classification method based on secure multiparty computation (SMC) techniques to resolve the privacy challenges in a few stages, including privacy-preserving nearest-neighbour selection and privacy-preserving classification. The algorithm is balanced in terms of accuracy, performance, and privacy protection, and it is adaptable to various settings to fulfil different optimization conditions. Singh et al. (2010) provided a simple and efficient privacy-preserving classification scheme for cloud data. The Jaccard similarity measure is used to compute the nearest neighbours for K-NN classification, and an equality test is introduced to compute the similarity between two encrypted records. This approach facilitates secure local neighbour computation at each node in the cloud and classifies unseen records via a weighted K-NN classification scheme. Attention should be paid to strengthening the robustness of the approach so that it generalizes to other data mining tasks where security and privacy are needed.
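The following minimal sketch shows the plaintext core of such a scheme: Jaccard similarity over set-valued records and a similarity-weighted K-NN vote. The encryption layer and the equality test are omitted; the records and labels are illustrative.

```python
# Minimal sketch of weighted K-NN with Jaccard similarity (plaintext stand-in for
# the encrypted computation described above).
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def weighted_knn(train, query, k=3):
    nearest = sorted(train, key=lambda rl: jaccard(rl[0], query), reverse=True)[:k]
    votes = defaultdict(float)
    for record, label in nearest:
        votes[label] += jaccard(record, query)   # similarity-weighted vote
    return max(votes, key=votes.get)

train = [({"f1", "f2"}, "A"), ({"f1", "f3"}, "A"), ({"f4", "f5"}, "B"), ({"f4"}, "B")]
print(weighted_knn(train, {"f1", "f2", "f4"}))   # -> "A"
```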

Baotou (2010) introduced an efficient algorithm based on a random perturbation matrix to protect privacy in classification mining. It is applied to discrete data of character, Boolean, classification and numeric types. The experiments revealed significantly enhanced privacy protection and mining accuracy, with a greatly simplified computation process but at higher cost. Vaidya et al. (2008) developed an approach for mining vertically partitioned data. The technique can be adapted and extended to a variety of data mining applications such as decision trees, although more efficient solutions are needed to find a tight upper bound on the complexity. Kantarcıoğlu and Vaidya (2003) emphasized the use of secure logarithm and summation protocols, with which a distributed naive Bayes classifier is securely computed. The experimental results strongly supported the idea that a few useful secure protocols facilitate the secure deployment of many types of distributed data mining algorithms. The classification of privacy-preserving methods and the standard algorithms in each class are reviewed by Sathiyapriya and Sadasivam (2013), where the merits and limitations of the different methods are exemplified. Optimal sanitization is found to be NP-hard under the privacy-accuracy trade-off.
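A minimal sketch of the classic ring-based secure summation primitive underlying such protocols: the initiator adds a random mask, each site adds its private value to the masked running total, and the initiator removes the mask at the end, so no site sees another's raw count. This is the textbook primitive, not Kantarcıoğlu and Vaidya's full protocol.

```python
# Minimal sketch of ring-based secure summation of private local counts
# (e.g., the class counts needed by a distributed naive Bayes classifier).
import random

def secure_sum(local_values, modulus=10**9):
    offset = random.randrange(modulus)   # initiator's random mask
    running = offset
    for v in local_values:               # masked total travels site to site
        running = (running + v) % modulus
    return (running - offset) % modulus  # initiator removes the mask

sites = [120, 45, 210]                   # each site's private count
print(secure_sum(sites))                 # 375, with no site seeing another's count
```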

Clustering based PPDM

Yi and Zhang (2013) overviewed various earlier solutions for preserving the privacy of distributed k-means clustering and provided a formal definition of an equally contributed multiparty protocol. An equally contributed multiparty k-means clustering protocol is applied to vertically partitioned data, wherein each data site contributes to the k-means clustering evenly. The basic idea is that, in each step of clustering, the data sites collaborate to encrypt k values (each associated with a distance between a centre and a point) with a common public key, then securely compare the k values and output the index of the minimum without revealing the intermediate values. In some settings this is practical and more efficient than the Vaidya-Clifton protocol (Vaidya et al. 2008).
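The sketch below illustrates, in the clear, the vertically partitioned step these protocols protect: each site computes partial squared distances over its own attribute subset, and only the partials are combined to pick the closest centre. In the actual protocols the partials are encrypted and the minimum index is found by secure comparison; the two-site split shown is an assumption.

```python
# Minimal sketch of the vertical-partition idea behind privacy-preserving k-means:
# sites combine partial squared distances without pooling their raw attributes.

def partial_sq_dist(point_part, center_part):
    return sum((p - c) ** 2 for p, c in zip(point_part, center_part))

# Two sites hold disjoint attribute subsets of the same record.
site1_point, site2_point = (1.0, 2.0), (3.0,)
centers_site1 = [(0.0, 0.0), (2.0, 2.0)]   # site 1's attributes of both centres
centers_site2 = [(0.0,), (4.0,)]           # site 2's attributes of both centres

totals = [
    partial_sq_dist(site1_point, c1) + partial_sq_dist(site2_point, c2)
    for c1, c2 in zip(centers_site1, centers_site2)
]
print(totals.index(min(totals)))   # index of the closest centre
```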

Associative classification based PPDM

An associative classification model for vertically partitioned datasets is introduced by Raghuram and Gyani (2012). A scalar-product-based third-party privacy-preserving model is adopted to preserve privacy during the data sharing process between multiple users. The accuracy of the presented method is validated on UCI databases with encouraging results. Lin and Lo (2013) presented a set of algorithms comprising Equal Working Set (EWS), Small Size Working Set (SSWS), Request on Demand (ROD) and Progressive Size Working Set (PSWS). This repeated mining offers a scalable, fast and reliable service for different tasks in distributed computing environments, and the algorithms demonstrated outstanding efficiency in terms of scalability and execution time under different simulation conditions. Although CARM is a fast and scalable distributed algorithm compared with previous studies, its scalability is still limited. This is because the HD-Mine used in CARM builds the FP-tree in the main memory of the trusted node. When there is no memory space left to mine a conditional FP-tree in the trusted node, the reconstructed conditional FP-tree is distributed to an available computing node for mining. The trusted node must provide sufficient memory space for the original FP-tree, so scalability is clearly restricted by the main memory size of the trusted node.

Harnsamut and Natwichai (2008) developed a novel heuristic algorithm based on the Classification Correction Rate (CCR) of a given database to secure privacy and sustain data quality. The algorithm was tested and the experimental results validated; the heuristic was found to be highly effective and efficient. Seisungsittisunti and Natwichai (2011) highlighted the issues of transforming data to protect privacy in data mining and associative classification in an incremental-data scenario. An incremental polynomial-time algorithm is proposed to transform the data to maintain the privacy standard called k-anonymity, while quality is maintained under transformation when an associative classification model is built. Several experiments were performed to evaluate the performance of the developed algorithm against a non-incremental algorithm, and it proved more efficient in every problem setting. Examining data stored in distributed systems, rather than in a single repository, is worth pursuing.

Privacy preserving outsourced data mining

Giannotti et al. (2013) explained the issues involved in outsourcing the association rule mining task within a corporate privacy-preserving network. An attack model based on background knowledge is developed for privacy-preserving outsourced mining, and an encryption scheme known as RobFrugal is proposed. The scheme is based on 1-1 substitution ciphers of items and adds fake transactions so that each cipher item has the same frequency as at least k − 1 others. A compact synopsis of the fake transactions is kept, from which the true support of patterns mined by the server can be recovered efficiently. The scheme is shown to be robust against an adversarial attack based on the actual items and their exact supports; the framework assumes that the attacker is unaware of such information, and any relaxation of this assumption may break the encryption scheme and introduce privacy vulnerabilities. The authors investigated encryption schemes that could resist such vulnerabilities, and they also explored strategies for improving the RobFrugal algorithm to minimize the number of spurious patterns.
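A minimal sketch of the frequency-flattening step at the heart of this idea: after a 1-1 item substitution, cipher items are grouped by descending support in groups of k, and fake occurrences are computed so that every item in a group reaches the group maximum. This follows the published description only at sketch level; the toy database is an assumption.

```python
# Minimal sketch of RobFrugal-style frequency flattening: compute, per cipher item,
# how many fake occurrences make its support equal to k-1 other items' supports.
from collections import Counter

def fakes_needed(freq, k):
    items = sorted(freq, key=freq.get, reverse=True)
    fakes = {}
    for g in range(0, len(items), k):        # groups of k by descending support
        group = items[g:g + k]
        top = freq[group[0]]                  # group maximum
        for item in group:
            fakes[item] = top - freq[item]    # pad each item up to the maximum
    return fakes

db = [["e1", "e2"], ["e1"], ["e1", "e3"], ["e2", "e3"], ["e1", "e2"]]
freq = Counter(item for t in db for item in t)   # e1:4, e2:3, e3:2
print(fakes_needed(freq, k=3))                   # {'e1': 0, 'e2': 1, 'e3': 2}
```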

Worku et al. (2014) enhanced the efficiency of the above scheme by reducing computationally intensive operations such as bilinear mapping. The scheme proved secure and efficient after a detailed security and performance analysis. However, data block insertion makes the scheme non-dynamic, so the development of a fully dynamic and secure public auditing scheme remains an open challenge for cloud systems. Arunadevi and Anuradha (2014) investigated the issues of outsourcing frequent item-set mining within a corporate privacy-preserving architecture. An attack model is introduced that assumes the attackers are fully aware of the items and their supports, and even of the details of the encryption algorithm and some pairs of items with their corresponding cipher values. These assumptions remarkably improved the security of the system, eliminated item- and item-set-based attacks, and reduced the processing time.

Lai et al. (2014) proposed the first semantically secure solution for outsourcing association rule mining with data privacy, mining privacy and soundness. The solution is non-deterministic and secure against an adversary at the cloud servers that is capable of adaptively obtaining plaintext-ciphertext pairs, as required by semantic security; the adversary may also insert false data into the data mining results. In comparison, the adversary models used in previous work on outsourcing association rule mining assumed an honest-but-curious adversary/server that cannot obtain any plaintext-ciphertext pairs in attacks; consequently, the substitution-mapping-based solutions are neither semantically secure nor guarantee the soundness of the data mining results. Kerschbaum and Julien (2008) presented a searchable encryption scheme for outsourced data analysis. In this scheme the client encrypts the data only once and transmits the encrypted information to the data analyst, who conducts a number of queries and requests permission from the client to translate the data contents in the queries. The encryption scheme permits keyword search and range queries, and it also allows queries to reuse the output of earlier queries as tokens for dependent queries without interaction. The scheme is found to be secure. Many questions remain open in the area of searchable encryption; for outsourced data analytics, it is most interesting to combine the efficiency improvements possible for range queries with the necessary security requirements via pairing-based cryptography.

Distributed method based PPDM

Ying-hua et al. (2011) surveyed Distributed Privacy-Preserving Data Mining (DPPDM) according to its underlying technologies. Existing techniques are categorized into three groups: (1) secure multi-party computation, (2) perturbation and (3) restricted query. Li (2013) elucidated the advantages and drawbacks of each method by developing and analyzing a symmetric-key-based privacy-preserving scheme that supports count mining, and proposed an incentive consideration for the study of secure computation by presenting a reputation system for wireless networks. The proposed system gives misbehaving nodes an incentive to behave properly, and experimental results revealed its effectiveness in detecting misbehaving nodes and enhancing the average throughput of the whole network. Furthermore, Dev et al. (2012) acknowledged the privacy risks of data mining on cloud systems and presented a distributed framework to remove such risks. The approach involves classification, disintegration, and distribution: it obstructs mining-based attacks while preserving privacy by splitting the data into chunks and storing them with suitable cloud providers. Although the system offers a suitable way to protect privacy from mining-based attacks, it adds a performance overhead when the client accesses the data frequently; for instance, a client running a global analysis over the complete dataset must access the data across different locations, with degraded performance.

Tassa (2014) developed a protocol for secure mining of association rules in horizontally distributed databases. The protocol has advantages over leading protocols in terms of performance and security. It includes two sub-protocols: (1) a multi-party protocol to compute the union or intersection of private subsets held by the clients and (2) a protocol to test whether an element held by one client is included in a subset held by another. Techniques based on field-level and row-level distribution of transactional data are proposed by Chan and Keng (2013). They presented a distributed framework for privacy-preserving outsourcing of association rule mining and explored the possibility of its deployment. Database information is distinguished, based on its characteristics, for distribution to multiple servers, and the privacy notions are examined from two separate viewpoints: the distribution of support values and K-anonymity. The proposed algorithms for allocating transactions to the outsourced servers are driven by which type of privacy notion matters most to the user. Dong and Kresman (2009) explained the relation between distributed data mining and the prevention of indirect disclosure of private data in privacy-preserving algorithms, devising two protocols to avoid such disclosures. The first is a simple add-on to a protocol used for a different application, whereas the second offers collusion resistance and fewer broadcasts. The simplicity of the protocols keeps the requirements for computation, data storage and data structures minimal. The notion of trust is also introduced, and the performance of certain ID assignment protocols is addressed.

Aggarwal et al. (2005) discussed data-encryption-based methods, which cause a large overhead in query processing, and proposed a new distributed framework enabling privacy preservation for outsourced data storage. Different techniques are used to decompose the data, and improved query performance is demonstrated when they are implemented in such distributed systems. A new definition of privacy is coined based on hiding sets of attributes; how securely the proposed decomposition approaches achieve privacy is discussed, and the best privacy-preserving decomposition technique is identified. Future work includes identifying improved algorithms for decomposition, expanding the scope of decomposition techniques (e.g., supporting replication), and incorporating these techniques into the query optimization framework. Xu and Yi (2011) observed that privacy-preserving distributed data mining has passed through different stages and still persists. A taxonomy is proposed to endorse the standardization and assessment of protocol efficiency; it can be applied to categorize PPDDM protocols along predefined dimensions, namely the data partitioning model, the mining algorithms, the privacy preservation methods and the secure communication model. The area is promising, yet the solution and evaluation work remains open for further investigation.

Inan and Saygin (2010) presented a technique to build dissimilarity matrices for horizontally distributed data mining. The comparison requires pairwise record operations over personal private datasets that are horizontally distributed across different sites. The approach handles both character and numerical data, and a number of comparison functions are provided for these two types of data sets. As expected, however, ensuring privacy has its costs when measured against the baseline protocol in which private data is simply shared with third parties. The secure comparison protocols are used for clustering horizontally partitioned datasets, and these methods have further application areas such as record linkage and outlier detection. Nanavati and Jinwala (2012) elaborated different approaches for finding global and partial cycles in a distributed setup while keeping the privacy of the individual parties secure in a cooperative setting. The interleaved algorithm is extended and modified to privately determine global cycles in cyclic association rules, and privacy preservation techniques based on the homomorphic approach and on secret sharing are recommended. It is concluded that approaches based on Shamir's secret sharing can be employed to detect partial and global cycles. A few open research challenges remain, including applying these privacy-preserving theories to other temporal rule mining methods, such as calendric association rules and temporal predicate association rules, and determining the most efficient and accurate technique by practically comparing the cost of each method.

Agrawal and Srikant (2000) developed a uniform-randomization-based association rule method for categorical datasets. In this approach, before sending data to the server, the client replaces, with some probability, each item by a new item that is originally absent from the data. This substitution of specific values in a dataset with other values is called uniform randomization and is a generalization of Warner's "randomized response" technique. In other data reconstruction techniques, the original data are set aside and a sanitized "knowledge base" is initiated; new data are then reassembled based on the sanitized knowledge. The effectiveness of randomization with reconstruction for categorical attributes is exemplified.
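A minimal sketch of uniform randomization as described above: each item of a transaction is, with probability p, replaced by an item drawn from the universe that is absent from the transaction. The item universe and the value of p are assumptions.

```python
# Minimal sketch of uniform randomization for a single transaction.
import random

ITEM_UNIVERSE = [f"i{n}" for n in range(100)]

def uniform_randomize(transaction, p=0.5, rng=random.Random(42)):
    out = []
    for item in transaction:
        if rng.random() < p:
            # replace by a random item not present in the original transaction
            out.append(rng.choice([i for i in ITEM_UNIVERSE if i not in transaction]))
        else:
            out.append(item)
    return out

print(uniform_randomize(["i1", "i2", "i3"]))   # randomized transaction sent to the server
```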

Wang et al. (2010) proposed a modified algorithm called PPFDM, with an associated computation technique, based on Frequent Data Mining (FDM) to preserve privacy. The process computes the total support count in a privacy-preserving way while concealing which site contributed each local large item-set and local support count. Communication time is thereby saved, and the privacy of the distributed data at each site is secured. The experimental results demonstrated the effectiveness and suitability of the method for practical applications, especially for privacy preservation during the mining process.

Nguyen et al. (2012) presented an Enhanced M. Hussein et al.'s Scheme (EMHS) for privacy-preserving association rule mining over horizontally distributed databases. EMHS improves privacy and efficiency as the number of sites increases, and its efficiency is discerned to be much better than that of the original MHS (proposed in 2008), particularly for databases with an increasing number of sites. A second approach is also presented for other types of datasets. Addressing collusion between the Initiator and the Combiner remains an important problem. Om Kumar et al. (2013) used WEKA to predict patterns in a single cloud, and by using a cloud data distributor with a secure distributed approach they provided an effective solution that prevents such mining attacks on the cloud, making the cloud a secure platform for service and storage.

Mokeddem and Belbachir (2010) proposed a distributed model to perform class-association rule detection in a shared-nothing framework. One of the fastest known sequential algorithms (FP-growth) is extended to produce classification rules in a parallel setting. With the proposed system, data replication across the sites is avoided, with an option to communicate the required information. These choices are evaluated experimentally, permitting the analysis of several important aspects such as accuracy, scalability, speedup, memory usage, communication, synchronization, and load balancing. Ibrahim et al. (2012) developed a practical cryptographic model to compute KNN classification over distributed cloud databases. Their experiments demonstrated accuracy similar to that of the naive scheme without security. Such schemes may mitigate users' concerns and accelerate the adoption of cloud computing. Extending the secure classifier to the malicious adversary security model is left for future work.

Patel et al. (2012) proposed an operative algorithm to protect the secrecy of distributed K-means clustering using Shamir's secret sharing scheme. The approach computes the cluster means collaboratively and avoids any trusted third party. Upon comparison, the framework is observed to be orders of magnitude faster than oblivious polynomial evaluation and homomorphic encryption techniques in terms of computation cost, and more reliable for huge databases. Extending the algorithm to vertical partitioning and to the malicious adversary model is essential, and results from a realistic distributed emulation are worth examining. Kumbhar and Kharat (2012) analysed and compared different techniques for Privacy Preserving Association Rule Mining (PPARM). For vertically partitioned datasets, algorithms based on cryptographic techniques, homomorphic encryption, secure scalar products and Shamir's secret sharing are employed to satisfy the privacy constraints, whereas for horizontally partitioned datasets an algorithm combining the RSA public-key cryptosystem with a homomorphic encryption scheme is used, and the Paillier cryptosystem is employed to determine the global supports. In practice, while calculating c.count collaboratively, a participant may deviate from the algorithm and behave maliciously, but the algorithm is semantically secure and prevents collusive behaviour while producing accurate results.
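For concreteness, the following minimal sketch shows Shamir's (t, n) secret sharing over a prime field, the primitive the works above rely on: the secret becomes the constant term of a random polynomial, shares are evaluations at distinct points, and any t shares reconstruct the secret by Lagrange interpolation. The field modulus and parameters are illustrative.

```python
# Minimal sketch of Shamir's (t, n) secret sharing over a prime field.
import random

P = 2_147_483_647  # a Mersenne prime used as the field modulus

def share(secret, t, n, rng=random.Random(0)):
    coeffs = [secret] + [rng.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, e, P) for e, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    secret = 0
    for i, (xi, yi) in enumerate(shares):          # Lagrange interpolation at x = 0
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = share(123456, t=3, n=5)
print(reconstruct(shares[:3]))   # any 3 of the 5 shares recover 123456
```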

Nix et al. (2012) implemented two sketching protocols for the scalar (dot) product of two vectors, which serve as sub-protocols in larger data mining tasks. Extensive experimentation revealed high accuracy, low data leakage, and orders-of-magnitude better efficiency. The security properties of these approximations are analysed under a formal security definition; in contrast to previous proposals, they are very efficient approximation protocols. It is worth exploring the use of these dot product protocols in other data mining tasks such as support vector machines, neural networks, and clustering. The notion of a secure approximation, and how far the restrictions posed by the security model can be relaxed, also deserve further attention.
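The sketch below illustrates the general flavour of such protocols with a plain random-projection sketch: both parties compress their vectors with shared randomness and exchange only the short sketches, from which the dot product is estimated. This is a generic stand-in under stated assumptions, not Nix et al.'s specific constructions.

```python
# Minimal sketch of estimating a dot product from random-projection sketches.
import random

def make_projection(dim, k, seed=1234):
    rng = random.Random(seed)          # the seed plays the role of shared randomness
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def sketch(vec, proj):
    return [sum(r * v for r, v in zip(row, vec)) for row in proj]

a = [1.0, 0.0, 2.0, 1.0, 0.5]
b = [0.5, 1.0, 1.0, 0.0, 2.0]
proj = make_projection(len(a), k=256)
est = sum(x * y for x, y in zip(sketch(a, proj), sketch(b, proj))) / 256
print(est, sum(x * y for x, y in zip(a, b)))   # estimate vs. exact value (3.5)
```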

Keshavamurthy et al. (2013) demonstrated that the GA approach has two potential advantages over traditional frequent pattern mining algorithms. In frequent pattern mining the population is formed only once, whereas in the GA method a population is formed for each generation, which maximizes the sample set. However, the major drawback of the GA approach is the duplication across its successive generations. For privacy-preserving data mining over a distributed dataset, the key goal is to permit the computation of collective statistics for the complete database while assuring the privacy of the confidential data of the contributing databases. Hence, privacy preservation algorithms need further improvement with respect to the trade-off between reconstruction accuracy and privacy. Moreover, the fitness function of the GA plays an important role: the convergence of the search is directly proportional to the effectiveness of the fitness function, so a superior fitness function for a given problem leads to faster convergence of the GA.

K-anonymity based PPDM

For the sake of clarity, it is customary to render two important definitions related to K-anonymity.

The first defines a quasi-identifier (Sweeney 2002): given a population of entities U, an entity-specific table T(A1, …, An), and mappings fc : U → T and fg : T → U′ with U ⊆ U′, a quasi-identifier of T, written QT, is a set of attributes {Ai, …, Aj} ⊆ {A1, …, An} for which ∃ pi ∈ U such that fg(fc(pi)[QT]) = pi. The second defines K-anonymity itself: a table T satisfies K-anonymity if for every tuple t ∈ T there exist k − 1 other tuples ti1, ti2, …, tik−1 ∈ T such that t[C] = ti1[C] = ti2[C] = ⋯ = tik−1[C] for all C ∈ QT (Machanavajjhala et al. 2007).
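A small concrete check of the second definition, with generalized age ranges and zip prefixes as the quasi-identifier (the table is illustrative): every combination of quasi-identifier values must occur in at least k tuples.

```python
# Check K-anonymity: each quasi-identifier value combination occurs >= k times.
from collections import Counter

QI = ("age_range", "zip_prefix")
table = [
    {"age_range": "20-30", "zip_prefix": "606**", "disease": "flu"},
    {"age_range": "20-30", "zip_prefix": "606**", "disease": "cold"},
    {"age_range": "40-50", "zip_prefix": "913**", "disease": "asthma"},
    {"age_range": "40-50", "zip_prefix": "913**", "disease": "flu"},
]

def is_k_anonymous(table, qi, k):
    counts = Counter(tuple(row[a] for a in qi) for row in table)
    return all(c >= k for c in counts.values())

print(is_k_anonymous(table, QI, k=2))   # True: each QI group contains 2 tuples
```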

A scalable solution examines, in each iteration, at least one generalization for each attribute involved in the linking. Wang et al. (2004) studied data mining as an approach to data masking, called data-mining-based privacy protection. Data mining methods are examined through the concept of data generalization, whereby the original information, rather than the trends and patterns, is hidden; after masking, the common data mining methods can be employed without modification. Two key factors, quality and scalability, are specifically addressed: the quality issue is settled via the trade-off between privacy and information, and the scalability issue is handled by a new data architecture focusing on good generalizations. Loukides and Gkoulalas-Divanis (2012) proposed a novel technique to anonymize data that satisfies the data publishers' utility requirements while incurring low information loss. An accurate information loss measure and an effective anonymization algorithm are introduced to minimize information loss. Experimental investigations on click-stream and medical data revealed that the technique permits more accurate query answering than state-of-the-art techniques of comparable efficiency. This work opens up several promising avenues for future research, including examining how UAR can be extended to guard against both identity and sensitive information disclosure, and how to produce anonymized data with guaranteed utility in specific data mining tasks such as classification and association rule mining. Friedman et al. (2008) extended the definitions of K-anonymity to prove that a data mining model does not violate the K-anonymity of the clients represented in the learning examples. A tool is provided to determine the amount of anonymity retained during data mining, and the approach was shown to apply to different data mining problems including classification, association rule mining and clustering.

K-anonymity is further combined with data mining approaches to protect the respondents' identities. Ciriani et al. (2008) highlighted the potential threats to K-anonymity raised by performing mining on collected data and analysed the two main techniques for combining K-anonymity with data mining. The different approaches employed to detect K-anonymity violations are also described, and the elimination of such violations in association rule mining and classification mining is emphasized. He et al. (2011) proposed a clustering-based algorithm to produce a utility-friendly anonymized version of microdata. The method is found to outperform the non-homogeneous technique when the size of the QI-attribute set is greater than 3. They achieved a clustering-based K-anonymity algorithm that showed considerable improvement in utility when applied to several real datasets. Recently, K-anonymous privacy preservation has been widely employed, and further modification appears increasingly difficult without resolving several issues. Patil and Patankar (2013) examined the standard K-anonymity techniques and their applications, and surveyed some of the multidimensional K-anonymity work. Multidimensional K-anonymity algorithms using a nearest-neighbour strategy remain useful for enhancing the quality of anonymity and reducing information loss.

Lately, K-anonymity has become one of the most important topics in privacy preservation. It can effectively prevent privacy leaks caused by link attacks and is certainly one of the most widely used approaches across fields (Zhu and Chen 2012). Soodejani et al. (2012) employed a version of the chase termed the standard chase, which places some restrictions on the dependencies and constraints, such as being positive and conjunctive; investigating the applicability of other versions of the chase in the method is a prospective direction for future study. The anonymity principle of their method shows some similarity to the L-diversity privacy model, and investigating other privacy models, such as t-closeness, may provide a stronger privacy model for the method with considerable usefulness. Karim et al. (2012) proposed a numerical method to mine maximal frequent patterns with privacy-preserving capability, featuring an efficient data transformation technique, a novel encoded and compressed lattice structure, and the MFPM algorithm. The proposed lattice structure and MFPM algorithm reduce both the search space and the searching time. Experimental results showed that MFPM outperformed PC_Miner and existing maximal frequent pattern mining algorithms, and that the lattice structure outperformed FP-like trees and the PC_tree algorithm.

Loukides et al. (2012) proposed a rule-based privacy model that allows data publishers to express fine-grained protection requirements for both identity and sensitive information disclosure. Based on this model, they developed two anonymization algorithms. The first works top-down, employing an efficient strategy to recursively generalize data with low information loss. The second uses sampling and a mixture of bottom-up and top-down generalization heuristics, which greatly improves scalability while maintaining low information loss. Extensive experiments show that these algorithms significantly outperform the state of the art in preserving data utility while offering good protection and scalability. This provides a foundation for future studies. First, while identity and sensitive information disclosure are the main concerns in data publishing, it is worth examining membership disclosure, in which inferring whether an individual's record is contained in the published data must be prevented. Second, it is worth extending the approach to anonymize disk-resident data with small memory consumption and I/O overhead.

Vijayarani et al. (2010b) studied K-anonymity as an interesting approach to protect the microdata of public or semi-public sectors from linking attacks. The possible threats to the K-anonymity approach are described in detail, and the problems related to the data, as well as the approaches for combining K-anonymity with data mining, are identified. Nergiz et al. (2009) improved and extended the definitions of K-anonymity to multiple relations. It is shown that earlier techniques either fail to protect privacy or excessively reduce data utility and protection in a multi-relational setting. A new clustering algorithm is introduced to obtain multi-relational anonymity, and experimental results illustrate that the technique is effective in terms of utility and efficiency. Support for arbitrary schemas with multiple private entities still needs to be considered.

The problem of secure outsourcing of frequent item-set mining to multi-cloud environments is studied by Tai et al. (2013). Addressing the challenges of big data analysis, they suggested partitioning the data into several parts and outsourcing each part independently to a different cloud, based on a pseudo-taxonomy anonymization technique known as KAT. They proposed DKNT to ensure the privacy of each partial dataset outsourced to the different clouds. Experimental results demonstrated excellent protection and better computational efficiency compared with processing on a single machine. Tai et al. (2010) presented K-support anonymity, which protects against a knowledgeable attacker who possesses exact support information. To achieve K-support anonymity, a pseudo-taxonomy tree is introduced and the third party mines the generalized frequent item-sets. The construction of the pseudo-taxonomy tree facilitates hiding the original items and limits the fake items introduced into the encrypted database. The results showed very good privacy protection with moderate storage overhead. K-anonymity is further enhanced and improved by Pan et al. (2012), who analyzed and compared the developed K-anonymity models and discussed their applications. Modified K-anonymity models, such as L-diversity, (α, K)-anonymity and (α, L)-diversification K-anonymity, overcome the limitations of the original model with respect to privacy, and several K-anonymity methods are employed as the core technology.

Based on suppression, Deivanai et al. (2011) proposed a new K-anonymity technique called 'kactus'. The technique performs multi-dimensional suppression: values are suppressed in certain records, based on other attributes, without using domain hierarchy trees. The approach thus identifies attributes irrespective of the classification of the data records and suppresses their values to comply with K-anonymity. It was implemented on different databases to determine its accuracy and efficiency and compared with other K-anonymity based techniques. It is affirmed that in a multiparty environment, anonymization can be combined with perturbation to preserve privacy. A new definition of the K-anonymity model for effective privacy protection of personal sequential data is introduced by Monreale et al. (2014). The method transforms sequential datasets into a K-anonymous form while preserving the utility of the data with reference to a variety of analytical properties. A series of experiments on different real-life sequential databases showed that the approach substantially preserves the sequential pattern mining results, not only in terms of the extracted patterns but also of their support. Furthermore, the results appear particularly interesting for dense datasets.

Nergiz and Gök (2014) and Nergiz et al. (2013) introduced hybrid generalizations, which not only perform generalization but also include a mechanism for data relocation, whereby certain cells are moved into populated, indistinguishable groups of data cells. Relocation helps generate anonymizations of finer granularity while ensuring the underlying privacy. Data relocation is a trade-off between the utility and the truthfulness of the data, and the trade-off is controlled by a provider-supplied parameter. The results revealed that a small number of relocations can enhance utility, as measured by heuristic metrics and query-answering accuracy. A hybrid generalization mechanism for relocating data is introduced in Nergiz and Gök (2014): data cells are relocated to certain populated small groups of tuples that remain indistinguishable from each other. Again, the relocation helps generate anonymizations of finer granularity that ensure data privacy, and it is demonstrated that a small number of relocations can remarkably enhance utility. New hybrid algorithms can be designed for other privacy metrics, such as ℓ-diversity, (α, k)-anonymity or δ-presence, which would be crucial in addressing different types of adversaries. There is also room for improving the proposed hybrid algorithms; for example, one could design hybrid algorithms that theoretically bound the probability of identification against algorithm-aware adversaries.

Zhang et al. (2013a, 2014a) investigated the scalability of sub-tree anonymization for big data storage on the cloud. They developed a hybrid approach combining Top-Down Specialization (TDS) and Bottom-Up Generalization (BUG). In this method, one of the two components is selected automatically by comparing the K-anonymity parameter with a user-defined workload balancing point. Both TDS and BUG are accomplished in a scalable way via a series of deliberately designed MapReduce jobs. Building on these contributions, the next step worth exploring is scalable privacy-preservation-aware analysis and scheduling on large-scale datasets. Later, Zhang et al. (2014b) introduced a two-phase TDS technique based on MapReduce on the cloud. In the first phase, the datasets are anonymized and partitioned in parallel, producing intermediate results; in the second phase, the intermediate results are aggregated for further anonymization to produce consistent K-anonymous datasets. MapReduce on the cloud is employed for data anonymization, and a group of MapReduce jobs is deliberately designed to accomplish the specialization computation in a scalable way. Results from applying this method to real-world datasets showed that the scalability and competence of TDS make its performance much better than that of existing methods. They also presented an efficient quasi-identifier index based technique to preserve privacy over incremental datasets on the cloud. In this technique, QI-groups (QI: quasi-identifier) are indexed using the domain values at the current generalization level, which allows access to only a small portion of the records in the database rather than the whole database (Zhang et al. 2013b, c). In addition, Ding et al. (2013) introduced a distributed anonymization protocol for privacy-preserving data publishing from multiple data providers in a cloud system. Their method performs a personalized anonymization to satisfy every data provider's requirements, and the union forms a global anonymization to be published. They also presented a new anonymization algorithm using an R-tree index structure.
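To make the generalization step that such TDS/BUG pipelines drive more concrete, here is a minimal single-machine sketch: a quasi-identifier is coarsened level by level until every group reaches size k. The age hierarchy and its levels are assumptions; real TDS works top-down over taxonomy trees with information-gain scoring, and the counting is distributed across MapReduce jobs.

```python
# Minimal sketch of level-wise generalization until k-anonymity holds.
from collections import Counter

def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year band; level 2: fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        lo = age // 10 * 10
        return f"{lo}-{lo + 9}"
    return "*"

def anonymize(ages, k):
    for level in range(3):                # try the least general level first
        groups = Counter(generalize_age(a, level) for a in ages)
        if all(c >= k for c in groups.values()):
            return [generalize_age(a, level) for a in ages]
    return ["*"] * len(ages)              # fall back to full suppression

print(anonymize([23, 27, 25, 41, 44], k=2))
# ['20-29', '20-29', '20-29', '40-49', '40-49']
```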