A Study on Selective Data Mining Algorithms

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label that is most frequent among the k training samples nearest to that query point. Euclidean distance is usually used as the distance metric; however, it is only applicable to continuous variables. In cases such as text classification, another metric such as the overlap metric (or Hamming distance) can be used. The classification accuracy of k-NN can often be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood Components Analysis. A drawback of the basic "majority voting" classification is that classes with more frequent examples tend to dominate the prediction of the new vector, simply because they are likely to appear among the k nearest neighbors by virtue of their large number. One way to overcome this problem is to weight the votes by the distance from the test point to each of its k nearest neighbors.
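
As an illustration of the distance-weighted variant described above, the following is a minimal sketch of k-NN classification in Python; the toy data, the inverse-distance weighting scheme, and the function name are assumptions for illustration, not taken from the study.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, query, k=3, eps=1e-9):
    """Classify `query` by inverse-distance-weighted voting over its k nearest neighbors."""
    # Euclidean distances from the query point to every training vector
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest training samples

    votes = defaultdict(float)
    for idx in nearest:
        # Closer neighbors get larger weights; eps avoids division by zero
        votes[y_train[idx]] += 1.0 / (dists[idx] + eps)
    return max(votes, key=votes.get)

# Toy example (hypothetical data): two classes in a 2-D feature space
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # expected: "A"
```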

Data Mining: A Preprocessing Engine

The HSV data set was used in two experimental settings. First, the entire HSV data set of 122 examples was treated as the training set, and no testing set was required. Second, the data set was split into two subsets: a training set consisting of 75% of the original HSV data (92 examples) and a testing set consisting of the remaining 25% (30 examples). The reason for using both a training set of 122 examples and a training set of 92 examples is to check the effect of the number of training examples on accuracy, simplicity and tree-growing time. Al Shalabi tested the HSV data against the min-max normalization method [5]. This study extends that work to take into account the z-score and decimal scaling normalization methods. Comparisons between the three normalization methods are discussed in this paper.
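
For reference, below is a minimal sketch of the three normalization methods compared above (min-max, z-score and decimal scaling); the NumPy implementation, the new-range parameters and the toy values are illustrative assumptions, not code from the study.

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Center on the mean and scale by the standard deviation."""
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    """Divide by 10^j, where j is the smallest integer making max(|x|) < 1."""
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / (10 ** j)

x = np.array([120.0, 75.0, 92.0, 30.0])   # hypothetical attribute values
print(min_max(x), z_score(x), decimal_scaling(x), sep="\n")
```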

Formalization of Learning Patterns Through SNKA

INTEGRATION OF DIC ALGORITHM WITH SNKA. Many data mining methods were analyzed. The classification method was applied to classify user profiles based on their profile characterization; its supporting algorithms are decision trees, Naïve Bayes, neural networks and C&RT. Clustering methods were also mainly used for studying information retrieval patterns. There are two ways to approach social networking data analysis: i) graph-based analysis, which requires studying the nodes generated by the web structure, and ii) text-based analysis, which relies on clustering, where new clusters are created based on the behavior of numerical data. Figure 1 depicts how SNKA relates to data mining algorithms. Users can manage their knowledge by following the SNKA framework to keep their knowledge base strong, which in turn can strengthen their decision making.

Optimization of Building Energy Consumption through Data Mining Using Modern Science

percent of the information to discover the knowledge and to be analyzed and evaluated. According to MIT, data mining is one of ten sciences developing in the next decade that will bring about a technological revolution. Although the concept of data mining was first introduced by Gregory Piatetsky-Shapiro at a conference on artificial intelligence in 1989 and is therefore more than two decades old, it continues to be regarded as the foundation of a new science; in some areas it has not yet found its place, and even the proposed definitions of data mining are multiple. [3] The Gartner Institute has defined it as follows: "data mining is the discovery of significant new correlations, patterns and trends by sifting through large amounts of stored data, using pattern recognition techniques as well as mathematical and statistical methods." [4] In another definition, data mining is considered the process of extracting previously unknown, understandable and reliable information from large databases and using it in decision making. [5] In fact, data mining is a bridge between statistics, computer science, software engineering, artificial intelligence, pattern recognition, machine learning and visual representation of data. So far, data mining has seriously entered branches of science such as marketing, finance, manufacturing, medicine, tracking, text mining, the web, forecasting and organizational learning; in other related areas, surveys have been conducted only sporadically. In the construction industry, internationally, and especially concerning the optimization of energy consumption, which is the subject of this article, case studies have been carried out, and the research ahead provides a broader view regarding the creation of a new study area and a basis for the development of

Modified Weighted PageRank Algorithm using Time Spent on Links

The World Wide Web is an interactive medium for sharing information in a huge, diverse and dynamic way. Web Mining is the process of extracting previously unknown, valuable and comprehensible patterns of information from large web data repositories [1]. Web Mining is categorised into Web Structure Mining, Web Content Mining and Web Usage Mining. Web Structure Mining mines the structure of hyperlinks within the web itself. Web Content Mining is used to extract useful information from the content of web pages. Web Usage Mining is the application of Web Mining techniques to large web log repositories in order to extract knowledge about users' behavioural patterns. With the dynamic growth of, and increasing data on, the web, it is very difficult for a user to find relevant information. When a user submits a query to a search engine, the displayed result contains both relevant and non-relevant web pages matching the query. Ranking algorithms are therefore needed to prioritize the search results so that the more relevant pages are displayed at the top. Various ranking algorithms based on web structure mining and web usage mining, such as PageRank, Weighted PageRank, PageRank with VOL and Weighted PageRank with VOL, have been developed.
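
To make the structure-mining side concrete, below is a minimal sketch of the basic PageRank iteration (not the modified weighted variant proposed in the paper); the example link graph, the damping factor of 0.85 and the convergence tolerance are illustrative assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Iteratively compute PageRank for a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        new_rank = {}
        for p in pages:
            # Sum the rank contributed by every page that links to p
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1.0 - damping) / n + damping * incoming
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical three-page link graph
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```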

Mining Frequent Itemsets from Online Data Streams: Comparative Study

In [Jin R. and Agrawal G., 2005], an algorithm called StreamMining is proposed. It builds on the idea of [Karp R.M. et al., 2003] for determining frequent items (or 1-itemsets). In [Karp R.M. et al., 2003], a two-pass algorithm was presented for this purpose which requires only O(1/θ) memory, where θ is the desired support level. The first pass computes a superset of the frequent items, and the second pass eliminates any false positives. The StreamMining algorithm addressed three major challenges in applying this idea to frequent itemset mining in a streaming environment. First, it developed a method for finding frequent k-itemsets while still keeping the memory requirements limited. Second, it developed a way to bound the size of the superset computed after the first pass. Third, it developed a data structure and a number of other implementation optimizations to support efficient execution. This data structure, called TreeHash, implements a prefix tree using a hash table; it has the compactness of a prefix tree and allows easy deletions like a hash table. The algorithm also uses a relaxed minimum support threshold ϵ, like almost all mining algorithms for data streams, so the memory requirements increase in proportion to 1/ϵ. The algorithm therefore has to compute the k-itemsets approximately after the first pass, without requiring any out-of-core or large summary structure, and to ensure a provable bound on the accuracy of the results after that pass, because in streaming environments a second pass over the dataset is usually not feasible. Therefore, it is important that the set K computed above does not contain many false positives. It differs from [Manku G., and Motwani R., 2002] in its space requirements: for finding frequent items it takes O(1/θ) space, while [Manku G., and Motwani R., 2002] requires O((1/θ) log(θN)) space. Moreover, [Manku G., and Motwani R., 2002] requires an out-of-core data structure, while StreamMining uses an in-core data structure. It also has deterministic bounds on the accuracy. One exception is
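
As background for the 1-itemset idea above, here is a minimal sketch of the first pass of a Karp et al. style frequent-items algorithm (keeping at most ⌈1/θ⌉ − 1 counters and decrementing them all when the set fills up); the function name and the toy stream are illustrative assumptions, and the StreamMining extension to k-itemsets with the TreeHash structure is not reproduced here.

```python
import math

def frequent_items_superset(stream, theta):
    """One pass over `stream`; returns a superset of the items with frequency >= theta.

    At most ceil(1/theta) - 1 counters are kept; a second pass over the data
    would be needed to remove false positives.
    """
    k = math.ceil(1.0 / theta)
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# Hypothetical stream: 'a' occurs in well over 30% of the items
stream = ["a", "b", "a", "c", "a", "a", "d", "a", "b", "a"]
print(frequent_items_superset(stream, theta=0.3))  # 'a' is guaranteed to appear
```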

Overview Of Streaming-Data Algorithms

Due to recent advances in data collection techniques, massive amounts of data are being collected at an extremely fast pace, and these data are potentially unbounded. Boundless streams of data collected from sensors, equipment and other data sources are referred to as data streams. Various data mining tasks can be performed on data streams in search of interesting patterns. This paper studies a particular data mining task, clustering, which can be used as the first step in many knowledge discovery processes. By grouping data streams into homogeneous clusters, data miners can learn about data characteristics, which can then be developed into classification models for new data or predictive models for unknown events. Recent research addresses the problem of data-stream mining to deal with applications that require processing huge amounts of data, such as sensor data analysis and financial applications. For such analysis, single-pass algorithms that consume a small amount of memory are critical.
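
To illustrate the single-pass, low-memory constraint mentioned above, the following is a minimal sketch of online (sequential) k-means, which updates cluster centroids one point at a time without storing the stream; it is a generic illustration under assumed parameters, not an algorithm from the surveyed papers.

```python
import numpy as np

class OnlineKMeans:
    """Sequential k-means: each point is seen once and immediately folded into a centroid."""

    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(k, dim))   # assumed random initialization
        self.counts = np.zeros(k)

    def update(self, x):
        # Assign the point to the nearest centroid
        j = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.counts[j] += 1
        # Move that centroid toward the point by a shrinking learning rate 1/n_j
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

# Hypothetical two-cluster stream processed one point at a time
model = OnlineKMeans(k=2, dim=2)
for point in np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5]):
    model.update(point)
print(model.centroids)
```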

Towards a New Approach for Mining Frequent Itemsets on Data Stream

Frequent pattern mining has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structured pattern mining, correlation mining, associative classification and frequent-pattern-based clustering, as well as their broad applications. In this article, we provide a brief overview of the current status of frequent pattern mining and discuss a few promising research directions. We believe that frequent pattern mining research has substantially broadened the scope of data analysis and will have a deep impact on data mining methodologies and applications in the long run. However, there are still some challenging research issues that need to be solved before frequent pattern mining can claim to be a cornerstone approach in data mining applications.

A Data Mining view on Class Room Teaching Language

Since ancient times in India, educational institutions have relied on classroom teaching, in which a teacher explains the material and students understand and learn the lesson. There is no absolute scale for measuring knowledge, but the examination score is one scale that indicates student performance. It is therefore important not only that appropriate material is taught, but also which language is chosen for teaching, that class notes are prepared, and that attendance is recorded. This study analyses the impact of language on the presence of students in the classroom. The main idea is to find the support, confidence and interestingness levels relating the chosen language and attendance in the classroom. For this purpose, association rules are used.
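
To make the measures used in the study concrete, here is a minimal sketch of computing support and confidence for an association rule over toy transaction data; the attendance/language records and the function names are hypothetical examples, not data from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the full rule divided by the support of its antecedent."""
    return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)

# Hypothetical records: language used in class and whether the student attended
records = [
    {"lang=local", "attended"},
    {"lang=local", "attended"},
    {"lang=english", "absent"},
    {"lang=local", "absent"},
    {"lang=english", "attended"},
]
print(support(records, {"lang=local", "attended"}))          # 0.4
print(confidence(records, {"lang=local"}, {"attended"}))     # 0.666...
```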

Seismic Hazard Prediction Using Seismic Bumps: A Data Mining Approach

Some studies carried out in the literature using seismic bumps are as follows. Bilen et al. [1] proposed a system for earthquake prediction by analyzing seismic bump data; 94.11% classification accuracy was achieved in that study, in which the k-nearest neighbor algorithm was used as the classification algorithm. Celik et al. [2] proposed an intelligent system for earthquake prediction by analyzing seismic bump data; 91% classification accuracy was achieved using a support vector machine as the classification algorithm. Dehbozorgi and Farokhi [3] used a neuro-fuzzy algorithm in their study conducted on seismometer data and obtained an 82% accuracy rate. Zhang et al. [4] proposed multi-scale wavelet analysis for single-component recordings. Colak et al. [5] used the wavelet method and the average energy value for detecting seismic wave arrival times at three-component stations. Xu et al. [6] carried out analysis on data obtained from the DEMETER satellite, from which they obtained information such as electron density, electron temperature, ion temperature and oxygen ion intensity in the seismic band. In the present study, seismic bump data obtained from coal mines in Poland has been used for earthquake prediction. Knowledge discovery has been performed on the obtained data by using data mining methods, and ELM, an effective and fast algorithm, was used at the classification stage.
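
Assuming ELM here refers to the extreme learning machine, the following is a minimal sketch of its training procedure (random hidden-layer weights, output weights solved by least squares); the hidden-layer size, the sigmoid activation and the toy data are illustrative assumptions rather than the configuration used in the study.

```python
import numpy as np

class ELMClassifier:
    """Single-hidden-layer network: random input weights, output weights fit by least squares."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid activations

    def fit(self, X, y):
        n_features = X.shape[1]
        self.W = self.rng.normal(size=(n_features, self.n_hidden))   # fixed random weights
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        Y = np.eye(y.max() + 1)[y]                            # one-hot encode the class labels
        self.beta, *_ = np.linalg.lstsq(H, Y, rcond=None)     # solve H @ beta ≈ Y
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# Hypothetical binary "hazardous / non-hazardous" data
X = np.vstack([np.random.randn(50, 4), np.random.randn(50, 4) + 2])
y = np.array([0] * 50 + [1] * 50)
print((ELMClassifier(n_hidden=20).fit(X, y).predict(X) == y).mean())
```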

A Recent Review on XML data mining and FFP

Data mining is also referred to as knowledge discovery in databases. It deals with issues such as representation schemes for the concept or pattern to be discovered and the design of appropriate functions and algorithms to find patterns. However, data on the web and in bioinformatics databases often lack such a regular structure and are called semi-structured. This survey paper gives a brief overview of XML data mining using association rules and fast frequent pattern mining in various fields, the modifications made to association rules according to the applications in which they were used, and their effective results. Association rules have thus proved to be the most effective technique for frequent pattern matching over the past decade. XML has become very popular for representing semi-structured data and is a standard for data exchange over the web, so mining XML data from the web is becoming increasingly important. However, the structure of XML data can be more complex and irregular than that of traditional structured data. Association rule mining plays a key role in the process of mining data for frequent pattern matching. Frequent Pattern-growth (FP-growth) mines the complete set of frequent patterns by pattern fragment growth: FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets and uses a partition-based, divide-and-conquer method. This paper presents a review of XML data mining using fast frequent pattern mining in various domains.
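
To illustrate the pattern-fragment-growth idea mentioned above, here is a minimal sketch of building an FP-tree (the prefix-tree structure that FP-growth mines without candidate generation); the toy transactions and the simplified node class are illustrative assumptions, and the recursive mining step is omitted for brevity.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Insert each transaction, with items ordered by global frequency, into a prefix tree."""
    freq = Counter(item for t in transactions for item in t)
    # Keep only frequent items, most frequent first (the FP-tree item order)
    order = [i for i, c in freq.most_common() if c >= min_support]
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.setdefault(item, FPNode(item, parent=node))
            child.count += 1
            node = child
    return root

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]
tree = build_fp_tree(transactions, min_support=2)
print({item: node.count for item, node in tree.children.items()})  # first-level branch counts
```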

A Hybrid Approach to Privacy Preserving in Association Rules Mining

Nowadays, data mining is a useful yet dangerous technology through which useful information and the relationships between items in a database are detected. Today, companies and users need to share information with others for their progress, and they must somehow manage this information sharing in order to preserve sensitive information. Privacy preserving in data mining was introduced to manage such information sharing. This paper presents a hybrid algorithm based on the distortion technique that combines support-based and confidence-based approaches to privacy preserving. The proposed algorithm tries to maintain useful association rules while hiding sensitive rules from the perspective of the database owner. It also places no limit on the number of items on the left-hand side and right-hand side of rules. The paper compares the proposed algorithm with the MDSRRC algorithm and the 1.b algorithm: the proposed algorithm loses fewer rules than MDSRRC and 1.b, and its CPU usage is lower.
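
As a rough illustration of the distortion idea (not the authors' hybrid algorithm), the sketch below lowers the support of a sensitive rule by deleting its consequent items from supporting transactions until the rule's support falls under the mining threshold; the transactions, the threshold and the function name are hypothetical.

```python
def hide_rule_by_support(transactions, antecedent, consequent, min_support):
    """Distort transactions until support(antecedent ∪ consequent) < min_support."""
    rule_items = set(antecedent) | set(consequent)
    n = len(transactions)
    for t in transactions:
        current = sum(rule_items <= s for s in transactions) / n
        if current < min_support:
            break                      # the rule can no longer be mined at this threshold
        if rule_items <= t:
            t -= set(consequent)       # remove the consequent item(s) from this transaction
    return transactions

data = [{"bread", "milk"}, {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}]
hidden = hide_rule_by_support(data, {"bread"}, {"milk"}, min_support=0.5)
print(hidden)
```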

CLASSIFICATION ALGORITHMS FOR BIG DATA ANALYSIS, A MAP REDUCE APPROACH

For many years, the scientific community has been concerned with how to increase the accuracy of different classification methods, and major achievements have been made so far. Beyond this issue, the increasing amount of data generated every day by remote sensors raises further challenges to be overcome. In this work, a tool within the scope of the InterIMAGE Cloud Platform (ICP), an open-source, distributed framework for automatic image interpretation, is presented. The tool, named ICP: Data Mining Package, is able to perform supervised classification procedures on huge amounts of data, usually referred to as big data, on a distributed infrastructure using Hadoop MapReduce. The tool has four classification algorithms implemented, taken from WEKA's machine learning library, namely: Decision Trees, Naïve Bayes, Random Forest and Support Vector Machines (SVM). The results of an experimental analysis using an SVM classifier on data sets of different sizes and different cluster configurations demonstrate the potential of the tool, as well as the aspects that affect its performance.
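
As a schematic illustration of the MapReduce pattern used by such tools (not the ICP: Data Mining Package itself), the sketch below mimics a streaming-style job in which the map step applies a pre-trained classifier to each record and the reduce step aggregates per-class counts; the stand-in classifier, the record format and the function names are assumptions.

```python
from itertools import groupby

def mapper(lines, classify):
    """Map step: emit (predicted_class, 1) for every input record."""
    for line in lines:
        features = [float(v) for v in line.strip().split(",")]
        yield classify(features), 1

def reducer(pairs):
    """Reduce step: sum the counts emitted for each class."""
    for label, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield label, sum(count for _, count in group)

# Hypothetical stand-in for a model shipped to every worker node
classify = lambda x: "built-up" if sum(x) > 1.0 else "vegetation"

records = ["0.9,0.3", "0.1,0.2", "0.7,0.6"]
print(list(reducer(mapper(records, classify))))
```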

Towards a Data Mining Class Library for Building Decision Making Applications

As can be seen in the image above, the clustering package has a superclass called Cluster, and each technique inherits from it. The classes that inherit directly from it are fuzzy clust and hard, each in its own subpackage; these two are in turn superclasses of the different algorithms implemented. There is also a FIS package containing two classes, each implementing a fuzzy model: one implements the Mamdani fuzzy model and the other implements TSK. Both use the Cluster class, because these algorithms apply clustering techniques to determine the appropriate rules for working with the provided data. The fuzzy classes are based on a Java library called JT2FIS [4]. Finally, the statics subpackage contains the ID3 decision tree implementation, called DecisionTreeID3, and the TreeNode and Attribute classes used by the ID3 implementation.
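
As a rough Python analogue of the hierarchy described above (the actual library is a Java class library built on JT2FIS), the class names below mirror the description, but the method signatures and abstract-base-class structure are assumptions.

```python
from abc import ABC, abstractmethod

class Cluster(ABC):
    """Root of the clustering package: every clustering technique inherits from this class."""
    @abstractmethod
    def fit(self, data): ...

class FuzzyClust(Cluster):
    """Superclass of the fuzzy clustering algorithms."""
    def fit(self, data):
        raise NotImplementedError

class Hard(Cluster):
    """Superclass of the hard (crisp) clustering algorithms."""
    def fit(self, data):
        raise NotImplementedError

class MamdaniFIS:
    """Fuzzy inference system that derives its rules from a clustering step."""
    def __init__(self, cluster: Cluster):
        self.cluster = cluster      # clustering is used to determine the rules

class DecisionTreeID3:
    """Decision-tree learner kept in a separate subpackage, alongside TreeNode and Attribute."""
    def fit(self, data, labels):
        raise NotImplementedError
```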

Analyzing Chatbots Data with Data Mining

presented an approach to expert-guided subgroup discovery. This algorithm is a rule induction system based on beam search. Beam search algorithms use breadth-first search to build the search tree: at each level of the tree, all successors of the states at the current level are generated and sorted in increasing order of a utility function. The algorithm is guided by expert knowledge: instead of defining an optimal measure to discover and automatically select the subgroups, the objective is to help the expert perform flexible and effective searches over a wide range of optimal solutions. Discovered subgroups must satisfy a minimal support criterion and must also be relevant. The algorithm keeps the best subgroup descriptions in a fixed-width beam, and in each iteration a conjunct (antecedent of the rule) is added to every subgroup description in the beam, replacing the worst subgroup in the beam with the new subgroup if the new one is better. It supports categorical target values, as SubgroupMiner does. The main quality measure used by this algorithm is a precision measure called Qg (Eq. 2.7). To describe the subgroups found, the algorithm uses conjunctions of attribute-value pairs and the operators =, < and >.
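
For illustration, below is a minimal sketch of beam-search subgroup discovery with a Qg-style quality measure; the form Qg = TP / (FP + g), the toy data, the beam width and the generalization parameter g are assumptions, since Eq. 2.7 is not reproduced here, and the sketch re-sorts the beam rather than replacing only the single worst subgroup.

```python
from itertools import product

def qg(tp, fp, g=1.0):
    """Precision-like quality measure: true positives traded off against false positives."""
    return tp / (fp + g)

def evaluate(rule, rows, target):
    covered = [r for r in rows if all(r[a] == v for a, v in rule)]
    tp = sum(r[target] for r in covered)
    return qg(tp, len(covered) - tp)

def beam_search(rows, target, attributes, beam_width=3, depth=2):
    """Greedily extend attribute-value conjunctions, keeping only the best `beam_width` rules."""
    conditions = [(a, r[a]) for a in attributes for r in rows]
    beam = [()]                                   # start from the empty description
    for _ in range(depth):
        candidates = {tuple(sorted(set(rule) | {cond})) for rule, cond in product(beam, conditions)}
        beam = sorted(candidates, key=lambda rule: evaluate(rule, rows, target), reverse=True)[:beam_width]
    return beam[0], evaluate(beam[0], rows, target)

# Hypothetical data: which attribute-value conjunction best covers the target class
rows = [
    {"lang": "en", "notes": "yes", "target": 1},
    {"lang": "en", "notes": "no",  "target": 1},
    {"lang": "fr", "notes": "yes", "target": 0},
    {"lang": "fr", "notes": "no",  "target": 0},
]
print(beam_search(rows, "target", ["lang", "notes"]))
```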

Detection and Evaluation of Cheating on College Exams using Supervised Classification

Abstract. Text mining has been used for various purposes, such as document classification and the extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were employed for detecting and evaluating academic dishonesty (cheating) on open-ended college exams, based on document classification techniques. First, we propose two classification models for cheating detection using a supervised decision tree algorithm. Then, both classifiers are compared against the result produced by a domain expert. The results point out that one of the classifiers achieved excellent quality in detecting and evaluating cheating in exams, making its use possible in real school and college environments.
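
As a generic illustration of the document-classification setup described (not the authors' models or data), here is a minimal scikit-learn sketch that vectorizes answer texts and trains a decision tree; the toy answers, the labels and the vectorizer settings are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical exam answers flagged (1) or not flagged (0) as suspicious by an expert
answers = [
    "the capital of france is paris because paris is the capital",
    "paris is the capital of france",
    "photosynthesis converts light energy into chemical energy",
    "plants use sunlight to make food through photosynthesis",
]
labels = [1, 1, 0, 0]

# Bag-of-words features feeding a shallow decision tree classifier
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(max_depth=3, random_state=0))
model.fit(answers, labels)
print(model.predict(["france and its capital paris"]))
```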

A Survey Paper on Crime Prediction Technique Using Data Mining

Crime prediction is an attempt to identify and reduce future crime. It uses past data and, after analyzing those data, predicts future crime along with its location and time. At present, serial criminal cases occur rapidly, so it is a challenging task to predict future crime accurately and with good performance. Data mining techniques are very useful for solving the crime detection problem. The aim of this paper is therefore to study various computational techniques used to predict future crime. The paper provides a comparative analysis of data mining techniques for the detection and prediction of future crime.

A P2P Botnet Virus Detection System Based on Data-Mining Algorithms

The nodes in a P2P network are usually connected through an ad hoc network [3], and the main idea is to form a logical network on top of the existing physical network rather than constructing a new physical network. No matter what kind of logical network structure is adopted, the clients still have to transfer data through the physical layer. For simplicity, the installation of P2P software is usually done by clicking a button; when users are not familiar with its settings, their personal information can easily be exposed to others on the Internet. Moreover, their computers may be infected by botnet viruses if they lack an awareness of network security or have no anti-virus software installed.

DATA CHARACTERIZATION TOWARDS MODELING FREQUENT PATTERN MINING ALGORITHMS

the ratio of the number of distinct items to the number of transactions is extremely small, as shown in Table 5. Based on the observed characteristics of the input data, the number of found sets shown in Table 5 indicates that this ratio is not necessarily a perfect measure of the number of frequent sets. This is not surprising, of course, and we need to introduce other parameters, such as the distribution, to pre-conjecture the number of frequent sets from the input data. When the number of found sets is moderately small, the Apriori algorithm is again always faster than the FP-growth algorithm regardless of support. Different from papertitle, however, both the Apriori algorithm and the FP-growth algorithm increase their execution times as support decreases. One more thing to be considered is that the Apriori algorithm spends more time in the main part of the algorithm as support decreases (Figure 21), whereas the FP-growth algorithm spends more time in reading as support decreases (Figure 22). Figure 23 also shows that the length of the found sets is relatively small, from 2 to 4, and this situation is favorable for the Apriori algorithm. This result also explains why the Apriori algorithm outperformed the FP-growth algorithm.
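
As a reference point for the comparison above, here is a minimal sketch of the Apriori level-wise loop whose candidate-generation cost grows as support decreases; the toy transactions and the minimum support value are illustrative assumptions, and no attempt is made to reproduce the paper's timing experiments.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining: generate candidates of size k from frequent (k-1)-sets."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s for s in items if support(s) >= min_support}
    all_frequent, k = set(frequent), 2

    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets that produce k-itemsets
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset in sorted(apriori(transactions, min_support=0.6), key=len):
    print(set(itemset))
```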

An Intelligent Association Rule Mining Model for Multidimensional Data Representation and Modeling

Traditional association rule mining algorithms recognize frequent events in the form of itemsets; the most widely used example of association rule mining is Market Basket Analysis (Agrawal et al., 1993), which was among the first approaches to address the problem of pattern classification, here using a breast cancer dataset [14] from the database. The work on association rules was extended to patterns [1,2,11]: the authors explored data-cube-based [2] rule mining algorithms on multidimensional databases, where each tuple/transaction consists of multi-dimensional data features. In the area of multi-dimensional data sets [11], the authors discussed a multidimensional data model in which multidimensional data are viewed as values in a multidimensional space. Based on this model, efficient data mining has been performed using data cubes [2], with aggregates over dimensions computed in [9,10]. Rule mining is another well-studied data mining problem, and over the years many techniques have been designed to construct decision trees for mining patterns in data [8]. However, it is necessary to perform classification in addition to association rule mining for effective decision making. Therefore, this paper focuses on the integration of ARM with fuzzy rule mining for better decisions.
