For the use of data mining techniques in health care, we must transform the data according to the requirements. Heart attack related data records with 15 medical attributes (factors) were obtained from the Cleveland Heart Disease database. In order to set the transformation parameters, we must discuss the attributes corresponding to heart vessels. The LAD, RCA, LCX and LM numbers represent the percentage of vessel narrowing (or blockage) compared to a healthy artery. Attributes LAD, LCX and RCA were partitioned by cutoff points at 50% and 70%. In the cardiology field, a value of 70% or higher indicates significant coronary disease and a value of 50% indicates borderline disease. A value lower than 50% means the patient is healthy. The most common cutoff value used by the cardiology community to distinguish healthy from sick patients is 50%. The LM artery is treated differently because it poses a higher risk than the other three arteries. Attribute LM was partitioned at 30% and 50%. The reason behind these numbers is that both the LAD and the LCX arteries branch from the LM artery, so a defect in LM is more likely to cause a larger diseased heart region. That is, narrowing (blockage) in the LM artery is likely to produce more disease than blockages in the other arteries, which is why its cutoff values are set 20 points lower than those of the other vessels. The nine heart regions (AL, IL, IS, AS, SI, SA, LI, LA, AP) were partitioned into two ranges at a cutoff point of 0.2, meaning a perfusion measurement greater than or equal to 0.2 indicated a severe defect. Cholesterol (CHOL) was partitioned with cutoff points at 200 (warning) and 250 (high). These values correspond to known medical settings and support decision tree rules with numeric dimensions and automatic splits.
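The partitioning scheme above can be sketched as a simple discretization routine. The function and label names below are illustrative assumptions, not identifiers from the original study; only the cutoff values come from the text.

```python
def partition(value, cutoffs, labels):
    """Map a numeric value to a categorical label using ascending cutoffs."""
    for cutoff, label in zip(cutoffs, labels):
        if value < cutoff:
            return label
    return labels[-1]

# Vessel narrowing: LAD, LCX and RCA partitioned at 50% and 70%.
vessel_label = lambda pct: partition(pct, [50, 70], ["healthy", "borderline", "significant"])

# LM partitioned 20 points lower, at 30% and 50%.
lm_label = lambda pct: partition(pct, [30, 50], ["healthy", "borderline", "significant"])

# Perfusion measurements for the nine heart regions, cutoff 0.2.
region_label = lambda m: "severe" if m >= 0.2 else "normal"

# Cholesterol at 200 (warning) and 250 (high).
chol_label = lambda c: partition(c, [200, 250], ["normal", "warning", "high"])

print(vessel_label(75), lm_label(40), region_label(0.25), chol_label(210))
```

A decision tree induced on the raw numeric attributes would discover similar split points automatically; fixing them in advance keeps the rules aligned with the medical conventions cited above.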
Neural networks for data mining: In a traditional DBMS, data is stored in the form of structured records. When a query is submitted, the database system searches for and retrieves records that match the user's query criteria. Artificial neural networks offer an excellent way to perform intelligent query processing in large databases, especially for data retrieval and knowledge extraction based on partial matches. A neural network works differently: it does not need to identify empirical rules in order to make predictions. Instead, it builds a network by examining a database and by identifying and mapping all significant patterns and relationships that exist among the different attributes. The network then uses a particular pattern to predict an outcome; it tries to identify the individual mix of attributes that reveals that pattern. This process is repeated over a large amount of training data, progressively adjusting the weights so that patterns are matched more accurately. In this way, the patterns that exist among the attributes in the database can be identified, and the influence of each attribute can be quantified. Neural networks concentrate on identifying these patterns.
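The repeated weight-adjustment loop described above can be illustrated with a minimal sketch: a single logistic neuron trained by gradient descent. The data, learning rate and epoch count are invented for this sketch and are not from any particular system.

```python
import math, random

random.seed(0)
# Invented training data: class 1 when the two attribute values sum above 1.
data = [(x1, x2, 1.0 if x1 + x2 > 1.0 else 0.0)
        for x1, x2 in ((random.random(), random.random()) for _ in range(200))]

w1 = w2 = b = 0.0                          # the network's weights
lr = 0.5                                   # learning rate (assumed)

def predict(x1, x2):
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))

for _ in range(300):                       # repeated passes over training data
    for x1, x2, y in data:
        err = predict(x1, x2) - y          # how far the pattern match is off
        w1 -= lr * err * x1                # adjust weights toward better matches
        w2 -= lr * err * x2
        b  -= lr * err

accuracy = sum((predict(x1, x2) > 0.5) == (y == 1.0)
               for x1, x2, y in data) / len(data)
print(accuracy)
```

After training, the magnitudes of w1 and w2 quantify the influence of each attribute on the outcome, which is the quantification the paragraph refers to.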
Abstract: Due to the large number of influencing factors, it is difficult to predict earthquakes, which are natural disasters. Researchers are working intensively on earthquake prediction, since loss of life and property can be minimized by it. In this study, a system is proposed for earthquake prediction with data mining techniques. In the study, in which the Cross Industry Standard Process for Data Mining (CRISP-DM) approach has been used as the data mining methodology, seismic bump data obtained from mines has been analyzed. The extreme learning machine (ELM), an effective and rapid classification algorithm, has been used in the modeling phase. In the evaluation stage, different performance evaluation criteria such as classification accuracy, sensitivity, specificity and kappa value have been used. The results are promising for earthquake prediction.
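The evaluation criteria named above can all be computed directly from a 2×2 confusion matrix. The counts below are invented for illustration; only the metric definitions are standard.

```python
# Hypothetical confusion-matrix counts: true/false positives and negatives.
tp, fn, fp, tn = 40, 10, 5, 45
n = tp + fn + fp + tn

accuracy    = (tp + tn) / n
sensitivity = tp / (tp + fn)            # true-positive rate
specificity = tn / (tn + fp)            # true-negative rate

# Cohen's kappa: observed agreement corrected for chance agreement.
p_obs = accuracy
p_exp = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
kappa = (p_obs - p_exp) / (1 - p_exp)

print(accuracy, sensitivity, specificity, round(kappa, 3))
```

For these counts the values are 0.85, 0.8, 0.9 and a kappa of 0.7, i.e. substantial agreement beyond chance.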
The dynamics of geospatial objects is presented in the form of spatiotemporal trends of geographic data, which requires modeling of spatiotemporal dynamics. The model is visualized as a two-dimensional grid of m × n cells or pixels, where each cell has state variables. Spatial interaction represents the relationship between a cell and its surrounding cells, while temporal interaction represents the change of a cell's value from the present to the next time step. Several techniques have been studied for exploring knowledge of spatiotemporal dynamics, one of which is knowledge of spatiotemporal trends. Trend prediction is an analysis adopted in data mining which has a reference time. Trend prediction can be explored by applying regression analysis and artificial neural network models. Time series analysis can also provide a fairly accurate approach to knowledge extraction, using sequential patterns to extract the temporal correlation of data.
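A minimal sketch of such a cell model follows: each cell holds a state value, the spatial interaction comes from the four surrounding cells, and the temporal interaction maps the grid from the present step to the next. The averaging update rule is an invented illustration, not the model from the source.

```python
def step(grid):
    """One temporal update: each cell moves toward the mean of its 4-neighbours."""
    m, n = len(grid), len(grid[0])
    nxt = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            nbrs = [grid[x][y]
                    for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= x < m and 0 <= y < n]        # spatial interaction
            nxt[i][j] = 0.5 * grid[i][j] + 0.5 * sum(nbrs) / len(nbrs)
    return nxt                                           # temporal interaction

grid = [[0.0, 0.0, 0.0],
        [0.0, 9.0, 0.0],
        [0.0, 0.0, 0.0]]
print(step(grid)[1][1])   # the high-valued cell diffuses toward its neighbours
```

Iterating `step` produces the spatiotemporal trajectory from which trend knowledge would be extracted.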
Most of the heart failure predictive systems [10, 11] follow a best-model approach: they either tune the model or transform the available data for better performance of the predictive system. Such systems involve complex mathematical models and are difficult to convert into optimized applications; they lack the generality needed to be easily converted into a full-fledged product. Data transformation may result in either the addition of redundant information or the loss of valuable information from the data. In the proposed multi-model approach we try to bring in the required generality by using models in their absolute form. Open source data mining tools are used to generate such predictive models. We employ both classification and clustering techniques with the aim of modeling more information into the system. Classification models are built for prediction, and the errors of these models are clustered, which helps in deciding the participation of each model in the prediction. The proposed methodology was implemented in two phases: phase 1, development of predictive models and consolidation of the predictions using static weights; phase 2, development of a clustering-based error model serving the dual purpose of deciding the participation of a model in the prediction and calculating its dynamic weight.
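The consolidation idea can be sketched as a weighted vote: several classification models predict a class, and each model's weight is derived from its observed error (lower error, higher weight). The model outputs, error values and the simple 1 − error weighting rule below are assumptions of this sketch, not the exact scheme from the source.

```python
def weighted_vote(predictions, errors):
    """Combine class predictions using weights inversely related to model error."""
    weights = [1.0 - e for e in errors]      # phase-1 style static weights
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    return max(scores, key=scores.get)

# Three hypothetical models predict a class; their validation errors differ.
preds  = ["disease", "healthy", "disease"]
errors = [0.10, 0.40, 0.20]
print(weighted_vote(preds, errors))
```

In phase 2, the static error values would be replaced by errors looked up from the clustering-based error model, yielding dynamic weights per prediction.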
Globalization makes competition in supply chain management more intense. Pressure to improve efficiency, guarantee that goods arrive on time and reduce the cost of shipment has grown. Shipments pass through different continents and cultures, are dispersed around the world and encounter different conditions and risks. These risks are unexpected events that might disrupt the flow of materials or the planned operations. They can be due to late delivery, inaccuracy in forecasting, natural disasters like hurricanes and earthquakes, or sociocultural events like strikes. Effective use of supply chain risk management methods, which include risk identification, risk assessment, risk mitigation, and risk control, is important for the organization to survive. For that reason, I was part of a team in the XXX organization whose goal was to develop a predictive model to predict shipment delays for the company's customers.
Given the high relevance of the problem at hand, we chose to include six modelling algorithms which have proven effective in past studies, with the most recent performing better in our case, namely the neural network and the support vector machine. To assure a proper validation procedure, we adopted a k-fold cross-validation scheme with eight folds. As the support vector machine achieved an accuracy of 93.7% in predicting the malign cases, it was chosen for unveiling, through a sensitivity analysis, how each feature contributes to modelling breast cancer. It should also be stated that the value and contribution of this study lies in using real data from patients, which is usually not directly available to researchers, as it may pose confidentiality issues, considering it is personal data. Nevertheless, none of the patients were identified; only the tissue samples provided the features for our experiments.
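The eight-fold validation scheme can be sketched with the standard library alone. A trivial majority-class "model" stands in for the actual classifiers, and the dataset is invented; the point is only the fold rotation itself.

```python
import random

random.seed(1)
# Invented dataset: (feature, binary label) pairs.
data = [(random.random(), random.choice([0, 1])) for _ in range(80)]

k = 8
fold_size = len(data) // k
accuracies = []
for f in range(k):
    test  = data[f * fold_size:(f + 1) * fold_size]          # held-out fold
    train = data[:f * fold_size] + data[(f + 1) * fold_size:]
    # "Train" a stand-in model: predict the majority class of the training folds.
    majority = max((0, 1), key=[y for _, y in train].count)
    acc = sum(y == majority for _, y in test) / len(test)
    accuracies.append(acc)

print(sum(accuracies) / k)   # mean accuracy over the eight folds
```

Each sample is used for testing exactly once and for training k − 1 times, which is what makes the averaged accuracy a fair estimate of generalization.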
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods. Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction. Classification techniques are capable of processing a wider variety of data than regression and are growing in popularity. There are several applications for Machine Learning (ML), the most significant of which is data mining. People are often prone to making mistakes during analyses or, possibly, when trying to establish relationships between multiple features, which makes it difficult for them to find solutions to certain problems. Machine learning can often be successfully applied to these problems, improving the efficiency of systems and the designs of machines. Numerous ML applications involve tasks that can be set up as supervised learning. In the present paper, we have concentrated on the techniques necessary to do this. In particular, this work is concerned with classification problems in which the output of instances admits only discrete, unordered values. The next section presents various classification methods, Section III describes evaluating the performance of classifiers, and the last section concludes this work.
K-Nearest Neighbor (KNN) classification classifies instances based on their similarity. An object is classified by a majority vote of its neighbors, where k is always a positive integer. The neighbors are selected from a set of objects for which the correct classification is known. The training samples are described by n numeric attributes, so each sample represents a point in an n-dimensional space. In this way, all of the training samples are stored in an n-dimensional pattern space. When given an unknown sample, a k-nearest neighbor classifier searches the pattern space for the k training samples that are closest to the unknown sample, where "closeness" is defined in terms of Euclidean distance. The unknown sample is assigned the most common class among its k nearest neighbors. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in pattern space. In WEKA this classifier is called IBk.
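The classifier described above translates directly into a short function: compute Euclidean distances in the pattern space, take the k closest training samples, and return the majority class. The toy training points are invented for illustration.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; query: point; returns predicted label."""
    def dist(a, b):                      # Euclidean distance in n dimensions
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Invented 2-dimensional training samples with known classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))
```

With k = 1 the function degenerates to nearest-neighbour assignment, exactly as described in the text.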
Crime is classically unpredictable. It is not necessarily random, nor does it take place consistently in space or time. A good theoretical understanding is needed to provide practical crime prevention solutions tailored to specific places and times. Crime analysis takes past crime data to predict future crime locations and times. Crime prediction is a process that finds how the crime rate changes from one year to the next and projects those changes into the future. Crime predictions can be made through both qualitative and quantitative methods. Qualitative approaches to forecasting crime, such as environmental scanning and scenario writing, are useful in identifying the future nature of criminal activity. In contrast, quantitative methods are used to predict the future scope of crime and, more specifically, crime rates. A common method for developing forecasts is to project annual crime rate trends through time series models. This approach also involves relating past crime trends to factors that will influence the future scope of crime.
Data mining plays a vital role in the prediction of future crime occurrences. Much work has been reported in this area, mostly confined to predicting crime occurrence based on location, which helps to analyze the locations chosen by criminals. Analyzing crime data and finding the locations frequently used for a particular type of crime, the time of the crime, the social status of the people involved, crime links, etc. will be of more use. This helps law keepers to predict crime and take the necessary action.
Bhardwaj and Pal conducted a significant data mining study using the Naïve Bayes classification method on a group of BCA (Bachelor of Computer Applications) students in Dr. R. M. L. Awadh University, Faizabad, India, who appeared for the final examination in 2010. A questionnaire was collected from each student before the final examination, containing multiple personal, social, and psychological questions that were used in the study to identify relations between these factors and the students' performance and grades. Bhardwaj and Pal identified the main objectives of this study as: "(a) Generation of a data source of predictive variables; (b) Identification of different factors, which effects a student's learning behavior and performance during academic career; (c) Construction of a prediction model using classification data mining techniques on the basis of identified predictive variables; and (d) Validation of the developed model for higher education students studying in Indian Universities or Institutions". They found that the most influential factor for a student's performance is his grade in senior secondary school, which suggests that students who performed well in secondary school will also perform well in their Bachelor's study. Furthermore, it was found that living location, medium of teaching, mother's qualification, the student's other habits, family annual income, and family status all contribute highly to students' educational performance; thus, a student's grade, or performance in general, can be predicted if basic personal and social knowledge is collected about him or her.
Several methods, classified as theoretical, statistical and empirical, may be applied for forecasting physical phenomena. Theoretical models must be embedded in data mining or data assimilation codes for predicting solar events because of the large complexity of the problem (Hochedez, 2004). Statistical methods (Boffeta, 1999; Wheatland, 2001; Moon et al., 2001) are used extensively for predicting the probability of the occurrence of a solar flare on the next day. Wheatland presented a method for flare prediction using only observed flare statistics and the assumption that flares obey Poisson statistics in time and power-law statistics in size. The percentage probabilities are based on the number of flares produced by regions classified using the McIntosh classification scheme (McIntosh, 1990) during cycle 22. This approach is followed by the Flare Prediction System of the NASA Goddard Space Flight Center. Some statistical methods take dynamic flare data and static active region data as input, but none of these methods takes into account data about the dynamics of the active region (i.e. the area, the number of sunspots, the magnetic classification of the active region).
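The Poisson assumption mentioned above implies a simple forecast: if a region produces flares at a constant mean rate, the probability of at least one flare in the next interval t is 1 − exp(−rate · t). The rate value below is invented for the example and is not a quoted flare statistic.

```python
import math

rate_per_day = 0.5          # hypothetical mean flare rate for an active region
t = 1.0                     # forecast window: one day

# Poisson process: P(at least one event in time t) = 1 - exp(-rate * t).
p_flare = 1.0 - math.exp(-rate_per_day * t)
print(round(p_flare, 3))
```

In a method like Wheatland's, the rate itself would be estimated from the observed flare counts of regions in each McIntosh class rather than assumed.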
From the above table we found that all of the top 20 rules revolve around only 8 of the 17 metrics: UWCS, INST, LMC, NOM, AVCC, LCOM2, CBO and FOUT; the other 9 metrics are dropped from the analysis. According to our objective, we want to find all such metrics as predictors using association mining, but in our problem we are finding the frequent software metrics in every class. From observation we found that if one metric appears in the antecedent of a rule and another metric appears in its consequent, there is no need to use both metrics of the relation when developing the fault prediction model, because they can share the same type of information in the prediction of faults. After focusing further on the generated top 20 rules, we found that:
of the null hypothesis will always be almost zero. The Bonferroni correction, a statistical theorem that gives a statistically naive way to avoid these false positive responses to a search through the data, has been used widely in the past with large datasets. However, avoiding false discoveries is still an active research area, and several new methods have been proposed in the last two decades (Benjamini and Hochberg, 1995;
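As a concrete illustration of the Bonferroni correction: with m hypotheses tested at family-wise level alpha, each individual test is compared against alpha / m. The p-values below are invented for the example.

```python
alpha = 0.05
p_values = [0.0004, 0.011, 0.039, 0.176]    # hypothetical test results
m = len(p_values)

threshold = alpha / m                       # Bonferroni-adjusted per-test level
significant = [p for p in p_values if p < threshold]
print(threshold, significant)
```

Note how 0.039, which would pass an uncorrected 0.05 test, is rejected once the search over m hypotheses is accounted for; this conservativeness is exactly what newer procedures such as Benjamini-Hochberg try to relax.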
A heat wave is a meteorological event reaching extreme dry bulb temperatures that may impact animal production. These events have become more frequent lately due to global climatic changes; however, very little is known about their impact on Brazilian broiler production. COPA/COGECA (2004), an agricultural committee from the European Union that produces a report about European heat wave impact and losses in agriculture, shows general economic losses of 15-30% in poultry production for the heat wave that hit European producer countries in 2003. St-Pierre et al. (2003) estimated that in the United States the production loss can reach 128 million dollars when environmental conditions depart from the thermoneutral zone, based on the temperature and humidity index (THI) calculated from meteorological station databases.
In 1947, the Harvard Mark II was being tested by Grace Murray Hopper and her associates when the machine suddenly stopped. Upon inspection, the error was traced to a dead moth that was trapped in a relay and had shorted out some of the circuits. The insect was removed and taped to the machine's logbook. This incident is believed to have coined the use of the terms "bug", "debug" and "debugging" in the field of computer science. Since then, the term debugging has been associated with the process of detecting, locating and fixing faulty statements in computer programs. In software development, a large amount of resources is spent in the debugging phase. It is estimated that testing and debugging activities can easily range from 50 to 75 percent of the total development cost. This is due to the fact that the process of detecting, locating and fixing faults in the source code is not trivial and is error-prone. Even experienced developers are wrong almost 90% of the time in their initial guess while trying to identify the cause of a behavior that deviates from the intended one.
To that end, in the static analysis the application intends to construct a function call graph (FCG), used to determine all function call paths to sensitive APIs, and an activity call graph (ACG), used to define the activity related to the sensitive source functions. Combining these data, an expected activity switch path is created and then used in the dynamic analysis stage, together with the UI interaction simulator, to determine which UI elements can trigger this behavior in the application. With this information, the application automatically analyses the system applications, searching for malicious behaviors. Although the presented results seem positive for simple UI-based indirect trigger conditions, as stated by the authors the system still cannot reveal some complex indirect conditions relating the device user's behavior to the generated system calls.
DM with time series data is popular and many applications can be found in the literature, for instance, for earthquake forecasting, characterization of ozone behavior, or flood prediction. Another application example is that of financial decision making. A decision support tool for financial forecasting, named EDDIE, is presented in . In , a new architecture that implements a binary neural network, AURA, to produce discrete probability distributions as forecasts, using high-frequency data sets, is presented. The use of support vector machines and back propagation neural networks to predict credit ratings is presented in .
Sometimes the knowledge that results from the method proposed herein can be huge, and if the process is not fully automated it can be an exhaustive task to analyse these results. This suggests further research related to DOKS's ability to deal with the growing size of the data used in the Ontology Learning process, which can be identified in three areas: (i) the speed of processing large sets of data, as it can be really slow; research can be done on methods that, for instance, take advantage of multi-core processor technology and use parallelization techniques to improve the speed of the matching process; (ii) the way results are presented for evaluation by an expert; although this work provided a colour scheme to represent the strength of the matching process (refer to Table 4.3 above), the way in which the results are shown can be improved by using more graphics (e.g. graphs to represent relations), which will provide better efficiency of the method itself and allow faster reasoning about the results; (iii) methods to process large and complex sets of data, also known as Big Data. Big Data is the emerging subdomain of data mining that studies solutions to the problem of big and complex data sets.