INTRODUCTION TO DATA MINING

Chiara Renso

KDDLab - ISTI – CNR, Italy

http://www-kdd.isti.cnr.it

Data Mining UFPE, June 2012

Knowledge Discovery and Data Mining Laboratory,

ISTI – National Research Council, Pisa, Italy

Analysis of Complex Data:

•  Mobility Data Mining

•  Complex and Social Networks

•  Privacy

Collaboration with UFSC in an EU project called SEEK

www.seek-project.eu/

THE SCHEDULE

Day 1

  Introduction to Data Mining and Data Preprocessing

  Classification

  Practice with Weka

Day 2

  Clustering

  Association analysis

  Practice with Weka

Day 3

  Practice & Advanced issues: semantic data mining and trajectory data mining

MATERIAL FOR THE COURSE

Slides have been adapted from the slides accompanying the book Introduction to Data Mining, by Tan, Steinbach, and Kumar

http://www-users.cs.umn.edu/~kumar/dmbook/

Some sample book chapters can be downloaded from this site

Practice with the Opensource Weka Tool: Data Mining Software in Java

http://www.cs.waikato.ac.nz/ml/weka/

Weka 3.6 is the latest stable version of Weka at the time of this course (June 2012).

DATA MINING: INTRODUCTION

“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”

David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001

WHY MINE DATA? COMMERCIAL VIEWPOINT

  Lots of data is being collected and warehoused

    Web data, e-commerce

    purchases at department/grocery stores

    Bank/Credit Card transactions

  Computers have become cheaper and more powerful

  Competitive Pressure is Strong

    Provide better, customized services for an edge (e.g. in Customer Relationship Management)

WHY MINE DATA? SCIENTIFIC VIEWPOINT

  Data collected and stored at enormous speeds (GB/hour)

    remote sensors on a satellite

    telescopes scanning the skies

    microarrays generating gene expression data

    scientific simulations generating terabytes of data

  Traditional techniques infeasible for raw data

  Data mining may help scientists

    in classifying and segmenting data

    in Hypothesis Formation

MINING LARGE DATA SETS - MOTIVATION

  There is often information “hidden” in the data that is not readily evident

  Human analysts may take weeks to discover useful information

  Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts. From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”]

WHAT IS DATA MINING?

Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

The Knowledge Discovery Process - KDD

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

WHAT IS (NOT) DATA MINING?

What is Data Mining?

–  Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)

–  Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

What is not Data Mining?

–  Look up phone number in phone directory

–  Query a Web search engine for information about “Amazon”
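
The contrast above can be illustrated with a small sketch. It assumes a hypothetical toy table of (surname, city) records and a made-up phone directory; `lookup_phone` is a plain directory query, while `surname_lift` searches for surnames that are over-represented in one location, which is closer in spirit to the first data mining example. The data, the function names and the choice of a simple ratio as the measure are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Hypothetical toy records: (surname, city). Invented for illustration only.
RECORDS = [
    ("O'Brien", "Boston"), ("O'Reilly", "Boston"), ("O'Brien", "Boston"),
    ("Smith", "Boston"), ("Smith", "Dallas"), ("Garcia", "Dallas"),
    ("Garcia", "Dallas"), ("O'Brien", "Dallas"),
]

# Hypothetical phone directory.
DIRECTORY = {"Smith": "555-0100", "Garcia": "555-0101"}


def lookup_phone(name):
    """Not data mining: a direct query for a known key."""
    return DIRECTORY.get(name)


def surname_lift(records, city):
    """Ratio of a surname's share within `city` to its overall share.

    Values above 1 mean the surname is more prevalent in `city`
    than in the data set as a whole.
    """
    overall = Counter(name for name, _ in records)
    local = Counter(name for name, c in records if c == city)
    n_all = len(records)
    n_city = sum(local.values())
    return {
        name: (local[name] / n_city) / (overall[name] / n_all)
        for name in local
    }


if __name__ == "__main__":
    print(lookup_phone("Smith"))
    for name, lift in sorted(surname_lift(RECORDS, "Boston").items()):
        print(f"{name}: {lift:.2f}")
```

The lookup returns something the user already knew how to ask for; the ratio surfaces a pattern nobody specified in advance.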

ORIGINS OF DATA MINING

 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

 Traditional Techniques may be unsuitable due to

  Enormity of data

  High dimensionality of data

  Heterogeneous, distributed nature of data

[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

DATA MINING TASKS

Prediction Methods

  Use some variables to predict unknown or future values of other variables.

Description Methods

  Find human-interpretable patterns that describe the data.
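
A minimal sketch of the two kinds of method, assuming a hypothetical toy data set of (age, income, buys) tuples. `predict_buys` uses a 1-nearest-neighbour rule as one possible predictive method, and `describe` summarises each class as one possible descriptive output. The data, names and choice of techniques are assumptions made for illustration; the slides do not prescribe them.

```python
from statistics import mean

# Hypothetical toy data: (age, income in thousands, buys). Invented values.
DATA = [
    (25, 30, "no"), (30, 40, "no"), (45, 80, "yes"),
    (50, 90, "yes"), (35, 60, "yes"), (22, 20, "no"),
]


def predict_buys(age, income, data=DATA):
    """Predictive: use known variables (age, income) to guess an unknown one."""
    def sq_dist(row):
        # Squared Euclidean distance between the query and a stored row.
        return (row[0] - age) ** 2 + (row[1] - income) ** 2

    nearest = min(data, key=sq_dist)
    return nearest[2]


def describe(data=DATA):
    """Descriptive: a human-readable summary of each class."""
    summary = {}
    for label in sorted({row[2] for row in data}):
        rows = [r for r in data if r[2] == label]
        summary[label] = {
            "count": len(rows),
            "mean_age": mean(r[0] for r in rows),
            "mean_income": mean(r[1] for r in rows),
        }
    return summary


if __name__ == "__main__":
    print(predict_buys(48, 85))
    print(describe())
```

The first function answers a question about an unseen record; the second only restates what is already in the data in a more compact form.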

DATA MINING TASKS ...

 Classification [Predictive]

 Clustering [Descriptive]

 Association Rule Discovery [Descriptive]

 Sequential Pattern Discovery [Descriptive]

 Regression [Predictive]

 Deviation Detection [Predictive]
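
As one concrete instance of the descriptive tasks listed above, the sketch below computes the two standard measures used in association rule discovery, support and confidence, over a hypothetical list of market-basket transactions. The transactions and function names are invented for illustration, and the code assumes the antecedent occurs in at least one transaction.

```python
# Hypothetical market-basket transactions, invented for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`.

    Assumes `antecedent` appears in at least one transaction;
    otherwise this divides by zero.
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


if __name__ == "__main__":
    rule = ({"milk", "diapers"}, {"beer"})
    print(support(rule[0] | rule[1], TRANSACTIONS))
    print(confidence(rule[0], rule[1], TRANSACTIONS))
```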

CHALLENGES OF DATA MINING

  Scalability

  Dimensionality

  Complex and Heterogeneous Data

  Data Quality

  Data Ownership and Distribution

  Privacy Preservation

  Streaming Data

  ….
