CETAX - All Rights Reserved. Hadoop Weekend

(1)

Hadoop Weekend

(2)

AGENDA DO EVENTO

PARTE 1 – BIG DATA

 Conceitos e Fundamentos

 Ferramentas e Arquitetura

 Aplicações de Big Data

PARTE 2 – HADOOP

 Arquitetura Hadoop

 HDFS – Map Reduce

 YARN

 Blocks

 Comandos HDFS

PARTE 3 – INGESTÃO DE DADOS

 Sqoop

 Pig

 Hive

PARTE 4 – PROCESSANDO DADOS

 Hive

 Spark

(3)

• A Cetax é uma empresa de consultoria e treinamento especializada em sistemas de Business Intelligence e Data Warehouse.

• Existe desde 2000 trabalhando exclusivamente com BI e DW.

• Nossos treinamentos são exclusivos sem cursos semelhantes no Brasil

• Outros cursos são ministrados em parcerias com outras empresas do mercado ou mesmo profissionais que possuem experiência diferenciada

• Parceria Hortonworks (Hadoop), Talend (ETL), Tibco (Analytics)

APRESENTAÇÃO CETAX

(4)

MARCO ANTONIO GARCIA

• 20 anos de experiência em TI, sendo 15 exclusivamente com Inteligência - Business Intelligence e Data Warehouse.

• MBA pela FGV, Formado pela FATEC em Processamento de Dados.

• Certificado pelo Kimball University nos EUA, onde teve aula pessoalmente com Ralph Kimball, um dos principais gurus do data Warehouse, treinamentos

realizados no TDWI, maior entidade de pesquisa de Data Warehouses do mundo.

• Vivência profissional em diversos projetos, passando por Bancos e Financeiras,

APRESENTAÇÃO - INSTRUTOR

(5)

A PALAVRA DO MOMENTO

(6)

LITERATURA DISPONÍVEL - BRASIL

(7)

LITERATURA DISPONÍVEL - EUA

(8)

• Muitas definições podem cercar o assunto : – Alto Volume.

– Alta Velocidade.

– Diversas Fontes.

• Uma combinação de tudo isso e muito mais.

• Assim como BI, é um termo “guarda-chuva”.

BIG DATA = GRANDES DADOS?

(9)

• Além dos sistemas utilizados em empresas de todos os portes, temos milhares de outros dispositivos que geram dados diariamente :

– Em 2010 existiam 5 bilhões de celulares no mundo.

– Um avião Boeing pode gerar até 20 TB/hora para seus engenheiros examinar em tempo real.

– Em pouco tempo teremos muito mais equipamentos ligados a internet gerando informações para análise “internet das coisas”

MUITOS DADOS GRANDES

(10)

• Big Data = Big Business

• Todos os negócios de tecnologia serão beneficiados pelo Big Data

BIG DATA = BIG BUSINESS

(11)

• Volume – o volume crescente de dados em todas as áreas e empresas, Mb ->

Gb -> Tb -> Pb

• Velocidade – o tempo necessário para disponibilizar os dados para análise é cada vez menor

• Variedade – a variedade de dados é cada vez maior, sensores, imagens, dados não estruturados ou semi estruturados.

3 Vs – UMA DEFINIÇÃO

(12)

3 Vs – DETALHES

(13)

• VOLUME

• VELOCIDADE

• VARIEDADE

• VIRTUDE

• VALOR

5 Vs – UMA DEFINIÇÃO

(14)

• VAST - VASTO, AMPLO

• VOLUME - ALTO VOLUME

• VIGOROSITY - VIGOR

• VERIFIED - VERIFICADOS

• VEXINGLY - “ATORMENTADOR”

• VARIABLE - VARIAVEIS

• VERBOSE - “ELOQUENTE”

• VALUABLE - VALIOSOS

• VISUALIZED - VISUALIZADOS

• VELOCITY - VELOCIDADE

10 Vs – É NECESSÁRIO?

(15)

• Big Data representa um conjunto de dados que não pode mais ser facilmente gerenciado ou analisado com as ferramentas atuais de dados, métodos ou arquitetura disponível até então.

BIG DATA – DEFINIÇÃO SIMPLES

(16)

• E então ?

• Quais softwares serão utilizados ?

• Quais devo aprender ?

E ENTÃO?

(17)

(18)

(19)

• “Dados são o novo Petróleo”

• No ano de 2012 a

• Como petróleo, precisam ser refinados !

DATA IS THE NEW OIL!

(20)

• Web log

• Click stream

• Sensor data

• Email

• Call center voice logs

• Images/video

• Dados RFID

• Dados de Localização e Geográficos

FONTES PARA O BIG DATA

(21)

CURVA DE ADOÇÃO DO BIG DATA

(22)

• Serviços de Informação – Imagens de Satélites

• Varejo – Otimização de Preço, Inteligência de Vídeo

• Utilities/Utilidades – Consumo em tempo real

• Propaganda – Content Targeting OnLine

• Seguros – Detecção de Fraudes

• Saúde – Diagnósticos e Detecção.

• Manufatura – Controle de Qualidade, Controle de Máquinas

EXEMPLOS POR SETOR

(23)

• Machine Learning

• Sentiments

• Text Processing

• Image Processing

• Video Analytics

• Log Parsing

• Collaborative Filtering

• Context Search

• Email & Content

APLICAÇÕES POSSÍVEIS – BIG DATA

(24)

• Machine Learning :

– Aprendizado de Maquina

– Sistemas que podem aprender com os dados e tomar decisões de acordo com modelos previamente estudados.

– Existem diversos algoritimos que podem ser aplicados em Machine Learning.

APLICAÇÕES POSSÍVEIS – BIG DATA

(25)

• Sentiments :

– Sentimentos ou Análise de Sentimentos

– Também conhecida como Mineração de Opinião ou Opnion Mining

– São algoritimos utilizados para identificar o “sentimento”

através de uma informação textual.

APLICAÇÕES POSSÍVEIS – BIG DATA

(26)

• Text Processing:

– Processamento de Textos

– Pode ser utilizado para identificação, reconhecimento ou análise de dados através de textos processados.

– Pode ser usado na identificação de comportamentos ou sentimentos

APLICAÇÕES POSSÍVEIS – BIG DATA

(27)

• Image Processing / Video Analytics:

– Imagens e Videos podem ser analisados em ferramentas de Big Data.

– As imagens podem ser decompostas e atraves de processos de reconhecimento, identificar pessoas, padroes suspeitos ou mesmo padroes que possam ser analíticos.

– Os videos podem ser decompostos em uma série de imagens e essas imagens processadas.

APLICAÇÕES POSSÍVEIS – BIG DATA

(28)

• Log Parsing

• Collaborative Filtering

• Context Search

• Email & Content

APLICAÇÕES POSSÍVEIS – BIG DATA

(29)

• Load ( Carga )

• Structure ( Estrutura )

• Response ( Resposta )

• Complex Workload ( Processamentos Complexos )

• Economics ( Retorno ao Investimento )

REQUERIMENTOS DO BIG DATA

(30)

DADOS POR VALIDADE PARA BIG DATA

(31)

• Bancos de Dados Relacionais

• Ferramentas de ETL

• Ferramentas de BI

• Mas esse é o melhor ferramental?

• Com certeza não ! COMO ATENDEMOS HOJE?

(32)

• Novos Tipos de Dados

• Novos Volumes

• Novas Análises

• Novas Cargas de Processamento

• Novos Métodos

Tudo isso = Menos Performance !

QUAL O IMPACTO DE TUDO ISSO NO DW ATUAL?

(33)

Categorias Novas Tecnologias

Infra Estrutura Big Data e Data Warehouse Appliances Tecnologias In-Memory

SSD Storage Fast Networks Cloud Computing Tecnologias Móveis

Softwares In-Memory Databases

Hadoop, Cassandra e NoSQL Databases Colunar DBMS

ETLs com integração Hadoop – Informatica, Talend, etc.

Algoritimos Mahout

Arquiteturas Pré- Configuradas

IBM, Teradata, Kognitio, EMC, Cloudera, HortonWorks, Cirro, Intel, Cisco UCS, Pivotal, Oracle , MapR

INOVAÇÕES SÃO NECESSÁRIAS

(34)

• NoSQL = Not Only SQL

• Baseado no Teorema CAP

• Consistency

• Availability

• Partition Tolerance

• Usualmente não requer tabelas fixas no schema nem sempre usa o conceito de join

• Todos NoSQL “relaxa” uma ou mais as propriedades do ACID NOSQL

(35)

• Consistency: todos os nodes vem os mesmos dados ao mesmo tempo

• Availability: Garantia que cada request receba uma resposta de dados

• Partition Tolerance: O sistema continua funcionando mesmo com alguma perda de partição ou partes do sistema

TEOREMA CAP

(36)

• Alguma variações de NoSQL Databases :

• XML ( myXMLB, Tamino, Sedna )

• Wide Column (Cassandra, Hbase, Big Table)

• Key/Value (Redis, Memcached comBerkley DB)

• Graph ( neo4j, InfoGrid )

• Document Store ( CouchDB, MongoDB ) NOSQL

(37)

PANORAMA DE BANCO DE DADOS

(38)

UMA APOSTA? HADOOP!

(39)

YARN é uma re-arquitetura do Hadoop, que permite multiplas aplicações rodarem na mesma plataforma.

HADOOP 1.x

HDFS

(redundant, reliable storage)

MapReduce

(cluster resource management & data processing)

HDFS

(redundant, reliable storage)

YARN

(cluster resource management)

MapReduce

(data processing)

Others

(data processing)

HADOOP 2.x

HADOOP 1 x HADOOP 2 (YARN)

(40)

• Criado dentro do Yahoo em 2005.

• O nome Hadoop não é um acrônimo, é o nome do Elefante de brinquedo do filho de Doug Cutting.

• O Hadoop tem como base um file system para armazenamento rápido e barato de dados !

• Tem como objetivo o processamento de grandes datasets de dados HADOOP SURGE COMO ALTERNATIVA

(41)

• Apache Software Foundation

• Cloudera

• HortonWorks

• MongoDB

• IBM Big Insights

• EMC Pivotal

• Teradata Aster

• Oracle Big Data Appliance

• Intel Hadoop Distribution

• MapR

• Datastax

• Rainstor

• QueryIO PLAYERS DE MARCADO

(42)

• Estudo disciplinado dos dados e informações inerentes ao negócio e todas as visões que podem cercar um determinado assunto.

• Ciência que estuda as informações, seu processo de captura, transformação, geração e análise de dados.

• A Ciência de dados envolve diversas disciplinas como :

• Computação

• Estatística

• Matemática CIÊNCIA DE DADOS

(43)

CIÊNCIA DE DADOS

(44)

REQUISITOS ESPERADOS DO CIENTISTA DE DADOS

(45)

• Profissional Multidisciplinar responsável por transformar dados em informações ou produtos de informações dentro de uma corporação.

• Deve ser responsável pela formulação dos problemas, escolha de modelos de simulação e estatística e entrega dos produtos de dados.

DEFINIÇÃO – CIENTISTA DE DADOS

(46)

• Data Scientist – Participa da formulação do problema, hipóteses de resolução e análise de resultados.

• Business Analyst – Analisa os dados gerados em relação ao negócio ou empresa avaliada

• Data Analyst – responsável por analisar os dados disponibilizados em busca de solução para o problemas enfrentados

DATA SCIENTIST x BUSINESS ANALYST x DATA ANALYST

(47)

• Para trabalhar com Big Data acreditamos que o melhor caminho seria conhecer as ferramentas utilizadas

• Ter perfil misto : técnico e negócios

• Conhecer de Business Inteligence e Data Warehouse

• Entender os processos da empresa

• Conhecer estatística e matemática QUERO TRABALHAR COM BIG DATA

(48)

• Vemos 3 papéis claros:

• Cientista ou Analista de Dados

• Desenvolvedor

• Administrador PAPEIS E FUNÇÕES

(49)

• Responsável por atender as demandas das áreas de negócio ou planejamento da empresa.

• Participa da formulação dos problemas e respostas.

• Nível mais próximo ao negócio

• Deve conhecer as ferramentas de consulta e acesso aos dados.

• Deveria conhecer estatística ANALISTA DE DADOS

(50)

• Responsável por Desenvolver os processos necessários para geração dos dados.

• Processos de Captura, Transformação e Carga de Dados.

• Deve conhecer tecnicamente as ferramentas envolvidadas

• Deve conhecer sobre programação

• Será responsável pelo desenvolvimento de novas rotinas e processos.

DESENVOLVEDOR

(51)

• Responsável por manter os ambientes e ferramentas funcionando da melhor maneira.

• Deve conhecer sobre os sistemas operacionais utilizados, principalmente Linux.

• Deve conhecer sobre arquitetura de hardware e redes para garantir a melhor performance.

• Deve conhecer sobre os processos de Tunning das ferramentas.

ADMINISTRADOR

(52)

• Programação – as ferramentas ainda são pouco automatizadas na geração de código.

• Linux – a maioria dos softwares rodam em Linux, é necessário conhecer comandos básicos para execução de processos.

• Modelagem de Dados

CONHECIMENTOS TÉCNICOS PARA TRABALHAR COM BIG DATA

(53)

• Conhecer sobre o negócio ou sobre os processos da empresa.

• Conhecer ou ter noções mínimas de estatística e matemática aplicada a dados.

CONHECIMENTOS TÉCNICOS PARA TRABALHAR COM BIG DATA

(54)

Hadoop Weekend

PARTE 2 – ENTENDENDO HADOOP

ARMAZENAMENTO E PROCESSAMENTO

(55)

Atividade

Configurar acessos as VM’s

( utilizar Apêndice do Caderno do Aluno )

(56)

Exercício 1

Colocando a VM no ar e conhecendo suas

principais estruturas

(57)

+

/directory/structure/in/

memory.txt

Resource management + scheduling

Disk, CPU, Memory

Core

NameNode

HDFS

ResourceManager

YARN

Hadoop daemon

User

application

NN RM

DataNode HDFS

NodeManage r YARN

Worker Node

(58)

“I have a 200 TB file that I need to store.”

“Wow - that is big data! I will need to distribute

that across a cluster.”

Hadoop Client

HDFS

“Sounds risky! What happens if a drive

fails?”

“No need to worry! I am designed for failover.”

WHAT IS HDFS?

(59)

fsimage edits

3. A client application creates a new file in HDFS.

hadoop fs -put foo.log bar/foo.log

4. The NameNode logs that transaction in the edits file.

1. When the NameNode starts, it reads fsimage and edits files from disk.

2. The transactions in edits are merged with fsimage, and edits is emptied.

NameNode

hdfs journals hdfs snapshots

THE NameNode

(60)

“I’m still here! This is my latest heartbeat.”

“I’m here too! And here is my latest

heartbeat.”

“Hey DataNode1, Replicate block 123 to

DataNode 3.”

NameNode

DataNode 1 DataNode 2 DataNode 3 DataNode 4

THE NameNodes

(61)

1. Client sends a request to the NameNode to add a file to HDFS.

3. For every block, the client will request the NameNode to provide a new blockid and a list of destination DataNodes.

4. The client will write the block directly to the first DataNode in the list.

NameNode

DataNode 1 DataNode 2 DataNode 3

2. NameNode gives client a lease to the file path.

(62)

Exercício 2

Entendendo HDFS e Bloco de Armazenamento

(63)

Break a large problem into sub-solutions

Map Process

Map Process Data

Data Data

Data Data Data

Data Data

Data Data

Data

Data Map Process

Reduce Process

Data

Read & ETL

Shuffle & Sort

Aggregation

Data Data Data

Data Data

Data

WHAT IS MAPREDUCE?

(64)

HDFS

constitution.txt The mappers read the file’s blocks from HDFS line-by-line 1

We the people, in order to form a...

The lines of text are split into words and output to the reducers

2

The shuffle/sort phase combines pairs with the same key

3 The reducers add up the

“1’s” and output the word and its count

4

<We, 1>

<the,1>

<people,1>

<in,1>

<order, 1>

<to,1>

<form,1>

<a,1>

<We, (1,1,1,1)>

<the, (1,1,1,1,1,1,1,...)>

<people,(1,1,1,1,1)>

<form, (1)>

<We,4>

<the,265>

WORDCOUNT IN MAPREDUCE

(65)

Data is shuffled across the network

and sorted

Map Phase

Shuffle/Sort

Reduce Phase

DataNode

Mapper

DataNode

Mapper

DataNode

Mapper

DataNode

Reducer

DataNode

Reducer

**SELECT word, COUNT(*) FROM constitution**

WHERE….

GROUP BY (word)

ORDER BY JOIN

DISTINCT

Hive/Pig compile as Reduce side function

Examples of more Reduce side

(66)

Input split

The map method outputs <k2,v2>

MapOutputBuffer

<k2,v2> <k2,v2>

Records are sorted and spilled

to disk when the buffer reaches a

threshold Spill files are merged into a

single file 1

Mapper

3 4

The InputFormat generates <k1,v1>

pairs

2 5

6

Mapper output = Reducer input

7

(67)

1. The Reducer fetches the data

In-memory

buffer Spill files Merged

input

Reducer In-memory

buffer Spill files Merged

input

2.

3.

4. 5.

Mapper output = Reducer input

NodeManager

Mapper output = Reducer input

NodeManager

HDFS

NodeManager

(68)

Client submits an application

1 The AsM finds an appropriate

NodeManager

2

ResourceManager

NodeManager

NodeManager creates an ApplicationMaster

3

ApplicationMaster

Containers execute their given task on NodeManagers in the

AM asks RM for resources. RM provides a list of Containers to

AM

4

5

LIFECYCLE OF A YARN APPLICATION

(69)

NodeManager Container

Node 1

ResourceManager

Scheduler AsM

NodeManager AM

Node 2

Node 3

Node 4

NodeManager

Node 5

NodeManager AM

Node 6

NodeManager

Container

Node 7

NodeManager

Node 8

NodeManager

Node 9

Container

A CLUSTER VIEW EXAMPLE

(70)

Exercício 3

Executando Wordcount com MapReduce

(71)

Hadoop Weekend

PARTE 3 – INTEGRAÇÃO DE DADOS

(72)

MapReduce

WebHDFS hadoop fs -put

Vendor Connectors

Hadoop nfs gateway

Hue Explorer

INTEGRAÇÃO DE DADOS

(73)

• The put command to uploading data to HDFS

• Perfect for inputting local files into HDFS – Useful in batch scripts

• Usage:

hadoop fs –put mylocalfile /some/hdfs/path

• POSIX utility commands such as ls, mv, cp, touch, cat, mkdir are also supported

• Full list of commands hadoop fs

THE HADOOP CLIENT

(74)

Relational Database

Enterprise Data Warehouse

Document-based Systems

1. Client executes a sqoop command

2. Sqoop executes the command as a MapReduce job on the

3. Plugins provide connectivity to various data sources

Hadoop Cluster Map

tasks

OVERVIEW OF SQOOP: DATABASE IMPORT/EXPORT

(75)

Exercício 4

Carregando Dados com Sqoop

(76)

Hadoop Weekend

PARTE 4 – PROCESSAMENTO DE DADOS

(77)

• An engine for executing programs on top of Hadoop

• It provides a language, Pig Latin, to specify these programs HADOOP ECOSYSTEM: PIG

(78)

• Maybe we want to join two datasets, from different sources, on a common value, and want to filter, and sort, and get top 5

WHY USE PIG?

(79)

Exercício 5

Executando Scripts com Pig

(80)

Sensor Weblog

Store and query all data in Hive

Use existing SQL tools and existing SQL processes

NN RM

HADOOP ECOSYSTEM: HIVE

(81)

• Data warehouse system for Hadoop

• Create schemas/table definitions that point to data in Hadoop

• Treat your data in Hadoop as tables

• SQL 92

• Interactive queries at scale WHAT IS HIVE?

(82)

SQL Datatypes SQL Semantics

INT SELECT, LOAD, INSERT from query

TINYINT/SMALLINT/BIGINT Expressions in WHERE and HAVING

BOOLEAN GROUP BY, ORDER BY, SORT BY

FLOAT CLUSTER BY, DISTRIBUTE BY

DOUBLE Sub-queries in FROM clause

STRING GROUP BY, ORDER BY

BINARY ROLLUP and CUBE

TIMESTAMP UNION

ARRAY, MAP, STRUCT, UNION LEFT, RIGHT and FULL INNER/OUTER JOIN

DECIMAL CROSS JOIN, LEFT SEMI JOIN

CHAR Windowing functions (OVER, RANK, etc.)

VARCHAR Sub-queries for IN/NOT IN, HAVING

HIVEQL

(83)

User issues SQL query

Hive parses and plans query Query converted to MapReduce and executed on Hadoop

2 3

Web UI JDBC /

ODBC CLI Hive

SQL

1

1 HiveServer2 Hive

MR/Tez Compiler Optimizer

Executor 2

Hive

MetaStore

(MySQL, Postgresql, Oracle)

MapReduce or Tez Job

Data Data

Data

Hadoop Data-local processing 3

HIVE ARCHITECTURE

(84)

• Hive component

• Glue between Pig & Hive

– Schema visibility to Pig Scripts & MapReduce

• REST API to

– Access Hive schemas – Submit DDL

– Launch Hive queries – Launch Pig jobs – Launch MR

– Notifications to message broker

HADOOP ECOSYSTEM: HCatalog

(85)

Hive ODBC Driver

BI Tools Analytics Reporting

OVERVIEW OF THE HIVE ODBC DRIVER

(86)

Hadoop Weekend

PARTE 5 - INTRODUCING APACHE SPARK

(87)

• Apache open source project, originally developed at AmpLab at UC-Berkeley – 2009: Research project; BDAS (Berkley Data Analysis Stack)

– Jun 2013: Accepted into Apache Incubator – Feb 2014: Became a top-level Apache project – Dec 2014: Included in HDP 2.2

• A general data processing engine, focused on in-memory distributed computing use-cases

• APIs in Scala, Python and Java

– Recently API for R was introduced

WHAT IS APACHE SPARK?

(88)

THE SPARK ECOSYSTEM

(89)

• Elegant Developer APIs: Data Frames/SQL, Machine Learning, Graph algorithms and streaming

– Scala, Python, Java and R

– Single environment for importing, transforming, and exporting data

• In-memory computation model

– Effective for iterative computations

• High level API

– Allows users to focus on the business logic and not internals WHY SPARK?

(90)

• Supports wide variety of workloads – Mllib for Data Scientists

– Spark SQL for Data Analysts

– Spark Streaming for micro batch use cases

– Spark Core, SQL, Streaming, Mllib, and GraphX for Data Processing Applications

• Integrated fully with Hadoop and an open source tool

• Faster than MapReduce WHY SPARK CONT.

(91)

• Higher level API

• In-memory data storage

– Up to 100x performance improvement

pyspark

Java MapReduce

SPARK vs MAPREDUCE

(92)

• Why is Spark faster?

– Caching data to memory can avoid extra reads from disk – Scheduling of tasks from 15-20s to 15-20ms

– Resources are dedicated the entire life of the application

– Can link multiple maps and reduces together without having to write intermediate data to HDFS

– Every reduce doesn’t require a map SPARK vs MAPREDUCE CONT

(93)

Hadoop Weekend

PARTE 6 - PROGRAMMING WITH APACHE SPARK

(94)

• The Spark Shell provides an interactive way to learn Spark, explore data, and debug applications

• Available for python and scala – pyspark

– spark-shell

• REPL

HOW TO START USING APACHE SPARK?

(95)

• Main entry point for Spark applications

• All Spark applications require one

• The SparkContext has a few responsibilities – Represent the connection to a Cluster

– Used to create RDDs, accumulator and broadcast variables on the cluster

• The REPLs automatically create one for you

– In Spark 1.3 and on, the shell creates a SQL context too THE SPARK CONTEXT

(96)

Attributes:

• sc.appName: Spark application name

• sc.master: Spark Master (local, yarn-client, etc)

• sc.version: Version of Spark being used Functions:

• sc.parallelize(): create an RDD from local data

• sc.textFile(): create RDD from a text file in HDFS

• sc.stop(): stop the spark context WORKING WITH THE SPARK CONTEXT

(97)

• An Immutable collection of objects (or records) that can be operated on in parallel

– Resilient: can be recreated from parent RDDs - An RDD keeps its lineage information

– Distributed: partitions of data are distributed across nodes in the cluster

– Dataset: a set of data that can be accessed

– Each RDD is composed of 1 or more partitions - The user can control the number of partitions - More partitions => more THE RESILIENT DISTRIBUTED DATASET

(98)

• Load data from a file (HDFS, S3, Local, etc) – From a single file

rdd1 = sc.textFile(“file:/path/to/file.txt”) rdd2 =

sc.textFile(“hdfs://namenode:8020/mydata/data.txt”) – Also accepts a comma separated list of files, or a wildcard list of files

rdd3 = sc.textFile(“mydata/*.txt”)

rdd4 = sc.textFile(“data1.txt,data2.txt”) CREATE NA RDD

(99)

• The count() action returns the number of elements in the RDD

data = [5, 12, -4 , 7, 20]

rdd = sc.parallelize(data) rdd.count()

The output is: 5

ACTIONS – count()

(100)

Dataset:[5, 12, -4 , 7, 20]

rdd.first(): 5

rdd.take(3): [5, 12, -4]

rdd.saveAsTextFile(“myfile”)

SPARK ACTIONS: EXAMPLES

(101)

• Keep some elements based on a predicate

rdd=sc.parallelize([1, 2, 3, 4, 5])

rdd.filter( lambda x: x%2 == 0).collect() [2, 4]

rdd.filter( lambda x: x<3).collect() [1, 2]

TRANSFORMATIONS: filter()

(102)

Exercício 6

Iniciando com Apache Spark

(103)

Hadoop Weekend

PARTE 7 - SPARK SQL AND DATAFRAMES

(104)

• A module built on top of Spark Core

• Provides a programming abstraction for distributed processing of large-scale structured data in Spark

• Data is described as a DataFrame with rows, columns and a schema

• Data manipulation and access is available with two mechanisms – SQL Queries

– DataFrames API SPARK SQL OVERVIEW

(105)

• A DataFrame is inspired by the dataframe concept in R (dplr,

Dataframe) or Python (pandas), but stored using RDDs underneath in a distributed manner

• A DataFrame is organized into named columns – Underneath: an RDD of “Row” objects

• The DataFrame API is available in Scala, Java, Python, and R THE DATAFRAME ABSTRACTION

(106)

THE DATAFRAME VISUALLY

(107)

• DataFrames from HIVE data

– Reading/writing HIVE tables

• DataFrames from files:

– Built-in: JSON, JDBC, Parquet, HDFS

– External plug-in: CSV, HBASE, Avro, memsql, elasticsearch DATAFRAMES CAN COME FROM VARIOUS SOURCES

(108)

from pyspark.sql import HiveContext hc = HiveContext(sc)

hc.sql(“use demo”)

df1 = hc.table(“crimes”)

.select(“year”, “month”, “day”, “category”) .filter(“year > 2014”).head(5)

EXAMPLE: USING THE DATAFRAMES API

(109)

from pyspark.sql import HiveContext hc = HiveContext(sc)

hc.sql(“use demo”) df1 = hc.sql(“““

SELECT year, month, day, category FROM crimes

WHERE year > 2014”””).head(5)

SAME EXAMPLE, USING SQL SYNTAX

(110)

• Spark-SQL uses an optimization engine (Catalyst)

• Catalyst understands the structure of data & semantics of operations and performs optimizations accordingly

DATAFRAMES vs SPARK-CORE?

(111)

• Load the entire table

df=hc.table(“patients”)

• Load using a SQL Query

df1 = hc.sql(“SELECT * from patients WHERE age>50”) df2 = hc.sql(“““

SELECT col1 as timestamp, SUBSTR(date,1,4) as year, event

FROM events

WHERE year > 2014”””)

CREATING A DATAFRAME: FROM A TABLE IN HIVE

(112)

• From a JSON file

df = hc.read.json(“somefile.json”)

df = hc.read.format(“json”).load(“somefile.json”)

• From Parquet file

df = hc.read.parquet(“somefile.parquet”)

• From a CSV file:

df = hc.read.format(“com.databricks.spark.csv”)

CREATING A DATAFRAME: FROM A FILE

(113)

EXEMPLE DATAFRAMES

(114)

• first() – return the first row

• take(n) – return n rows

df1.first()

Row(age=23, cid=u’104’, name=u’Bob’, state=u’nc’)

df1.take(2)

[Row(age=45, cid=u’104’, name=u’Ram’, state=u’fl’) DATAFRAME OPERATIONS: INSPECTING CONTENT (1)

(115)

• limit(n): reduce the dataframe to n rows

– Result is still a dataframe, not a python result list

• show(n): prints the first n rows to the console DATAFRAME OPERATIONS: INSPECTING CONTENT (2)

(116)

• Remove duplicate rows

df1.distinct().show()

• Removing duplicate rows by key

df1.drop_duplicates([“name”]).show() DATAFRAME OPERATIONS: REMOVING DUPLICATES

(117)

Exercício 7

Explorando Spark SQL

(118)

• Perguntas ?

• Não deixem de acessar nosso site e se cadastrem para as promoções, vagas: www.cetax.com.br

MUITO OBRIGADO!

FINALIZANDO