BIGDATA MRJOB AND PIG LATIN SCRIPTS
Create two data files with names and datasets as shown below and then implement the following queries using mrjob package in python for map reduce programming and using pigalso for the same.Compare the two processing types.
We have dataset of employee and expenses in form of txt
Employee.txt
101,Abhay,20000,1
102,Shiv,10000,2
103,Aarav,11000,3
104,Anubhav,5000,4
105,Palash,2500,5
106,Aman,25000,1
107,Sahil,17500,2
108,Ram,14000,3
109,Karan,1000,4
110,Priya,2000,5
111,Tushar,500,1
112,Ajay,5000,2
113,Jay,1000,1
114,Maddy,2000,2
Expenses.txt
101,200
102,100
110,400
114,200
119,200
105,100
101,100
104,300
102,300
Q1 Top 5 employees (employeeid and employee name) with highest rating. (In casetwoemployees have same rating, employee with name coming first in dictionary shouldgetpreference).
MRJOB COMMANDS
PIG LATIN COMMANDS
employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);
Emp_desc_sort = ORDER employee_info BY emp_rating DESC; //sort in descending order
Emp_top_5 = LIMIT Emp_desc_sort 5; //top 5 emp with rating
employees_id_name = FOREACH top_5_employees GENERATE emp_id, emp_name;
DUMP employees_id_name;
OUTPUT
Q2 Top 3 employees (employee id and employee name) with highest salary, whoseemployee idis an odd number. (In case two employees have same salary, employee with namecoming firstin dictionary should get preference)
MRJOB
PIG LATIN COMMAND
employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);
odd_emp_id = FILTER employee_info BY emp_id%2 == 1; //for checking odd condition
emp_sort_salary = ORDER odd_emp_id BY emp_salary DESC; //sorting descending salary
Sal_top3_emp = LIMIT emp_sort_salary 3; //printing the salry of first 3 emp
Dump Sal_top3_emp;
OUTPUT
Q3
Employee (employee id and employee name) with maximum expense (In casetwoemployees have same expense, employee with name coming first in dictionaryshould getpreference).
MRJOB
PIG LATIN COMMANDS
employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);
emp_expenses = LOAD ‘employee_expenses.txt’ AS (emp_id:int, expense:int);
Join_emp = JOIN employee_info BY emp_id, emp_expenses BY emp_id;
grouped_emp_info = GROUP Join_emp ALL;
max_expense = FOREACH emp_info GENERATE MAX (Join_emp. emp_expenses::emp_expense) AS expense;
max_expense_emp = FILTER Join_emp BY emp_expenses::expense == max_expense.expense;
DUMP max_expense_emp;
OUTPUT
Q4List of employees (employee id and employee name) having entries in expenses.txt.
MRJOB
PIG ATIN COMMANDS
employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);
emp_expenses = LOAD ‘employee_expenses.txt’ AS (emp_id:int, expense:int);
JOIN= JOIN employee_info BY emp_id, emp_expenses BY emp_id;
emp_information = FOREACH JOIN GENERATE emp_info::emp_id, employee_details::emp_name;
final_join = DISTINCT emp_information; //UNIQUE
DUMP final_join;
OUTPUT
Q 5.List of employees (employee id and employee name) having no entry in expenses.txt.
employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);
emp_details_LEFTJOIN = JOIN employee_info BY emp_id LEFT OUTER, emp_expenses BY emp_id;
JOIN_details = FILTER emp_details_LEFTJOIN BY emp_expenses::emp_id IS NULL;
emp_details= FOREACH JOIN_ details GENERATE employee_info:: emp_id, employee_info::emp_name;
DUMP emp_details;