BIGDATA MRJOB AND PIG LATIN SCRIPTS

Karan Choudhary
3 min readOct 9, 2020

Create two data files with names and datasets as shown below and then implement the following queries using mrjob package in python for map reduce programming and using pigalso for the same.Compare the two processing types.

We have dataset of employee and expenses in form of txt

Employee.txt

101,Abhay,20000,1

102,Shiv,10000,2

103,Aarav,11000,3

104,Anubhav,5000,4

105,Palash,2500,5

106,Aman,25000,1

107,Sahil,17500,2

108,Ram,14000,3

109,Karan,1000,4

110,Priya,2000,5

111,Tushar,500,1

112,Ajay,5000,2

113,Jay,1000,1

114,Maddy,2000,2

Expenses.txt

101,200

102,100

110,400

114,200

119,200

105,100

101,100

104,300

102,300

Q1 Top 5 employees (employeeid and employee name) with highest rating. (In casetwoemployees have same rating, employee with name coming first in dictionary shouldgetpreference).

MRJOB COMMANDS

PIG LATIN COMMANDS

employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);

Emp_desc_sort = ORDER employee_info BY emp_rating DESC; //sort in descending order

Emp_top_5 = LIMIT Emp_desc_sort 5; //top 5 emp with rating

employees_id_name = FOREACH top_5_employees GENERATE emp_id, emp_name;

DUMP employees_id_name;

OUTPUT

Q2 Top 3 employees (employee id and employee name) with highest salary, whoseemployee idis an odd number. (In case two employees have same salary, employee with namecoming firstin dictionary should get preference)

MRJOB

PIG LATIN COMMAND

employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);

odd_emp_id = FILTER employee_info BY emp_id%2 == 1; //for checking odd condition

emp_sort_salary = ORDER odd_emp_id BY emp_salary DESC; //sorting descending salary

Sal_top3_emp = LIMIT emp_sort_salary 3; //printing the salry of first 3 emp

Dump Sal_top3_emp;

OUTPUT

Q3

Employee (employee id and employee name) with maximum expense (In casetwoemployees have same expense, employee with name coming first in dictionaryshould getpreference).

MRJOB

PIG LATIN COMMANDS

employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);

emp_expenses = LOAD ‘employee_expenses.txt’ AS (emp_id:int, expense:int);

Join_emp = JOIN employee_info BY emp_id, emp_expenses BY emp_id;

grouped_emp_info = GROUP Join_emp ALL;

max_expense = FOREACH emp_info GENERATE MAX (Join_emp. emp_expenses::emp_expense) AS expense;

max_expense_emp = FILTER Join_emp BY emp_expenses::expense == max_expense.expense;

DUMP max_expense_emp;

OUTPUT

Q4List of employees (employee id and employee name) having entries in expenses.txt.

MRJOB

PIG ATIN COMMANDS

employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);

emp_expenses = LOAD ‘employee_expenses.txt’ AS (emp_id:int, expense:int);

JOIN= JOIN employee_info BY emp_id, emp_expenses BY emp_id;

emp_information = FOREACH JOIN GENERATE emp_info::emp_id, employee_details::emp_name;

final_join = DISTINCT emp_information; //UNIQUE

DUMP final_join;

OUTPUT

Q 5.List of employees (employee id and employee name) having no entry in expenses.txt.

employee_info = LOAD ‘gs://piglatinquery/employee.txt’ USING PigStorage(‘,’) AS (emp_id:int,emp_name:chararray,emp_salary:int,emp_rating:int);

emp_details_LEFTJOIN = JOIN employee_info BY emp_id LEFT OUTER, emp_expenses BY emp_id;

JOIN_details = FILTER emp_details_LEFTJOIN BY emp_expenses::emp_id IS NULL;

emp_details= FOREACH JOIN_ details GENERATE employee_info:: emp_id, employee_info::emp_name;

DUMP emp_details;

--

--