ABOUT DATASET

The dataset consists of various items bought by users across many transactions. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 7,501 rows with 20-plus unique items across the transactions. For each transaction we have the sequence of products purchased in that order. From this we also want to work out which products sell better alongside other products.


A recommender system is an information filtering tool that seeks to predict which product a user will like and, based on that, recommends a few products to the user. For example, Amazon can recommend new shopping items to buy, Netflix can recommend new movies to watch, and Google can recommend news that a user might be interested in. The two widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF).

There are three main types of recommender systems:

  • Collaborative Filtering
  • Content-Based Filtering
  • Hybrid Recommendation Systems

Collaborative Filtering

Collaborative filtering methods are based…
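The original passage is cut off here, so as a minimal sketch of the idea: in user-based collaborative filtering, a user's unknown rating for an item is predicted from the ratings of similar users, weighted by similarity. The toy ratings matrix and numbers below are illustrative, not from the article's dataset:

```python
import numpy as np

# toy user-item ratings matrix (rows = users, cols = items, 0 = unrated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    # cosine similarity between two rating vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# similarity of user 2 to every user
target = 2
sims = np.array([cosine_sim(ratings[target], ratings[u]) for u in range(len(ratings))])

# predict user 2's score for item 1 (unrated) as a similarity-weighted
# average of the other users' ratings for that item
others = [u for u in range(len(ratings)) if u != target]
item = 1
pred = sum(sims[u] * ratings[u, item] for u in others) / sum(sims[u] for u in others)
print(round(pred, 2))  # → 4.5
```

Real systems replace the toy matrix with millions of sparse rows and use neighbourhood truncation or matrix factorization, but the weighted-average core is the same.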


Implementation of linear regression through the PySpark library in Databricks. Before starting the implementation, we must be familiar with the Databricks platform.

This is the homepage of Databricks, a platform developed specially for running Spark, and it is free of cost. Before starting a session we have to create a cluster and then configure its properties, such as RAM and other requirements, based on the workload we want to run.

In Databricks we can use Spark RDDs, SQL, MLlib, and GraphX. …


FEATURE SELECTION TECHNIQUES
1. Variance threshold
→ constant features
→ quasi-constant features
→ duplicate features
2. Fisher score
3. Chi-square test
4. Mutual information

# In[1]:

# importing the libraries
import pandas as pd
import numpy as np

# In[2]:

bc_data = pd.read_csv('breast_cancer.csv')

# In[8]:

bc_data.head()

# In[10]:

bc_data.shape

# In[5]:

bc_data.columns

# In[6]:

bc_data.info()

# In[7]:

bc_data.corr()

# In[11]:

X = bc_data.drop(labels=['diagnosis'], axis=1)
y = bc_data['diagnosis']

# In[13]:

X.shape

# In[15]:

y.shape

# In[16]:

# splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=0)

# In[22]:

# shapes of the training and testing sets
print('Independent variable \nTraining ->', X_train.shape, 'Testing ->', X_test.shape)
print('Dependent variable \nTraining ->', y_train.shape, 'Testing ->', y_test.shape)

# In[24]:

# 1. VARIANCE THRESHOLD METHOD

# In[25]:

# Removing the constant features…
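The notebook is truncated at this point. As a sketch of the constant-feature step, scikit-learn's `VarianceThreshold` can drop zero-variance columns; the tiny DataFrame below is illustrative, not the breast-cancer data:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# toy frame: 'const' never varies, 'quasi' barely varies, 'signal' does
df = pd.DataFrame({
    'const':  [1, 1, 1, 1, 1],
    'quasi':  [0, 0, 0, 0, 1],
    'signal': [3, 1, 4, 1, 5],
})

# threshold=0 removes only the exactly-constant features
vt = VarianceThreshold(threshold=0)
vt.fit(df)
kept = df.columns[vt.get_support()].tolist()
print(kept)  # → ['quasi', 'signal']
```

Raising the threshold (e.g. `threshold=0.01`) would also catch quasi-constant features like `quasi`, which is the second sub-technique in the list above.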


Create two data files with the names and contents shown below, then implement the following queries using the mrjob package in Python for MapReduce programming, and also using Pig for the same. Compare the two processing types.

We have datasets of employees and expenses in the form of txt files.

Employee.txt

101,Abhay,20000,1

102,Shiv,10000,2

103,Aarav,11000,3

104,Anubhav,5000,4

105,Palash,2500,5

106,Aman,25000,1

107,Sahil,17500,2

108,Ram,14000,3

109,Karan,1000,4

110,Priya,2000,5

111,Tushar,500,1

112,Ajay,5000,2

113,Jay,1000,1

114,Maddy,2000,2

Expenses.txt

101,200

102,100

110,400

114,200

119,200

105,100

101,100

104,300

102,300

Q1. Top 5 employees (employee id and employee name) with the highest rating. (In case two employees have the same rating, the employee whose name comes first in dictionary order should get preference.)

MRJOB COMMANDS
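The mrjob commands themselves are not shown here. Assuming the fourth comma-separated field of Employee.txt is the employee's rating (the article does not label the columns), the Q1 logic can be sketched in plain Python; an mrjob reducer would apply the same sort key after the map phase emits `(rating, name, id)` tuples:

```python
# Employee.txt records: id,name,salary,rating (rating assumed to be the 4th field)
records = """101,Abhay,20000,1
102,Shiv,10000,2
103,Aarav,11000,3
104,Anubhav,5000,4
105,Palash,2500,5
106,Aman,25000,1
107,Sahil,17500,2
108,Ram,14000,3
109,Karan,1000,4
110,Priya,2000,5
111,Tushar,500,1
112,Ajay,5000,2
113,Jay,1000,1
114,Maddy,2000,2""".splitlines()

rows = []
for line in records:
    emp_id, name, _salary, rating = line.split(',')
    rows.append((emp_id, name, int(rating)))

# sort by rating descending, then by name ascending to break ties
top5 = sorted(rows, key=lambda r: (-r[2], r[1]))[:5]
for emp_id, name, rating in top5:
    print(emp_id, name, rating)
```

On this data the top 5 come out as Palash, Priya, Anubhav, Karan, Aarav.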


Importing the libraries.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

Reading the dataset

df = pd.read_csv("heart.csv")

df.head() # printing the first 5 rows of all the columns

Dataset with independent and dependent variables:

Column   | Description                                                    | Role
age      | age in years                                                   | Independent Variable
sex      | (1 = male; 0 = female)                                         | Independent Variable
cp       | chest pain type                                                | Independent Variable
trestbps | resting blood pressure (in mm Hg on admission to the hospital) | Independent Variable
chol     | serum cholesterol in mg/dl                                     | Independent Variable
fbs      | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)        | Independent Variable
restecg  | resting electrocardiographic results                           | Independent Variable


NASA ASTEROID PREDICTION DATASET FROM KAGGLE. TOPICS TO BE COVERED:

1. KNN

2. SVC

3. RANDOM FOREST CLASSIFIER

4. ADA BOOST

5. GRADIENT BOOSTING

6. XGBOOST

7. FEATURE SELECTION, CROSS VALIDATION, HYPERPARAMETER TUNING, ROC AUC GRAPH FOR ALL THE MODELS

8. ARTIFICIAL NEURAL NETWORK (DEEP LEARNING)

In [ ]:

# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import metrics

# importing dataset
dataset = pd.read_csv("www.kaggle.com/karanchoudhary103/nasa_as/nasa.csv")

# import time for comparing tuning methods
from time import time
start = 0

# dropping dataset columns
dataset.drop(['Close Approach Date'], axis=1, inplace=True)
dataset.drop(['Orbiting Body'], axis=1, inplace=True)
dataset.drop(['Orbit Determination Date'], axis=1, inplace=True)
dataset.drop(['Equinox'], axis=1, inplace=True)
dataset.drop(dataset.iloc[:, [0,1,2,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]], axis=1, inplace=True)

# splitting into feature matrix and target variable
X = dataset.iloc[:, :-1]
Y = dataset.iloc[:, -1]

In [ ]:
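The remaining cells are not shown. As a minimal, self-contained sketch of topics 3 and 7 (a random forest scored with cross-validated ROC AUC), here is the same pattern on synthetic data, since the Kaggle file itself is not bundled with the article:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the asteroid features / 'Hazardous' target
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated ROC AUC, as in topic 7 of the list above
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='roc_auc')
print(scores.mean())
```

Swapping `clf` for `KNeighborsClassifier`, `SVC(probability=True)`, `AdaBoostClassifier`, and so on reproduces the per-model comparison the topic list describes.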


Clustering algorithms are a powerful technique for machine learning on unlabeled data. The most common clustering algorithms in machine learning are hierarchical clustering and K-Means clustering. These two algorithms are incredibly powerful when applied to different machine learning problems.

Cluster analysis can be a powerful data-mining tool for any organization that needs to identify discrete groups of customers, sales transactions, or other types of behaviors and things. For example, insurance providers use cluster analysis to detect fraudulent claims, and banks use it for credit scoring.

These algorithms are mostly used in:

  1. Identification of fake news
  2. Spam filtering
  3. Marketing and sales
  4. Classifying network traffic
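As a concrete sketch of the K-Means side, here is the algorithm run on synthetic blobs (the data and cluster count are illustrative, not tied to any of the use cases above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated synthetic clusters
X_blobs, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# K-Means with k=3; n_init restarts guard against bad initial centroids
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X_blobs)
print(len(set(labels)))  # → 3 distinct clusters found
```

In practice k is not known in advance; the elbow method on `km.inertia_` or silhouette scores are the usual ways to choose it.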


Artificial Intelligence directly translates to conceptualizing and building machines that can think and are hence independently capable of performing tasks, thus exhibiting intelligence. Whether this advancement in technology is a boon or a bane to humans and our surroundings is a never-ending debate.

Every coin has two faces, so it is difficult for us as human beings to make a quick judgement about this technology. …


A Self-Organizing Map (SOM) is an unsupervised neural network machine learning technique. SOM is used when the dataset has a lot of attributes, because it produces a low-dimensional (most of the time two-dimensional) output. The output is a discretised representation of the input space called a map.

How does SOM work?

The points in the input space have corresponding points in the output space. In Kohonen networks, a kind of SOM, there is a single layer with two dimensions, and the input points are fully connected to the neurons on this layer.

At the start of the self-organization process, the weights are initialized with random values. After that…
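The training description is cut off here, so as a minimal numpy sketch of the step just described: initialize the weights randomly, find the best matching unit (BMU) for each input, and pull the BMU and its grid neighbours toward that input. All sizes, rates, and the Gaussian neighbourhood choice below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

grid_w, grid_h, dim = 5, 5, 3                # 5x5 output map, 3-dimensional inputs
weights = rng.random((grid_w, grid_h, dim))  # random weight initialization

def train_step(x, lr=0.5, radius=1.0):
    # best matching unit: the neuron whose weight vector is closest to x
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # pull the BMU and its neighbours toward the input, with influence
    # decaying with distance on the output grid
    for i in range(grid_w):
        for j in range(grid_h):
            grid_dist2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
            influence = np.exp(-grid_dist2 / (2 * radius ** 2))
            weights[i, j] += lr * influence * (x - weights[i, j])
    return bmu

# one pass over a few random input points
data = rng.random((20, dim))
for x in data:
    train_step(x)
```

A full SOM also decays `lr` and `radius` over epochs so the map first organizes globally and then fine-tunes locally.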

Karan Choudhary

Intern at TSF | Writer at Data Driven Investors and Startups | Data Science Enthusiast | Frontend Developer
