
Tuesday, October 13, 2015

DFS and Optimization: Data

As with any analytics problem, let's start by getting our hands on some data. For the optimization problem, we'll need three pieces of information:

  1. Salary information per player
  2. Player information (position, league)
  3. Player statistics / metrics to measure value

Ideally, we could download this information straight from DraftKings. However, I didn't want to create an account, and the download didn't seem straightforward. I took the easier route and used Google to find someone who was already posting the data I needed.

DraftKings Salaries

It was a bit difficult to access the salary information directly on DraftKings, but RotoGuru is nice enough to post the daily data for us. Using the httr, dplyr, and stringr packages, it was easy enough to scrape the site and pull down the salary data.
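
Here's a minimal sketch of that pull. The byday.pl URL pattern and the Salary column name are assumptions on my part, so check both against RotoGuru's actual daily page:

require("httr")
require("stringr")
require("dplyr")
require("XML")

get_dk_salaries <- function(mon, day, year) {
    # hypothetical URL pattern; verify against the live RotoGuru page
    link <- paste0("http://rotoguru1.com/cgi-bin/byday.pl?game=dk",
                   "&mon=", mon, "&day=", day, "&year=", year)
    doc  <- htmlParse(content(GET(link), as = "text"), asText = TRUE)
    tbls <- readHTMLTable(doc, stringsAsFactors = FALSE)
    # keep the table that actually carries a Salary column (an assumption)
    has_sal <- sapply(tbls, function(t) !is.null(t) && any(grepl("Salary", names(t))))
    sal <- tbls[[which(has_sal)[1]]]
    sal %>%
        mutate(Salary = as.numeric(str_replace_all(Salary, "[$,]", ""))) %>%
        filter(!is.na(Salary))
}

salaries <- get_dk_salaries(10, 13, 2015)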

ESPN Game Score

Next up were some metrics and statistics for each player. My first thought was to go to ESPN; they have everything, right? Well, yes, but it wasn't easy to grab. Their daily notes section gives lots of tips on who to pick up, including a nice metric called Game Score for pitchers. Here's some code that we'll use to grab that data.
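
(A sketch only: the dailynotes URL and the "GSC" column name below are assumptions to verify against the live page.)

require("httr")
require("XML")

get_game_scores <- function(date) {
    # hypothetical daily-notes URL; check the actual page address
    link <- paste0("http://games.espn.go.com/flb/dailynotes?date=", date)
    doc  <- htmlParse(content(GET(link), as = "text"), asText = TRUE)
    tbls <- readHTMLTable(doc, stringsAsFactors = FALSE)
    # keep the pitcher table that carries a Game Score ("GSC") column
    has_gs <- sapply(tbls, function(t) !is.null(t) && any(grepl("GSC", names(t))))
    gs <- tbls[[which(has_gs)[1]]]
    gs$GSC <- as.numeric(gs$GSC)
    gs
}

pitchers <- get_game_scores("20151012")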

Fangraphs Advanced Metrics

Well, Game Score is certainly handy, but it'd be nice to have a good metric for hitters too. Since I'm a SABR person, I figured why not go for some advanced metrics? Fangraphs is a great site, with articles discussing baseball in terms of advanced metrics and an accompanying glossary for anyone unfamiliar with them. Here's the code for downloading that data:
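
(Again a sketch; the leaderboard query string below is an assumption, so grab the real one from your browser's address bar. The stats grid is taken to be the largest table on the page.)

require("httr")
require("XML")

link <- paste0("http://www.fangraphs.com/leaders.aspx?",
               "pos=all&stats=bat&lg=all&qual=y&type=8&season=2015")
doc  <- htmlParse(content(GET(link), as = "text"), asText = TRUE)
tbls <- readHTMLTable(doc, stringsAsFactors = FALSE)
# the stats grid should be the biggest table on the page
rows <- sapply(tbls, function(t) if (is.null(t)) 0 else nrow(t))
hitters <- tbls[[which.max(rows)]]
head(hitters)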

Monday, July 28, 2014

Better Fantasy Football Data for Optimization

It's been a while, but I finally got around to parsing a better data source. FFToday provides projections for multiple statistics per position, which fills the gap for Special Teams and Kickers that ESPN was not covering. In addition, you can filter to get projected total points according to several different websites' default scoring and rules. The link I used calculates total points for standard ESPN leagues, but you can easily adjust it to pull for your own league or site.

Well, let's get into the fun part. Pulling the data is relatively simple for every position; only the columns pulled vary. We'll use just two libraries: one for parsing the website and one for combining the data frames.

require("XML")
require("plyr")
options(stringsAsFactors = FALSE)

Next up, let's pull the raw data. As you'll note, the first column and first row are not useful for us, so we'll drop them and then change the column names. The last line removes some stray characters that ended up at the start of the Name field.

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=10&LeagueID=26955"

x<-readHTMLTable(link)
qb <- x[11]$'NULL'
qb <- qb[-1,-1]
names(qb)<-c("Name","Team","Bye.Week","Passing.Completions","Passing.Attempts",
             "Passing.Yards","Passing.TDs","Passing.INTs",
             "Rushing.Attempts","Rushing.Yards","Rushing.TDs","FFPTs")

qb$Name<-sapply(qb$Name,function(x) substring(x,3))

We then repeat this for the other positions.

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=20&LeagueID=26955"
x<-readHTMLTable(link)
rb <- x[11]$'NULL'
rb <- rb[-1,-1]
names(rb)<-c("Name","Team","Bye.Week","Rushing.Attempts","Rushing.Yards","Rushing.TDs",
             "Receiving.Catches","Receiving.Yards","Receiving.TDs","FFPTs")
rb$Name<-sapply(rb$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=30&LeagueID=26955"
x<-readHTMLTable(link)
wr <- x[11]$'NULL'
wr <- wr[-1,-1]
names(wr)<-c("Name","Team","Bye.Week","Receiving.Catches","Receiving.Yards","Receiving.TDs",
             "Rushing.Attempts","Rushing.Yards","Rushing.TDs","FFPTs")
wr$Name<-sapply(wr$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=40&LeagueID=26955"
x<-readHTMLTable(link)
te <- x[11]$'NULL'
te <- te[-1,-1]
names(te)<-c("Name","Team","Bye.Week","Receiving.Catches","Receiving.Yards","Receiving.TDs","FFPTs")
te$Name<-sapply(te$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=80&LeagueID=26955"
x<-readHTMLTable(link)
k <- x[11]$'NULL'
k <- k[-1,-1]
names(k)<-c("Name","Team","Bye.Week","Field.Goals.Made","Field.Goals.Attempts","Field.Goals.Pct",
             "Extra.Points.Made","Extra.Points.Attempts","FFPTs")
k$Name<-sapply(k$Name,function(x) substring(x,3))

link<-"http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=99&LeagueID=26955"
x<-readHTMLTable(link)
def <- x[11]$'NULL'
def <- def[-1,-1]
names(def)<-c("Name","Bye.Week","Sacks","Fumble.Recovered","Interceptions",
            "Defensive.TDs","Points.Against","Passing.Yards.per.Game","Rushing.Yards.per.Game",
            "Safety","Kicking.TDs","FFPTs")
def$Name<-sapply(def$Name,function(x) substring(x,3))
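
Since those six blocks differ only in the PosID and the column names, the repetition can be wrapped in a small helper like the sketch below. It's just a refactoring of the code above, not a change in behavior:

pull_position <- function(pos_id, col_names) {
    link <- paste0("http://fftoday.com/rankings/playerproj.php?Season=2014&PosID=",
                   pos_id, "&LeagueID=26955")
    x <- readHTMLTable(link)[11]$'NULL'
    x <- x[-1, -1]
    names(x) <- col_names
    x$Name <- sapply(x$Name, function(n) substring(n, 3))
    x
}

# e.g., the tight end pull above becomes:
te <- pull_position(40, c("Name","Team","Bye.Week","Receiving.Catches",
                          "Receiving.Yards","Receiving.TDs","FFPTs"))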

Lastly, we'll add a field with the position and then combine the datasets. Note that since each of the data frames has different columns, we need a function from the plyr package to combine them; it leaves NAs wherever a column wasn't present in a given data frame. We can easily fix that with the last statement.

qb$pos <- as.factor("QB")
rb$pos <- as.factor("RB")
wr$pos <- as.factor("WR")
te$pos <- as.factor("TE")
def$pos <- as.factor("DEF")
k$pos <- as.factor("K")

d <- rbind.fill(qb,rb,wr,te,def,k)
d[is.na(d)]<-0

Hooray, better data to work with in our optimization. I'm also looking into a different solver that looks to be a bit more user-friendly. Keep an eye out for a post on LocalSolver up next!

Tuesday, June 24, 2014

Downloading ESPN Fantasy Football Projections


I'm starting another series of posts that will look at determining the best fantasy football draft using optimization and game theory. First, though, we'll need some data to play with. Starting from FantasyFootballAnalytics' code, I adjusted it to download stats from the ESPN rankings page.

First up, let's load some useful packages.

require("XML")
require("stringr")
require("ggplot2")
require("plyr")


Next, I want to make sure we read in all strings as strings and not create factors from them.
options(stringsAsFactors = FALSE)

Here's the hard part. We need to go out to the website, download the data and clean it up.
clean <- function(pos) {
    link<-"http://games.espn.go.com/ffl/tools/projections?&seasonTotals=true&seasonId=2014&slotCategoryId="
    x <- readHTMLTable(paste(link,pos,sep=""))$playertable_0
    #update the column names
    names(x) <- c("position.rank","name.team.pos", "passing.completions.per.attempt", "passing.yards", "passing.tds", "passing.interceptions", "rushing.rushes", "rushing.yards", "rushing.tds", "receiving.catches", "receiving.yards", "receiving.tds", "total.points")

    #remove header row
    x <- x[-1,]

    #separate out the completions and attempts
    x$passing.completions <- str_sub(x$passing.completions.per.attempt, end=str_locate(string=x$passing.completions.per.attempt, '/')[,1]-1)
    x$passing.attempts <- str_sub(x$passing.completions.per.attempt, start=str_locate(string=x$passing.completions.per.attempt, '/')[,1]+1)
    x[,"passing.completions.per.attempt"] <- NULL

  if (pos!="16") {
    x$name <- str_sub(x$name.team.pos, end=str_locate(string=x$name.team.pos, ',')[,1]-1)
    x$team <- str_sub(x$name.team.pos, 
                      start=str_locate(string=x$name.team.pos, ',')[,1]+2, 
                      end = str_locate(string=x$name.team.pos, ',')[,1]+4)
    x$team <- str_trim(x$team, side="right")
    x$team <- toupper(x$team) 
  }  else {
    x$name <- str_sub(x$name.team.pos, end=str_locate(string=x$name.team.pos, ' ')[,1]-1)
    team.lookup<-as.data.frame(matrix(c("DEN","Broncos","GB","Packers","NO","Saints","CAR","Panthers",
                                        "WSH","Redskins","DET","Lions","IND","Colts","PHI","Eagles","OAK","Raiders",
                                        "SEA","Seahawks","SF","49ers","DAL","Cowboys","ATL","Falcons","NE","Patriots",
                                        "SD","Chargers","MIN","Vikings","CHI","Bears","KC","Chiefs","CIN","Bengals",
                                        "PIT","Steelers","NYG","Giants","ARI","Cardinals","MIA","Dolphins",
                                        "BAL","Ravens","TB","Bucaneers","CLE","Browns","HOU","Texans","STL","Rams",
                                        "BUF","Bills","NYJ","Jets","TEN","Titans","JAC","Jaguars"),ncol=2,byrow=TRUE))
    names(team.lookup)<-c("team","name")
    x<-merge(x=x,y=team.lookup,by="name")
  } 
    x[,"name.team.pos"] <- NULL

    #change from character to numeric
    for (c in names(x)) {
        if (!(c %in% c("pos","name","team"))) x[,c] <- as.numeric(x[,c])
        #replace NAs with 0
        x[is.na(x[,c]),c] <- 0
    }
    return(x)
}


Now that the function is prepped, we just need to append the appropriate query string for each position. Note that since we'll need more than 40 RBs and 40 WRs, we'll grab the top 120 players at each of those positions.

qb <- clean("0")
rb <- rbind(clean("2"),clean("2&startIndex=40"),clean("2&startIndex=80"))
wr <- rbind(clean("4"),clean("4&startIndex=40"),clean("4&startIndex=80"))
te <- clean("6")
def <- clean("16")
k <- clean("17")


Now let's add in a variable for each position.

qb$pos <- as.factor("QB")
rb$pos <- as.factor("RB")
wr$pos <- as.factor("WR")
te$pos <- as.factor("TE")
def$pos <- as.factor("DEF")
k$pos <- as.factor("K")


Finally, let's combine all of the position specific datasets.

d <- rbind(qb,rb,wr,te,def,k)

We can now rank each player by the point totals.

d$rank <- rank(-d$total.points, ties.method="min")

Finally, we can order the dataset by that rank.

d <- d[order(d$rank),]
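
As a quick sanity check, a peek at the top of the board should now show the highest-projected players:

head(d[, c("rank", "name", "pos", "total.points")])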

Well, this is a great start. Note that we have total points for Defense and Kicker but no accompanying statistics. I'm going to explore the ESPN website to see what else I can find, but this will do in the meantime.

Wednesday, December 4, 2013

Optimization in SAS: Background & Data Prep

Throughout the next few posts, I will describe various ways to tackle a typical optimization problem encountered in the marketing industry. As you might guess, a customer is eligible to receive multiple promotions at the same time but is likely to give their attention to at most one of them. Assuming you can send only one promotion, which is the best one to send? This is where optimization comes in: combined with other marketing strategies, it can help determine the "best" promotion to send each customer. My sample problem will assume that you are a local financial services company offering several types of credit cards, like Travel Rewards or Cash Rewards. As you might guess, there are various options within each card type, like incentives and APRs, that can vary per customer.

Like any analytics project, the first step is understanding your data and cleaning it up for use in your modeling process. For this exercise, I will be generating random data to feed into the optimization model. There's no rhyme or reason behind the distributions I chose; I just decided to mix it up. The model will use four data sets: product data, pricing data, customer data, and modeling data.

The first data set describes the various products to pick from and their parameters. In this instance, we will set a volume of available offers for each product.
data product_data;
    length product $12;
    infile datalines dsd;
    input product $ @13 volume;
    datalines;
TRAVEL       500
CASH Rewards 500
HOTEL        500
;
run;

The second set will define various APRs (or prices) available for each card type.
data price_data;
    input price $ volume;
    datalines;
10.99 500
;
run;

The third data set defines the list of customers available for a promotion and their current customer status.
data customer_data;
    call streaminit(123);
    do i = 1 to 1000;
        customer_id = compress(put(i,5.));
        customer_status = rand('BERNOULLI',.25);
        output;
    end;
    drop i;
run;

The last dataset will randomly assign values to represent each customer's likelihood to apply for each card product at each price point. In addition, this dataset will include the estimated profit that your bank is likely to make off each customer if they decide to apply (and get accepted) for each card. In reality, these scores are usually developed using a predictive modeling technique like linear regression or a Bayesian network.
data model_scores;
    length product $12
           price $5
    ;
    format expected_profit dollar8.2
           likelihood_to_apply 4.2
    ;
    call streaminit(123);
    do i = 1 to 1000;
        customer_id = compress(put(i,5.));
        price = '10.99';
        product = 'TRAVEL';
            expected_profit = 100*rand('UNIFORM')+1;
            likelihood_to_apply = rand('TRIANGLE',.6);
            output;
        product = 'CASH Rewards';
            expected_profit = 100*rand('UNIFORM')+1;
            likelihood_to_apply = rand('TRIANGLE',.6);
            output;
        product = 'HOTEL';
            expected_profit = 100*rand('UNIFORM')+1;
            likelihood_to_apply = rand('TRIANGLE',.6);
            output;
    end;
    drop i;
run;

Now that the data is defined, the next step is to run a simple optimization model to pick the best card for each customer. Keep an eye out for the next post, which will get into the heart of PROC OPTMODEL and optimization.