
Tuesday, October 20, 2015

DFS and Optimization: Simple Optimization

Now that we have data, let's actually get to the point of having this data: choosing which players to play each day. For our first go, we're going to take a simple approach. We'll start out with just pitchers and maximize over just one metric, game score. This will give us a refresher on how to run an optimization problem in R and create a set of code that we will be able to build out in the future.

The first step is merging the datasets that hold the two data points we need: game score and salary. Using this merged dataset, we'll create an empty Integer Program. Our decision variables will be binary variables representing whether or not we choose each player. The IP will have two constraints:

  1. We must select 2 pitchers
  2. The total salary used must be below a threshold
Using lpSolve, we can construct the problem and solve it. The results are interesting, as it's easy to see which players were the optimal selection. That being said, this is just a start.
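
As a rough sketch of what that construction looks like (assuming a merged data frame named pitchers with game.score and salary columns; the names, numbers, and salary threshold here are illustrative placeholders, not the real data):

require("lpSolve")

# Illustrative merged data: one row per pitcher with a projected game score and a salary
pitchers <- data.frame(name=c("A","B","C","D"),
                       game.score=c(65,58,52,47),
                       salary=c(12000,9800,7400,6100))

obj <- pitchers$game.score              # objective: total game score
con <- rbind(rep(1,nrow(pitchers)),     # constraint 1: number of pitchers selected
             pitchers$salary)           # constraint 2: total salary used
dir <- c("=","<=")
rhs <- c(2,20000)                       # select exactly 2 pitchers, stay under the salary threshold

sol <- lp("max", obj, con, dir, rhs, all.bin=TRUE)
pitchers[sol$solution==1,]              # the optimal pair of pitchers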


Tuesday, October 6, 2015

Daily Fantasy Sports and Optimization

One of the newest trends in fantasy sports is the Daily Fantasy Sports (DFS) leagues such as Draft Kings. These leagues allow you to draft a new team each day with a set budget to fill out your roster. Each league can be slightly different, but the general idea is the same: maximize the points each player can score for you while remaining under budget. This becomes an opportunity to find "deals" every day, so you can spend money on players likely to score you high points while taking risks on cheaper players who might score above their usual production. The next few posts will cover the data involved, analysis to determine which data is useful, and the optimization problem you can run every day to maximize output. As of right now, here are the posts I plan to write in the near future:

  1. Downloading the data from ESPN, Draft Kings, and Fangraphs.
  2. Simple optimization over pitchers
  3. Good indicators of performance in that day's games
  4. Optimization of an entire team
  5. Adding a little uncertainty to the optimization
  6. Wrap up into a Shiny app
I hope that in the end you'll pick up some useful R code and a little understanding of how you can use optimization in fantasy sports.

Tuesday, September 15, 2015

Fantasy Football Optimization and PuLP

I've written previously about how to solve a basic fantasy football optimization using lpSolve in R. That code isn't terribly easy to read and is just as hard to debug. That's why I was thrilled to find a Python package called PuLP. I found this package to be very simple and much easier to use than any of the open-source solvers in R. Using the same logic described in my previous post, I reimplemented the problem with this package. Here's the code for reference:


Wednesday, July 30, 2014

LocalSolver License

I spent the last few days translating the lpSolve code from my previous optimization problem to the LocalSolver API. I'll admit, I like the solver's syntax and the accompanying R package quite a bit. I found it incredibly easy to import data, set up constraints, and solve the problem. However, there was one big downside: I'm not a student or a teacher, and I can't afford to buy a commercial license.

Anyway, here's the code in case you have a license that isn't bound by the free version's limitations.

Tuesday, July 1, 2014

Fantasy Football Optimization Part 1

I decided to break up the problem into baby steps. This first part will deal with building out the initial structure of the optimization problem. For those that read my other post on optimization in R, I'll be using the same libraries and style for setting up this problem.

First up, let's read in the data we created in the last post. We'll add a simple column that creates a numeric ID per player.
d <- read.csv(file=paste(getwd(),"/Data/ESPN-Projections.csv", sep=""))
d$id <- as.integer(factor(paste(d$name,d$team)))

Now that the data is all set, we can load the required solver libraries.
require("lpSolve");require("lpSolveAPI");

First we set the number of teams in the league; given that, we can build a vector of team IDs.
num.teams <- 10
teams <- seq(1,num.teams)

Similarly, we can grab the number of players in our dataset and create a vector of the ids.
num.players <- length(unique(d$id))
players <- unique(d$id)

I'm going to create a data frame with the decision variables for our problem. First up is creating the cross product of all players and teams. We'll then merge in our player data (keeping position, name, and projected points) and add in a team ID.
vars <- data.frame(player.id=rep(players,num.teams))
vars <- merge(x=vars,y=d,by.y="id",by.x="player.id")
vars <- vars[,c("player.id","pos","name","total.points")]
# merge() sorts by player.id, so each player occupies num.teams consecutive rows
vars$team.id <- rep(seq(1,num.teams),num.players)

The data is set up and it's time to create the actual Integer Programming problem. Note that these decision variables are also binary, either a player is assigned to that team or he isn't.
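# One binary variable per (player, team) pair; the variable index matches the row order of vars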
ip <- make.lp(0,num.players*num.teams)
set.type(ip,seq(1,num.players*num.teams),type="binary")

The objective function is simply to maximize the total projected points. The coefficients have to line up with the variable (row) order of vars, so we take the projected points straight from vars.
set.objfn(ip,vars$total.points)
lp.control(ip,sense="max")

We need to add a constraint for each player ensuring that he is assigned to at most one team.
for (p in players) {
  add.constraint(ip,
                 rep(1,num.teams),
                 "<=",
                 1,
                 which(vars$player.id==p)
                 )
}

Now for the team constraints. First up, the positions required for each team. For simplicity, I'm using the lineup that ESPN uses in its standard league. Here are the minimum numbers of players to draft at each position:

  • 1 QB
  • 2 RB
  • 2 WR
  • 1 RB/WR/TE (Flex player)
  • 1 TE
  • 1 DEF
  • 1 K


for (t in teams) {
  #This constraint covers having at least 1 QB  
  add.constraint(ip,
                 rep(1,sum(vars$pos=="QB")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="QB")
  )
  #This constraint covers having at least 2 WR
  add.constraint(ip,
                 rep(1,sum(vars$pos=="WR")/num.teams),
                 ">=",
                 2,
                 which(vars$team.id==t & vars$pos=="WR")
  )
  #This constraint covers having at least 2 RB
  add.constraint(ip,
                 rep(1,sum(vars$pos=="RB")/num.teams),
                 ">=",
                 2,
                 which(vars$team.id==t & vars$pos=="RB")
  )
  #This constraint covers having at least 1 DEF
  add.constraint(ip,
                 rep(1,sum(vars$pos=="DEF")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="DEF")
  )
  #This constraint covers having at least 1 K
  add.constraint(ip,
                 rep(1,sum(vars$pos=="K")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="K")
  )
  #This constraint covers having at least 1 TE
  add.constraint(ip,
                 rep(1,sum(vars$pos=="TE")/num.teams),
                 ">=",
                 1,
                 which(vars$team.id==t & vars$pos=="TE")
  )
  #This constraint covers having at least 1 flex player. Note that the other constraints require at least 1 TE, 2 RB, 2 WR. In order to cover a flex player, the total sum of players from those positions needs to be at least 6.
  add.constraint(ip,
                 rep(1,sum(vars$pos=="TE",vars$pos=="RB",vars$pos=="WR")/num.teams),
                 ">=",
                 6,
                 which(vars$team.id==t & (vars$pos=="TE" | vars$pos=="RB" | vars$pos=="WR"))
  )
  #This constraint covers each team having 16 players
  add.constraint(ip,
                 rep(1,num.players),
                 "=",
                 16,
                 which(vars$team.id==t)
  )
}

Well that's it for our basic set of constraints. If you're interested in seeing what the model formulation looks like, execute the "write.lp" statement below.
write.lp(ip,paste(getwd(),"/modelformulation.txt",sep=""),type="lp",use.names=T)

Now the fun part: solving the integer program. Assuming the problem is feasible (and it is), we can then pull the objective function value and the solution.
solve(ip)
get.objective(ip)
get.variables(ip)

Although the raw solution vector is hard to read, we can simply keep the selected assignments and print them out.

sol<-vars[get.variables(ip)==1,c("name","team.id","pos")]
View(sol[order(sol$team.id,sol$pos),])

One huge downside to this approach is the lack of any actual drafting strategy or real-world complications; the problem simply divides talent evenly across teams. I particularly dislike that some teams end up with more than one kicker. No one should ever own more than one kicker.
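
One way to patch that within the current formulation (a sketch, reusing the ip model and vars data frame from above, added alongside the other team constraints before solving) is to cap kickers per team so that the "at least 1 K" requirement becomes "exactly 1":

for (t in teams) {
  #Cap each team at a single kicker
  add.constraint(ip,
                 rep(1,sum(vars$pos=="K")/num.teams),
                 "<=",
                 1,
                 which(vars$team.id==t & vars$pos=="K")
  )
}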

My next step is to either improve the formulation of this problem, probably by using some options mentioned in this Fantasy Football Analytics post, or to look at applying a different algorithm to solving this problem.

Wednesday, March 5, 2014

Optimization using R and LPSolve

I saw a recent post on OR-Exchange about which programming language is best for optimization. While I agree with @JFPuget that languages are just wrappers around various solvers, there is a learning curve behind how to use each wrapper. My previous entries covered how to program this example in SAS using OPTMODEL. I took this opportunity to write up the same example in R using lpSolve.

First up, let's setup the data.

# Card products and the single price point, each with a mailing volume cap
products<-data.frame(
   product=c("TRAVEL","CASH Rewards","HOTEL"),
   volume=rep(500,3))
prices<-data.frame(
   price=c("10.99"),
   volume=c(500))
# 1,000 customers with a simulated status flag
set.seed(123)
customers<-data.frame(
   id=seq(1,1000),
   customer_status=rbinom(1000,1,0.25))
# Simulated model scores (expected profit and likelihood to apply) for every customer/product/price combination
set.seed(123)
require("triangle")
model.scores<-rbind(
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("TRAVEL",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)),
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("CASH Rewards",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)),
   data.frame(
      id=seq(1,1000),
      price=rep("10.99",1000),
      product=rep("HOTEL",1000),
      expected.profit=runif(1000,1,100),
      likelihood.to.apply=rtriangle(1000,0.0,1.0,0.6)))

Next up, the optimization code. I will note that this is somewhat confusing, but I will break it up into sections similar to what I did with the SAS version. I will be using these two libraries for lpSolve: the first provides access to the solver itself, while lpSolveAPI is what actually makes it relatively easy to code.

require("lpSolve");require("lpSolveAPI")

The first step is making an empty optimization problem named ip. I start with zero constraints and add in the number of decision variables required.

ip <- make.lp(0,nrow(customers)*nrow(products)*nrow(prices))

Next I declare each decision variable as binary.

set.type(ip,seq(1,nrow(customers)*nrow(products)*nrow(prices)),type="binary")

This next part will seem like a relatively pointless step, but it will help in adding constraints. This data.frame contains every combination of customer, product, and price available. This will provide the index for assigning values in constraints and the objective function.

combos <- expand.grid(prices$price,products$product,customers$id)
names(combos)<-c('price','product','id')
# Give both data frames matching row names so model.scores can be re-ordered to match combos
rownames(combos)<-do.call(paste, c(combos[c("id", "product","price")], sep = "_"))
rownames(model.scores)<-do.call(paste, c(model.scores[c("id", "product","price")], sep = "_"))
# The decision variables follow the row order of combos, so the objective coefficients must as well
model.scores.ordered<-model.scores[order(match(rownames(model.scores),rownames(combos))),]

By default, the objective function is set to minimize. First I will set the coefficients of the decision variables in the objective function and then change it to a maximization problem.

set.objfn(ip,model.scores.ordered$expected.profit*model.scores.ordered$likelihood.to.apply)
lp.control(ip,sense="max")

Here are two constants we will use in defining the constraints.

prod.per.person<-1
price.per.product<-1

First up, let's add in the constraints around the number of products per person.

for (c in customers$id) {
   add.constraint(ip,
                  rep(1,nrow(products)*nrow(prices)),
                  "<=",
                  prod.per.person,
                  which(combos$id==c)
   )
}

The next set of constraints will limit the number of price points per product per customer.

for (c in customers$id) {
   for (p in products$product) {
      add.constraint(ip,
                     rep(1,nrow(prices)),
                     "<=",
                     price.per.product,
                     which(combos$product==p & combos$id==c)
      )
   }
}

The next set of constraints enforces the volume limits for each price point.

for (q in prices$price) {
   add.constraint(ip,
                  rep(1,nrow(customers)*nrow(products)),
                  "<=",
                  prices$volume[which(q==prices$price)],
                  which(combos$price==q)
   )
}

The next set of constraints enforces the volume limits for each product.

for (p in products$product) {
   add.constraint(ip,
                  rep(1,length(which(combos$product==p))),
                  "<=",
                  products$volume[which(p==products$product)],
                  which(combos$product==p)
   )
}

Finally, let's clean up the formulation a little and assign names to the decision variables and constraints.

colnames <- character()
for (c in customers$id) {
   for (p in products$product) {
      for (q in prices$price) {
         colnames<-c(colnames,paste(c,p,q,sep="_"))
      }
   }
}
rownames <- character()
for (c in customers$id) {
   rownames<-c(rownames,paste("prod_per_cust",c,sep="_"))
}
for (c in customers$id) {
   for (p in products$product) {
      rownames<-c(rownames,paste("price_per_prod",c,p,sep="_"))
   }
}
for (q in prices$price) {
   rownames<-c(rownames,paste("price_vol",q,sep="_"))
}
for (p in products$product) {
   rownames<-c(rownames,paste("prod_vol",p,sep="_"))
}

dimnames(ip) <- list(rownames, colnames)

It's possible to write out the formulation in case you would like to review it. Note that this can become overwhelming as any of the three dimensions of the problem grows.

write.lp(ip,"modelformulation.txt",type="lp",use.names=T)

Last, but not least, let's solve the problem. After solving, the following commands display the objective value, the decision variable values, and the constraint values.

solve(ip)
get.objective(ip)
get.variables(ip)

get.constraints(ip)
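
To see which offers were actually selected, the solution vector can be mapped back onto combos, whose row order matches the decision variables (a small sketch along the lines of the fantasy football post):

# Keep only the customer/product/price combinations the solver chose to mail
selected <- combos[get.variables(ip)==1,]
head(selected[order(selected$id),])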

I'm sure there are more elegant ways of writing this code, but it remains a relatively painful way to set up an optimization problem. I'll skip the visualization part for now. I'm going to write up a Python version next and then hope to write a comparison of the three versions.

Thursday, January 23, 2014

Optimization in SAS: Results

After spending all that time writing up code, next comes the fun part and what we really get paid for in the consulting world. I'm going to walk through three pieces of the results after hitting "run". The three pieces are:
  1. The Log: This contains information about the compilation of the code, including errors and high-level results.
  2. The Results Window (Tab if you are in SAS EG): Contains the default results of having written a basic PROC OPTMODEL.
  3. Graphing the results: Some bonus code at the end to make some sense of the results.
First up, the log produced from the program. SAS uses the log to print out the model description and the high-level results. I'll skip the boring parts of the log where it reprints the code and get right to the important part. This set of "notes" describes what was created as the optimization formulation. It also gives the stats around "presolving" the problem and the solution status. SAS "presolves" the optimization problem by removing extraneous pieces like redundant variables and non-binding constraints.

NOTE: The problem has 3000 variables (0 free, 0 fixed).
NOTE: The problem has 3000 binary and 0 integer variables.
NOTE: The problem has 4004 linear constraints (4004 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 12000 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver removed 0 variables, 3003 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolved problem has 3000 variables, 1001 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolver removed 8000 linear constraint coefficients, leaving 4000.
NOTE: The OPTMILP presolver value AUTOMATIC is applied.
NOTE: The OPTMILP presolver removed 0 variables and 0 constraints.
NOTE: The OPTMILP presolver removed 0 constraint coefficients.
NOTE: The OPTMILP presolver modified 0 constraint coefficients.
NOTE: The presolved problem has 3000 variables, 1001 constraints, and 4000 constraint coefficients.
NOTE: The MIXED INTEGER LINEAR solver is called.
          Node  Active    Sols    BestInteger      BestBound      Gap    Time
             0       1       2  41528.0754238  41528.0754238    0.00%       1
             0       0       2  41528.0754238              .    0.00%       1
NOTE: Optimal.
NOTE: Objective = 41528.0754.

Hooray: the model not only solved, it found an optimal solution. The next step is to check the Results window, which is where SAS puts all printouts from the procedures used in the program. There are three sections in the results for this program. The first is the test print we used to verify that our data was read into the model correctly.


The next part shows the problem summary. You'll notice that it looks like the output in the log, but forced into a table format. I don't use this table very often other than to verify the size of the problem, which can be helpful to note in reference to the run time.

[Problem Summary table]

The last piece shows the solution summary. This gives you some of the relevant information you need to evaluate your model.

[Solution Summary table]


The last step for me is taking a stab at visualizing the results produced by the optimization model. Since this optimization problem is relatively simple, I want to color code who was mailed and who wasn't for each of the products. Since my objective function is the product of Profit and the Likelihood To Apply, I use each as an axis inside of the chart. Below is the simple macro I use to build the chart for each product. First I use SQL to generate a dynamic header for the chart with the total expected profit for that product. I then use SGPLOT to draw the scatter plot. 

proc format;
    value mail
        1="Mail"
        0="No Mail"
    ;
run;

%macro graph_results(product);
goptions device=svg;
ods graphics on / antialiasmax=10000;
proc sql noprint;
    select sum(e_profit*lta) format=dollar12. 
        into :profit 
        from results 
        where mail=1
        and product = "&product"
    ;
quit;

proc sgplot data=results;
    title1 "Optimization Results for &product.";
    title2 "Total Expected Profit = &profit.";
    where product = "&product";
    format mail mail.;
    scatter x=e_profit y=lta / 
        group=mail 
        markerattrs=(size=5 symbol=circlefilled)
        transparency=0.7
        name="scatter"
    ;
    keylegend "scatter" / across=1 position=topleft noborder location=inside ;
    xaxis label="Expected Profit" values=(0 to 100 by 20) display=(noticks);
    yaxis label="Likelihood to Apply" values=(0 to 1 by .1) display=(noticks);
run;
title;
ods graphics off;
goptions reset=all;
%mend;

%graph_results(TRAVEL);

[Scatter plot: Optimization Results for TRAVEL]

Note that it isn't the greatest picture, but it gives you an idea of the solution. I hope to give the same visualization a shot in Tableau to demonstrate that without coding up a chart, you can still get a nice visualization.

Sunday, December 8, 2013

Optimization in SAS: Proc OPTMODEL

Now that the data has been prepped, it's time to build the optimization model. Let's take a step back and formulate the problem first. Every constraint-based optimization problem is divided into three parts: the objective function, the decision variables, and the constraints.

Remember that the problem is to identify the best promotion available for each person. The decision variable is binary, whether a person will receive a specific promotion or not. In this example, a specific promotion is identified by a combination of credit card type and APR.

The objective we will maximize is our idea of profit. Profit is made up of two pieces of information that are typically modeled:

  • Expected profit: the profit (over a period of time) that can be earned from the customer if they open a specific credit card.
  • Likelihood to apply: the likelihood that a customer will apply for a particular credit card if they receive a promotion from us.

The resulting objective function is the sum, over all customers and promotions, of expected profit times likelihood to apply, counted only for the promotions we choose to mail.

For this first example, we will keep our list of constraints short and simple. Here's the list of constraints we will implement in this first iteration:

  1. Limit the number of promotions that each customer can receive per campaign.
  2. Each credit card that is mailed to a customer needs to have only one price.
  3. Limit the number of total promotions mailed per card type and price point. 
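
Putting the objective and these three constraints together, with a binary variable x[c,p,q] indicating whether customer c is mailed product p at price q, the formulation is roughly the following (a sketch in standard integer-programming notation, using the parameter names defined later in the code):

\[
\begin{aligned}
\max \quad & \sum_{c,p,q} \text{expected\_profit}_{c,p,q} \cdot \text{likelihood\_to\_apply}_{c,p,q} \cdot x_{c,p,q} \\
\text{s.t.} \quad & \sum_{p,q} x_{c,p,q} \le \text{NUM\_PRODUCTS\_PER\_CUSTOMER} && \text{for each customer } c \\
& \sum_{q} x_{c,p,q} \le 1 && \text{for each customer } c \text{ and product } p \\
& \sum_{c,q} x_{c,p,q} \le \text{product\_volume}_{p} && \text{for each product } p \\
& \sum_{c,p} x_{c,p,q} \le \text{price\_volume}_{q} && \text{for each price } q \\
& x_{c,p,q} \in \{0,1\}
\end{aligned}
\]
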
Now that we have the formulation out of the way, let's code! I will walk you through my recommended approach to using OPTMODEL. I've found this style the easiest to follow and to explain to colleagues.

The first step in using OPTMODEL is to define the list of "iterators" that the optimization model will use. An iterator is a set of items that can be looped over when defining constraints or reading in data. Although not strictly necessary, iterators certainly help keep your code short and easy to review. For this problem, we will define three iterators:

  1. Customers: the list of customers to analyze inside of this campaign
  2. Products: the list of credit card types to be promoted
  3. Prices: the list of various price points for each of the credit card types
proc optmodel;
    set <str> customers;
    set <str> products;
    set <str> prices;

The next step is defining all of the data that will be used inside of the optimization model. Here are the variables we will be reading in:
  1. Expected_profit: defined as the expected profit earned on a credit card at that price for a particular customer
  2. Likelihood_to_apply: defined as the likelihood that a customer will apply for a particular credit card at that price
  3. Product_volume: defined as the number of promotions that can be sent for each credit card
  4. Price_volume: defined as the number of promotions that can be sent for each price 
  5. NUM_PRODUCTS_PER_CUSTOMER: a constant that defines the number of credit card promotions that can be sent to a person.
    number expected_profit {customers, products, prices};
    number likelihood_to_apply {customers, products, prices};
    number product_volume {products};
    number price_volume {prices};

    number NUM_PRODUCTS_PER_CUSTOMER = 1;

The next step is to populate those variables with actual data. Given the data we produced back in my last post, we will use the "read data" statement to get data into our optimization model. Since we have four datasets, we will use four separate "read data" statements. A "read data" statement can be constructed as follows:

    read data datasetname into iterator=[keycolumn] parameter=col("valuecolumn");

We will start with the product_data dataset. Since this dataset is unique at the product level, we can use it to set values for the products iterator we defined above. Note that the only variable that is defined at the product level is product_volume.

    read data product_data into products=[product] product_volume=col("volume");

The next dataset to read in holds the data related to the prices. Note that the only variable defined at the price level is price_volume.

    read data price_data into prices=[price] price_volume=col("volume");

Note that we don't use any customer-level data in the model, but we still need a way to identify each customer.

    read data customer_data into customers=[customer_id];

Last but not least, we need to read in the two model scores. Since this dataset is unique at the combination of customer, product, and price, we put all three indices in the brackets. Note that because all three iterators were already defined, we don't need to spell out which column maps to which iterator; we just list the columns in the same order the iterators were defined.

    read data model_scores into [customer_id product price] likelihood_to_apply expected_profit;

As with any model, I like to verify that the data is correct and read in properly. The following print statement will print out the two variables for the first customer (with ID='1').

    print {c in customers,p in products,q in prices: c='1'} likelihood_to_apply
          {c in customers,p in products,q in prices: c='1'} expected_profit
    ;

Now that the data has been read in, let's define the model. First up, our decision variable: we need to decide, for each person, whether to mail them each product at each price.

    var mail {customers, products, prices} binary;

Next is to define the objective function, profit. I define profit as the sum, over all customers, credit cards, and prices, of expected profit times the likelihood to apply times whether the customer receives that promotion.

    max profit = sum{c in customers, p in products, q in prices} 
        likelihood_to_apply[c,p,q]*expected_profit[c,p,q]*mail[c,p,q];

Next I will define each of the constraints mentioned above. The first constraint limits the number of promotions a customer can receive. The second one limits the number of price points per credit card per customer. The third and fourth constraints limit the number of promotions that can be mailed per price point and per credit card, respectively.

    constraint PRODUCT_PER_CUST{c in customers}:
        sum{p in products, q in prices} mail[c,p,q] <= NUM_PRODUCTS_PER_CUSTOMER;

    constraint PRICE_PER_PROD{c in customers,p in products}:
        sum{q in prices} mail[c,p,q] <= 1;

    constraint CONS_price_volume{q in prices}:
        sum{c in customers,p in products} mail[c,p,q] <= price_volume[q];

    constraint CONS_product_volume{p in products}:
        sum{c in customers,q in prices} mail[c,p,q] <= product_volume[p];

Now the easy part: solve the problem. There are multiple solver options inside of SAS, but I will start out simply and just call the solver for mixed integer linear programming problems.

    solve with milp;

Given the nature of solvers, I will clean up the solution to eliminate any potential rounding errors.

    for {c in customers, p in products, q in prices} mail[c,p,q] = round(mail[c,p,q],1E-4);

I will also create a dataset with the results of the optimization problem. This "create data" statement will create a dataset with three ID columns: customer_id, product, and price. I will then append in three data columns: lta, e_profit and mail.

    create data results from
        [customer_id product price] = {c in customers, p in products, q in prices}
        lta = likelihood_to_apply[c,p,q]
        e_profit = expected_profit[c,p,q]
        mail = mail[c,p,q]
    ;
quit;

Well, that was pretty long, but it will serve as the basis for future enhancements and explorations of the problem. You can find the full listing of code here. My next post will explore the log and the results of this program.

Monday, December 2, 2013

SAS and Optimization

Like most people, I would prefer to use optimization software other than SAS. However, it is possible to decrease friction in your analytics workflow by keeping everything inside of SAS. Typical modeling processes use SAS procedures to train and score predictive models, and those scores can then be used as the input to an optimization model to help with the decision process.

SAS has a great product for optimization in the marketing industry, SAS Marketing Optimization. Through my work with that product, I have found that it lacks the flexibility to incorporate various strategies that are common in campaign planning. This series of blog posts will introduce the SAS procedure OPTMODEL, which allows you to define an optimization problem and choose the solver with which to solve it.