[R] The 'subset matching' challenge

Detlef Steuer detlef.steuer at hsu-hamburg.de
Thu Oct 29 16:53:20 CET 2009


If you look for "subset sum problem", you will find relevant 
information.

A starter:
http://en.wikipedia.org/wiki/Subset_sum_problem

Detlef

On Thu, 29 Oct 2009 15:47:22 +0100
Yvonnick Noël <yvonnick.noel at uhb.fr> wrote:

> Dear all,
> 
> The following problem just has been submitted to me by an accountant.
> 
> In his new job, he has to close some old accounts. He has yearly 
> amounts, and a list of products that have been bought over the years, at 
> certain prices for which he has an exhaustive record. The problem is: He 
> does not know what product was bought this or that year (don't ask). He 
> does not want to find back the real story, but just write realistic 
> accounts, for which the sum of a subset of product prices will give the 
> exact yearly amount.
> 
> Here is a real example from his data:
> 
> # A list of 64 product prices
> products = 
> c(30500,30500,30500,30500,42000,42000,42000,42000,42000,42000,42000,42000,42000,42000,71040,90900,76950,35100,71190,
> 53730,456000,70740,70740,533600,83800,59500,27465,28000,28000,28000,28000,28000,26140,49600,77000,123289,27000,27000,27000,
> 27000,27000,27000,80000,33000,33000,55000,77382,48048,51186,40000,35000,21716,63051,15025,15025,15025,15025,800000,1110000,
> 59700,25908,829350,1198000,1031655)
> 
> # Global amount
> amount = 4748652
> 
> Now he wants to find all subsets of the 'product' vector which sums to 
> 'amount'.
> 
> I wrote the following code, which is clearly not optimal:
> 
> # Create a matrix of subsets of size r among the integer set 1:n
> subsets <- function(n, r, v = 1:n) {
>   if(r <= 0) vector(mode(v), 0)
>   else if(r >= n) v[1:n]
>   else rbind(cbind(v[1], Recall(n-1, r-1, v[-1])),Recall(n-1, r, v[-1]))
> }
> 
> # Main function
> find.amount = function(amount,products) {
>  
>   if(sum(products)<amount) {
>     cat("There is no solution.\n")
>     return()
>   }
>  
>   l = length(products)
>   cat("\nThere are",l,"product prices\n\n")
>   names(products) = paste("Product",1:l,sep="")
>   products = sort(products)
> 
>   for(i in 2:l) {
> 
>     # If the sum of the i smallest prices is greater than amount, then stop
>     if(sum(products[1:i])>amount) break
> 
>     # Look for matching subsets only in the case when the sum of i 
> largest prices is greater than amount
>     if(sum(rev(products)[1:i])>=amount) {
>       # Generates all subsets of i indicies in 1:l
>       subs = subsets(l,i)
>       nl = nrow(subs)
>       nc = ncol(subs)
> 
>       # Compute sums of corresponding price subsets
>       sums = rowSums(matrix(products[subs],nl,nc))
> 
>       # Which ones match the global amount ?
>       w = which(sums == amount)
>       how.many = length(w)
>       if(how.many>0) {
>         cat("\n-->> There are",how.many,"solutions with",nc,"products :\n")
>         for(j in 1:how.many) {
>           print(products[subs[w[j],]])
>         }
>       }
>       else cat("\n-->> There is no solution with",nc,"products.\n")
>     }
>     else cat("\n-->> There is no solution with",i,"products.\n")
>   }
> }
> 
> 
> Then I can use these functions on a smaller example:
> 
>  > find.amount(4,c(1,1,1,1,2,2))
> 
> and a number of matching subsets are provided. The problem is: This 
> approach creates a whole matrix of subsets of r integers among 1:n, 
> which rapidly gives huge matrices, and this is clearly not optimal for 
> the real data provided above.
> 
> Would anyone have a suggestion as to an alternative and more efficient 
> strategy?
> 
> Good luck,
> 
> Yvonnick Noel
> University of Brittany, Rennes 2
> France
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list