[R] parLapply on sqlQuery (from package RODBC)
Frede Aakmann Tøgersen
frtog at vestas.com
Tue Jul 8 07:14:48 CEST 2014
Hi Dylan
Please remember to post to the list. You'll have a greater chance to get answers to your questions by doing so.
Also my name is Frede and not Med. The greetings "Med venlig hilsen" is the Danish translation of "Yours sincerely". Vestas is a global company so even the corporate language is English our signature in our emails is honoring the regional greetings from the country that we work in.
Yours sincerely / Med venlig hilsen
Frede Aakmann Tøgersen
Specialist, M.Sc., Ph.D.
Plant Performance & Modeling
Technology & Service Solutions
T +45 9730 5135
M +45 2547 6050
frtog at vestas.com
http://www.vestas.com
Company reg. name: Vestas Wind Systems A/S
This e-mail is subject to our e-mail disclaimer statement.
Please refer to www.vestas.com/legal/notice
If you have received this e-mail in error please contact the sender.
> -----Original Message-----
> From: Tevlin, Dylan [mailto:Dylan.Tevlin at kochind.com]
> Sent: 7. juli 2014 16:13
> To: Frede Aakmann Tøgersen
> Subject: RE: parLapply on sqlQuery (from package RODBC)
>
> Hi Med, thanks for the reply.
>
> Maybe I should clarify, if I have a list of names as follows
>
> names<-c("A", "B", "A", "C","C")
>
> and a table in the database as follows
>
> clientid clientname
> 1 A
> 2 B
> 3 C
> 4 D
> 5 E
>
> I need to convert the names list to the following ids list
>
> ids<-c(1,2,1,3,3)
>
> There are two ways I can see doing this.
> One would be to read my table into R (or more efficiently read in only the
> needed entries by your method), and then make an apply routine over the
> list of names to get the ids, which is straightforward.
>
> Another would be to skip the intermediary step of reading the table in and
> apply over sql queries on names to get ids. A further improvement on this
> would be to parallelize this apply routine.
>
> I think the efficiency comparison between these two methods would be
> dependent on the size of the table, network connection, distance between
> server and client, so it is probably the case that your method combined with
> a bit more functionality would be the quickest way to accomplish my task.
>
> Just from a learning standpoint, though, I wanted to understand how to pass
> a database connection into parLapply, and more generally how to pass a
> second parameter into parLapply. It turns out I needed to use clusterEvalQ
> to set the environment of each of my "thread" nodes, since the global
> environment isn't preserved when a new node is created.
>
> -----Original Message-----
> From: Frede Aakmann Tøgersen [mailto:frtog at vestas.com]
> Sent: Thursday, July 03, 2014 12:51 AM
> To: Tevlin, Dylan; r-help at r-project.org
> Subject: RE: parLapply on sqlQuery (from package RODBC)
>
> Hi
>
> Why are you doing duplicate queries to the database (two As and Cs in your
> names vector)?
>
> Why do 5 simultaneously connection to the database server? Woukld you do
> 500 connections?
>
> Why not do one query and let the database server do the job for you?
>
> Try this:
>
> > options(useFancyQuotes = FALSE)
> >
> > query0 <- "select id from table where name in (%s)"
> >
> > names <- paste(sQuote(LETTERS[1:5]), collapse = ",") names
> [1] "'A','B','C','D','E'"
> >
> > query <- sprintf(query0, names)
> > query
> [1] "select id from table where name in ('A','B','C','D','E')"
> >
> > dataFromDB <- sqlQuery(dbConn, query)
>
> This should work for MS SQL and MySQL servers.
>
> Yours sincerely / Med venlig hilsen
>
>
> Frede Aakmann Tøgersen
> Specialist, M.Sc., Ph.D.
> Plant Performance & Modeling
>
> Technology & Service Solutions
> T +45 9730 5135
> M +45 2547 6050
> frtog at vestas.com
> http://www.vestas.com
>
> Company reg. name: Vestas Wind Systems A/S This e-mail is subject to our e-
> mail disclaimer statement.
> Please refer to www.vestas.com/legal/notice If you have received this e-mail
> in error please contact the sender.
>
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org]
> > On Behalf Of Tevlin, Dylan
> > Sent: 2. juli 2014 23:49
> > To: r-help at r-project.org
> > Subject: [R] parLapply on sqlQuery (from package RODBC)
> >
> > R Version : 2.14.1 x64
> > Running on Windows 7
> >
> > Connecting to a database on a remote Microsoft SQL Server 2012
> >
> > The short form of my problem is the following.
> >
> > I have an unordered vectors of names, say:
> >
> > names<-c("A", "B", "A", "C","C")
> >
> > each of which have an id in a table in my db. I need to convert the
> > names to their corresponding ids.
> >
> > I currently have the following code to do it.
> > ###
> > names<-c("A", "B", "A", "C","C")
> > dbConn<-odbcDriverConnect(connection="connection string")
> > #successfully connects
> >
> > nameToID<-function(name, dbConn){
> > #dbConn : active db connection formed via odbcDriverConnect
> > #name : a char string
> >
> > sqlQuery(dbConn, paste("select id from table where
> > name='", name, "'", sep="")) } sapply(names, nameToID, dbConn=dbConn)
> > ###
> >
> > Barring better ways to do this, which could involve loading the table
> > into R then working with the problem there (which is possible), I
> > understand why the following doesn't work, but I cannot seem to find a
> > solution. Attempting to use parallelization via the package 'parallel' :
> >
> > ###
> > names<-c("A", "B", "A", "C","C")
> > dbConn<-odbcDriverConnect(connection="connection string")
> > #successfully connects
> >
> > nameToID<-function(name, dbConn){
> > #dbConn : active db connection formed via odbcDriverConnect
> > #name : a char string
> >
> > sqlQuery(dbConn, paste("select id from table where
> > name='", name, "'", sep="")) }
> >
> > mc<-detectCores()
> > cl<-makeCluster(mc)
> > clusterExport(cl, c("sqlQuery", "dbConn"))
> > parSapply(cl, names, nameToID, dbConn=dbConn) #incorrect passing of
> > nameToID's second argument
> > ###
> >
> > As in the comment, this is not the correct way to assign the second
> > argument to nameToID.
> >
> > I have also tried the following:
> >
> > parSapply(cl, names, function(x) nameToID(x, dbConn))
> >
> > in place of the previous parSapply call, but that also does not work,
> > with the error being thrown saying "the first parameter is not an open
> > RODBC connection", presumably referring to the first parameter of the
> > sqlQuery()
> >
> > The following code does work with parallization.
> >
> > ###
> > names<-c("A", "B", "A", "C","C")
> > dbConn<-odbcDriverConnect(connection="connection string")
> > #successfully connects nameToID<-function(name){
> > #name : a char string
> > dbConn<-odbcDriverConnect(connection="string")
> > result<-sqlQuery(dbConn, paste("select id from table
> > where name='", name, "'", sep=""))
> > odbcClose(dbConn)
> > result
> > }
> >
> > mc<-detectCores()
> > cl<-makeCluster(mc)
> > clusterExport(cl, c("sqlQuery", "odbcDriverConnect", "odbcClose",
> "dbConn",
> > "nameToID")) #throwing everything in
> > parSapply(cl, names, nameToID)
> > ###
> >
> > But the constant opening and closing a ton of the gains from
> > parallelization, and seems just a bit silly.
> >
> > So the overall question would be how to pass the second parameter (the
> > open db connection) to the function within parSapply, in much the same
> > way as it is done in the regular apply? In general, how does one pass
> > a second, third, nth parameter to a function within a parallel routine?
> >
> > Thanks and if you need any more information let me know.
> >
> > -DT
> >
> >
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-
> > guide.html and provide commented, minimal, self-contained,
> > reproducible code.
More information about the R-help
mailing list