SAP HANA Academy – R: 10. Clustering via SQL Script [2.0 SPS 02]


Hi I’m Philip from the HANA Academy. In this series of videos were looking at R integration for SAP HANA. In this video we’re going to see how we can embed R integration for HANA into a HANA stored procedure and actually access that and use some R script. So previously in the getting started video for SPS 02 we saw the YouTube channel and the GitHub repository. So what we need to now do is go to R integration repository on GitHub and R 10 Clustering and that’s got the code that were actually going to be using. So I’m gonna go and copy that and then I’m going to go to the database Explorer which I happen to be using from WEB IDE for HANA. Where I’m actually logged in as the R user and we created that in the getting started video previously. So please reference that if you’re not familiar. I’m going to open up a SQL window for the R user and I’m going to paste in this script. So let’s scroll to the top and have a look at what we’re actually going to do. Now the way we work with R integration for HANA is we need to work with tables and we need to create R script language. So if we want to pass parameters or data to R we need to make sure they’re in HANA tables. If we want to get results back or anything back from R into HANA again we need to specify tables. So when we’re working with a stored procedure we typically are going to use table types to define the structure of those tables. So the first thing we’re doing on line 10 is that we’re creating some table types. One for the input data. I’ve got some customer data in line below that we’re going to use for the clustering in this example. So we’ve got a table type for the data with an ID and then the four items I want to be clustered and we can see we’ve got the data types integer double whatever. Then we’ve got a parameter table and you don’t have to have a parameter table but if you want to send options or parameters then you can’t just put them in a single parameters on the procedure we’re going to create. You need to do it in a table based scenario. So I’m creating a table with name value pairs where I can slot in in this case integer parameters like the number of clusters to create in my clustering. Finally I’ve got a results table type which has just got the ID which relates to the input table and a cluster which will be an integer number of the cluster that that IDs been assigned to. It’s a very straightforward. Now the interesting bit is where we actually create R procedure. So you use the normal create procedure. How you would normally typically create stored procedures in SQL script in HANA without input and output table types and we can get the names data parameters and results and the crucial part here is we specify the language. Now the default language will be SQL script but in this case we’re going to use R script. So we need to specify R Lang. So that’s the language going to use it’s going to be R. So that everything that’s in between the begin and end is R script. This is what will actually get run executed on the R server remotely. So HANA will wrap this up and send it over and actually have it executed. So it can be absolutely anything you want to do in R. There are thousands of algorithms you can write your own. You can use loads and loads of libraries in this case for a simple example I’m just doing some simpler clustering k-means cluster basically. So for that I’m going to use the cluster library. Of course if you’re gonna reference the library make sure you’ve installed that in R previously but that’s something that you manage on the R side and then I’m going to do a simple call to k-means within the cluster library. I’m gonna give it those four input columns that I’m getting out of data and notice that it’s basically this is the syntax I would use. We’re working with vectors in R so actually they’re not in tables anymore as part of the process HANA transformed the data from HANA tables into data vectors used in R. Then we’ve got the same for the parameters where I want to pick out the cluster’s row because I’m going to create a row with a value for the number of clusters that I want to have with its value. So I’ll say how many clusters to create and I’m calling that R script and finally we’ve got we want to get the results out so I’m going to basically bind the ID which is coming from the input data with the clusters’ actual data that’s from the call to the clusters I just made previously and drop that into a results data vector and then that results data vector will be transformed into the results table by the HANA R integration actually when it’s bringing stuff back to HANA. So let’s go ahead and run that code we’re going to create the table types and we’re going to create the stored procedure. So that’s now run the next thing would actually to be look at that what you were gonna do at runtime. So if we’re gonna work with tables we need to create a data table a parameter table and a results table. Let’s go ahead and do that. Of course you these tables may exist in your in your environment. Now for this simple example I’m just going to insert some rows in line into my table about 50 rows that we can then cluster. So that’s the data table. I’m also going to set up my parameter table where I’ve got a row with a value of clusters. So the the name is clusters and the value is three. So I want to create three clusters please. So we can see how we can pass parameters into our R script and that’s all the setup we need to do. The next thing is simply just to call it and as long as the user has got the create R script authorization as we saw in the previous video and as long as your R servers up and running and you specified how to connect it through a remote source as we saw in the previous video and that remote source has been authorized for the user. In our case R user. We saw that in the previous video. Then we should be good to go and actually run this. So it runs and if we have a look down here we can see we’ve got 50 rows back. There abouts and we’ve got the cluster that it’s been assigned a value between 1 and 3. So we’ve been running the cluster process actually in R. Finally if we want to run this rather than just getting results in line we want to put the results into R table then in this case we would actually call it with the function and the three table names and with overview and that will actually then mean that when it runs it will actually put the results into a table R results table. So if we go and actually look at R users’ tables. Let’s go to R user look at the tables. We’ve got a table called results and if we open data we can see we’ve got our ID and cluster. So we’ve actually got R clustering done. So that’s really it. It’s very very straightforward. The main thing that is that you need to create a stored procedure and in that stored procedure you’re going to use the language R Lang and then between the beginning begin and end you’re going to specify your R script and of course you need table types and tables in HANA to deal with any input and output that you want to do. So just be aware that the bandwidth can be quite important between HANA and your R server so make sure you’re not sending unnecessary data across particularly vchar long or complex strings unless you’re actually using them. Try to only send the minimum data from HANA to R and likewise only sent back what you’re actually going to need inside of HANA later. So that’s how we can work when we’re using a SQL script and we’re actually using a stored procedure that allows us to call out to R from HANA. You can find more HANA video tutorials on our YouTube channel. If you’d like to be informed as soon as new video tutorials are published please subscribe to the channel. You can also connect with us on LinkedIn, follow us on Twitter and we have pages on Facebook and Google+. Code snippets and demo data for all of our tutorials are published to GitHub. Thanks for watching.

Leave a Reply

Your email address will not be published. Required fields are marked *