Appirio's Tech Blog

Tuesday, December 22, 2009

Performing Lookups and Transformations with Talend

Ward Loving

In this article, I’d like to explore the workhorse of the Talend Open Studio – the tMap component. As you’d expect, this component allows you to visually map fields from an input source to fields on an output stream. It also gives you the ability to do lookups to other input files or databases while performing transformations on your data with basic java expressions. We’re going to manipulate some text files to give you a feel for this component’s capabilities, but before we dive in let’s discuss a scenario or two where you might need to use the tMap component:

Use Case A: You’re on a space walk doing some routine maintenance and you look up and see that your toolbox is slowly drifting out of reach into geosynchronous orbit.

Use Case B: You have some data that you need to move from one system to another. Three individual address fields in the source data need to be consolidated into a single text area field in the destination system. You also have a requirement to ensure that only valid email addresses are to be included the output files, but you don’t want to lose a whole record if an email address is invalid.

Sadly, data migration skills are probably not going to help you much with Use Case A. You can skip this article. But, with Use Case B, you’re in business. We’ll start where we left off in the TEST TALEND project that I created in the previous article in June. We won’t be reusing the code from the previous example but you might want to walk through the steps of setting up the repository and creating a new project if you haven’t done that before. You can find and download the latest Talend Open Studio by going to http://www.talend.com.

Open up the Talend Open Studio and select a project. Right-click on the Job Designs folder in the Repository tab and select Create Job. Go ahead and name the job – I called mine Second Job.

Fig 1 – Create Job



I’m am working quite a bit with data migrations to Salesforce.com these days and although you can pull data directly from Salesforce using Talend, I often find myself using Talend to work with files rather than using it to pull data out of Salesforce.com directly. Talend doesn’t allow us (yet), to adjust the number records exported in each batch so it is much quicker to use the Salesforce.com Data Loader to pull records out of Salesforce.com. The difference in performance between the two tools is not a big deal for a few thousand rows – but when you’re talking hundreds of thousands or millions of rows of data, every optimization is helpful. For this example let’s use Talend’s tFileInputDelimited Component which can be used to process the ‘lingua franca’ of Saleforce.com Data Migrations – the csv file. You can find this under File->Input in the components list or you just can just type tFileInput into the Palette Search field.

Fig 2 – Search Palette



Drag the tFileInputComponent onto the workspace.

Fig 3 - Workspace



We’re going to point this tFileInputComponent to a file. You can simply cut and paste the following data into a file called “sample_input_data.csv”:

FirstName, LastName, Email, Address1, Address2, Address3, GenderCode
John, Smith, jsmith@yahoo.com, 666 5th Ave, 19th Floor, Suite 15, M
Jane, Doe, jdoe@gmail.com, 444 8th Ave, Snyder Plaza, Building 7, F
Ralph, Emerson, invalid@email!com, 108 Street,, Apt 3B, M
Edgar, Poe, epoe@scarystory.com, 1000 Boston Street,,, M


For this example, I’ve put the file in my C:\data directory. Now if you double-click on the tFileInputComponent in the workspace it will bring up the Basic Settings of the Component tab at the bottom of the workspace.

Fig 4 – Component Tab – Basic Settings



I’ve highlighted the following settings in Fig 4 above which need to be updated:
  • Select the File Name for the sample_input_data file.
  • Select a comma, rather than a semi-colon as the field separator
  • Select the csv option checkbox
  • Put “1” in the Header field to designate that your input file has column headers.

We’ll also need to input the schema for our csv file, so click Edit Schema and enter the column headers for your file as shown in the figure below. Clicking the green plus sign adds a column entry and clicking the red cross removes one. There are also some handy controls here to import and export your schema entries. This can save you a lot of time when you are working with the same schema across multiple files. Click OK when you’ve completed the schema entries.

Fig 5 – Schema Dialog



Now switch to the Advanced Settings panel and click the “Trim All Columns” checkbox. As far as I can tell, there is almost no reason why you wouldn’t want to trim the fields of your csv files. Making it a habit to check this will save you a lot of headaches performing matching against your lookup files.

Fig 6 – Component Tab – Advanced Settings



You can also change the name of your component in the workspace on the View panel. I’ve changed the label for my tFileInputComponent to “Sample Input File” by updating the Label Format field.

Fig 7 – Component Tab - View



Next we’ll create a simple lookup file to process our Gender Codes. Cut and paste the following data into another file called “gender_lookup.csv”:

Code,Value
M,Male
F,Female


We’ll need to repeat the steps shown in Figures 3 – 7 for the second tFileInputComponent which is pointing to the gender lookup file:
  • Drag the tFileInputComponent onto the workspace
  • Update the Basic Settings – File Name, Separator, CSV Option, and Header
  • Create the schema for the Lookup file. It will have two columns: Code and Value.
  • Check the “Trim all columns” checkbox on the Advanced Tab
  • Update the component to a more user-friendly name (Optional). I’ve named mine “Gender Lookup”

Now in addition to input, our migration process is also going to have output. Let’s stick with the delimited file format for now and create another csv file for the output document using the tFileOutputDelimited component. It can be found under File->Output section of the Component Palette. Go ahead and drag the tFileOutputDelimited component out into the workspace.

Fig 8 – Component Tab – Basic Settings – tFileOutput Component



The steps here to update a tFileOutputComponent aren’t that different from the tFileInputComponent:
  • Select the File Name for the sample_output_data file.
  • Select a comma, rather than a semi-colon as the field separator
  • Select the Include Header checkbox to designate that you’d like your output file to have column headers.

We’ll also need to update the schema for the output component. Remember we have a requirement to consolidate the address fields from three separate fields down to one, so our schema is going to change a little from the input file version.

Fig 9 – tFileOutput Schema Dialog



Click ok when you’ve finished entering your schema. Now we’ll go to the advanced settings panel and click the “CSV Options” checkbox to ensure that any wandering commas are properly escaped.

Fig 10 - Component Tab - Advanced Settings – tFileOutputComponent



Switch to the View panel to update the name of the tFileOutputComponent. I’ve updated mine to “Sample Output File”. Your workspace should look something like this.

Fig 11 - Workspace with Input and Output Components



All our efforts have been in preparation for this moment -- so don’t blow this big guy. If you can take the pressure, and I know that you can, select the tMap component from the component palette under the Processing header and drag it onto the workspace.

Fig 12 – tMap



The crowd roars. That wasn’t so bad now, was it?

The next step is to create the flow for our job. Right click the Sample Input File component and select Row->Main as shown below.

Fig 13 - Create Flow



Connect the little plug to your tMap component. You also need to do the same thing with your Gender Lookup tFileInputComponent. When you right click and select Row->Main from the tMap to the Sample Data Output component the following dialog will pop up. I’ve named the flow “toOutputFile” as shown below. It is possible to create multiple outputs for each job but we just need one for this example.

Fig 14 – Output Name Dialog



A second dialog will come up asking if you want to get the schema of the target component. You do. When you’ve done all this your job should look something like this:

Fig 15 - Workspace with Connected Components



Double-click on the tMap component to bring up the Map Editor. Your input fields are on the left side of the screen and your output fields are on the right. To map one field to another simply drag the field from the left to the right side of the screen and drop it into the Expression column.

Fig 16 – Mapping Fields



After matching row1.FirstName to toOutputFile.FirstName you’ll see a yellow arrow marking the association. Match up the LastName field as well.

You can concatenate all three address fields together by simply dragging each of them to the single address field on the output side of the map editor.

Fig 17 - Address Mapping



Unfortunately, this turns out to be a little crude. If one of the address fields is blank, then you end up with unsightly gaps in your address. In Salesforce.com most street address fields are text areas, so it’s nice to preserve the address lines in the text area which are captured in the source data. Clicking the button labeled ‘…’ next to the toOutputFile Address field in the tMap Editor to brings up the Expression Editor. The button is marked with a red arrow in Figure 17.

Fig 18 - Expression Editor



Cut and paste the following code into the expression editor for more adult address processing:

(row1.Address1 != null && row1.Address1.length() > 0 ? row1.Address1 : "") +
(row1.Address2 != null && row1.Address2.length() > 0 ? "\n" + row1.Address2 : "" ) + (row1.Address3 != null && row1.Address3.length() > 0 ? "\n" + row1.Address3 : "")


This code simply checks to see if a value is present before adding a linefeed to the address line entry. Click Ok to update the expression for your Address field and return to the Map Editor.

To filter out invalid email address we can add a regular expression to the Expression Editor. Click the ‘…’ button next to the Email field and enter this expression:

(row1.Email != null &&
row1.Email.matches("^[\\w\\.-]+@([\\w\\-]+\\.)+[A-Za-z]{2,4}$") ? row1.Email.toLowerCase() : "")


This regular expression isn’t a perfect email validator, but it keeps out most of the riff-raff. If an email doesn’t match the regular expression, it probably isn’t a valid address and we’re just replacing the invalid text with an empty string. An Email type field in Salesforce.com will actually bounce records which have an invalid email format when you load them to the system, which may not be the desired behavior. We’ve only scratched the surface of the capabilities of the TOS Expression Editor. It has all kinds of string, numerical, and date processing functions available to you. The most frequent idiom that I use is a simple java ternary operator. In fact, we could use this idiom to process our gender codes if we wanted to:

(row1.GenderCode != null && row1.GenderCode.equals("M") ? "Male" :
     (row1.GenderCode != null && row1.GenderCode.equals("F") : ""))


This code simply resolves the GenderCode to “Male” or “Female” if the code equals “M” or “F” and returns an empty string if it doesn’t match either one. But let’s resolve our gender codes in more stylish fashion using a lookup. The simplicity of this example belies the power of this capability. It is possible to resolve foreign key relationships in this way between files with hundreds of thousands or even millions of records. Talend’s lookups are more powerful than VLookups in Excel as a way to create foreign key relationships for data migrations and are much more repeatable. There is no limit to the number of lookups you can include. My personal record is eleven in a single job. Although Talend has the capability of writing temp files to disk for large lookups (that is the button that looks like a disk drive on the row2 header in the Map Editor), I have found that you can run into memory issues if your data input has several million rows and your lookup file has several million rows as well. There’s only so much heap to go around. The way I’ve worked around this limitation is to divide the input file into parts using a separate Talend job, and then iterate through the parts using a tFileList component. You can check the Talend component documentation to get more information about tFileList configuration.

To create a lookup relationship between the Gender Code and the Gender Lookup file, drag the
row1. Gender Code field down to the Expr. Key column of the row2.Code field. You should see a purple arrow marking the association, or is it mauve? You know, these are things that keep me up at night.

Fig 19 – Lookup Mapping



Next, drag the row2.Value field to the toOutputFile.Gender field on the right side of the editor window. That’s all there is to it. Your complete mapping should look like the following.

Fig 20 – Completed Mapping



If you click the Inner Join checkbox on the header of the row2 editor, shown in Fig 19 above, only row1 records that have matching gender code will be written to your output file. Click Ok to close the Map Editor and let’s run this bad boy. Select the Run tab at the bottom of the screen and press the Run button.

Fig 21 – Run the Second Job



Check your output file to see the results. It should look something like this:

FirstName,LastName,Email,Address,Gender
"John","Smith","jsmith@yahoo.com","666 5th Ave
19th Floor
Suite 15","Male"
"Jane","Doe","jdoe@gmail.com","444 8th Ave
Snyder Plaza
Building 7","Female"
"Ralph","Emerson","","108 Street
Apt 3B","Male"
"Edgar","Poe","epoe@scarystory.com","1000 Boston Street","Male"


We’ll you haven’t retrieved a $100K grease gun from the abyss, but you have added a couple of tricks to you data migration toolkit. I encourage you to download the Open Studio and try it out.

Wednesday, December 9, 2009

Experimenting with Windows Azure

I've been experimenting with Windows Azure the past few days. It's great to see C# again :) and Visual Studio is such a great IDE. I went through the tutorials and built the Guestbook application from the documentation. There are two basic roles in the Azure Cloud Service. First, there's a web role project that is basically an ASP.NET (C#) application defined to run on the Azure Cloud. Secondly, there's a worker role project in the solution that handles the backend processing for the application. In this case, the worker role is resizing images to thumbnails. The web role and the worker role pass tasks between each other using a queue (Microsoft.WindowsAzure.CloudQueue.Client). This was an easy way to separate the front end code from the back end code. The web role can continue to handle user requests without waiting for the image to be processed.

Web Role: ASP.NET application accessible via HTTP / HTTPS endpoint and is commonly the front-end for the application.
Worker Roles: Background processes for the application.

You can have one or multiple web and worker roles within the same Azure service.

Basically, the example works like this...

  • User posts message and path to image
  • Web role grabs image and stores it in blob storage
  • Web role puts job on queue to resize image to thumbnail (image is rendered in the form in normal size)
  • Worker role checks queue for new messages until it finds the instruction to resize the thumbnail
  • Worker role resizes image
  • Web role refreshes page (image is resized via partial refresh)

This was a great starting point for learning Azure. It made use of the blob storage, table storage, and queues. I'm going to update the example and mirror the Guestbook post to insert data into Salesforce.com. I'll post the source code and the URLs when that's complete.


Google Groups added to Google Apps

Google Apps Premier and Education users will notice the addition of Google Groups to their available services this morning. Groups is hugely popular among communities, user groups, and even closed internal discussion forums but remained outside of the Google Apps portfolio. To overcome this, companies like Appirio, that use Google Groups for internal discussions have their users create identical Google accounts using their Google Apps username. We can now roll Groups right into our Google Apps domain. The following Google Groups features are now included in Google Apps Premier and Education Editions:

  • Fast set-up. Employees and students can now create collaborative groups instantly without burdening IT, and manage the group settings to fit their needs.
  • Searchable archives. Group discussions are archived by default, allowing users to easily search and view past and present discussions via the web.
  • Sharing with a group. Once a group is set up, employees and students can easily share a document, spreadsheet, presentation, shared folder, site, calendar, or video with that group. No need to type in individual email address manually, or remember who joined or left the group. Plus, the shared items will only be accessible by the appropriate people, even as people join and leave the group.
  • Reply on behalf of a group. In addition to communicating via email or the web interface, the new functionality lets group managers send a message on behalf of a group.
  • IT capabilities. IT administrators still manage if and how users can create groups from the administrative control panel.

Here's some quick instructions on how to extend your domain for Groups. Note, this will only work for Premier and Education users. First, login to your Apps dashboard. http://www.google.com/a/cpanel/domainname

Next to Service Setting, click the Add more services... link.

That's it! From your dashboard you can drill back into the console for the Groups service and set your preferences from there. You have the option to make your Groups public or internal, grant access to those who can create Groups, and restrict emails from outside your domain. All the same options you have with Groups today. Have fun... while you're adding services check out Google Apps Short Links(labs) and Google Moderator (labs, but a great example of a Google App Engine application)... if you don't have a Premier of Education version, but want to see more detail on the new offering, check out the short video below.


Tuesday, December 1, 2009

Single Sign-On with SAML on Force.com - Jeff Douglas on the Force.com Blog

Jeff Douglas has posted a great article on Single Sign-On with SAML on Force.com:
"With the proliferation of SaaS and other web-based applications, identity management is becoming a major concern for businesses. Just think about the number of usernames and password you regularly type each day. You probably log into your company's network, portal, webmail, benefits system, Google Apps, bespoke applications and of course Force.com applications. Now multiply this by the number of users in your company and think about the support and security implications. You need dedicated resources to manage your identity store, respond to password reset requests, provision new users for each system and deactivate users that no longer need access. Just think of the number of man hours you could save if you could eliminate 25-50% of your passwords and their associated costs!

Implementing a Single Sign-On (SSO) infrastructure enables users to sign in once and have access to all authorized resources. In this article, we'll look at the different methods of implementing SSO with Force.com, how to set up your own open source identity management system for federated authentication using SAML 2, and how to configure the Force.com platform to utilize your new identify provider. We'll also provide some troubleshooting techniques and outline some best practices to help you avoid common roadblocks, getting you up and running fast..."
Read more at the Force.com Blog...
 
2006-2012 Appirio Inc. All rights reserved.
Appirio.com | Support | Resource Center | Contact | Careers | Privacy Policy