Amazon Elastic MapReduce with Talend


Submit EMR Job from Talend

Prerequisites

  • Amazon account with an AccessKey and SecretKey.
  • AWS SDK for Java.
  • Talend Studio for Big Data.
  • Basic knowledge of how to create a job in Talend.

Steps

  1. Create a new job in Talend.
  2. Use the components ‘tLibraryLoad’ and ‘tJava’, and connect them as shown below.
  3. tLibraryLoad settings: add the AWS SDK JARs in tLibraryLoad using the Advanced settings tab.
  4. Use the tJava Advanced settings tab to import all required dependencies.
  5. Code to submit the MR job to EMR in tJava:
    System.out.println("Starting tJava component for EMR job.");
    // Set credentials (accessKey and secretKey)
    String accessKey = ""; // your AWS access key
    String secretKey = ""; // your AWS secret key
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonElasticMapReduceClient emrClient = new AmazonElasticMapReduceClient(credentials);
    StepFactory stepFactory = new StepFactory();

    // Hadoop jar step
    HadoopJarStepConfig jarStepConfig = new HadoopJarStepConfig();
    jarStepConfig.setJar("s3://mddwordcount1/MR_JAR/hadoop-0.20.2-examples.jar");
    jarStepConfig.setMainClass("wordcount");
    ArrayList<String> args = new ArrayList<String>();
    args.add("s3://mddwordcount1/input/catalina.2012-06-12.log");
    args.add("s3://mddwordcount1/output/wordcount");
    jarStepConfig.setArgs(args);

    // Debugging step config; makes failures easier to diagnose.
    StepConfig enableDebugging = new StepConfig();
    enableDebugging.withName("Enable Debugging");
    enableDebugging.withActionOnFailure("TERMINATE_JOB_FLOW");
    enableDebugging.withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    // Hadoop jar step config
    StepConfig hadoopJarConf = new StepConfig();
    hadoopJarConf.withName("Jar Test");
    hadoopJarConf.withActionOnFailure("TERMINATE_JOB_FLOW");
    hadoopJarConf.withHadoopJarStep(jarStepConfig);

    // Instance config
    JobFlowInstancesConfig instancesConfig = new JobFlowInstancesConfig();
    instancesConfig.setMasterInstanceType("m1.small");
    instancesConfig.setSlaveInstanceType("m1.small");
    instancesConfig.setHadoopVersion("0.20.205");
    instancesConfig.setInstanceCount(2);
    instancesConfig.setPlacement(new PlacementType("us-east-1c"));
    instancesConfig.withKeepJobFlowAliveWhenNoSteps(false);
    instancesConfig.setTerminationProtected(false);

    // Job request creation.
    RunJobFlowRequest request = new RunJobFlowRequest();
    request.withName("CustomJarStepConfigtest");
    request.withSteps(enableDebugging, hadoopJarConf);
    request.withLogUri("s3://mddwordcount1/log");
    request.withInstances(instancesConfig);
    request.withAmiVersion("latest");

    // Finally, submit the job.
    RunJobFlowResult result = emrClient.runJobFlow(request);

  6. Run your job and check the output in the Amazon EMR console.
  7. In case of failure, select your job and click ‘Debug’ to view the error logs.
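Instead of switching to the EMR console to watch the job, you can also poll its state from the same tJava component. The sketch below is one possible way to do this with the same generation of the AWS SDK for Java used above (the `DescribeJobFlows` API); it assumes the `emrClient` and `result` variables from step 5 are in scope, and the state names shown are illustrative of the job flow lifecycle, so adjust them to your needs.

```java
// Hypothetical polling snippet, appended after emrClient.runJobFlow(request)
// from step 5. Assumes `emrClient` and `result` are already defined.
String jobFlowId = result.getJobFlowId();
while (true) {
    DescribeJobFlowsRequest describeRequest =
        new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId);
    List<JobFlowDetail> flows =
        emrClient.describeJobFlows(describeRequest).getJobFlows();
    String state = flows.get(0).getExecutionStatusDetail().getState();
    System.out.println("Job flow " + jobFlowId + " is " + state);
    // Stop polling once the job flow reaches a terminal state.
    if ("COMPLETED".equals(state) || "FAILED".equals(state)
            || "TERMINATED".equals(state)) {
        break;
    }
    Thread.sleep(30 * 1000); // wait 30 seconds between polls
}
```

Because `KeepJobFlowAliveWhenNoSteps` is set to `false` in step 5, the cluster shuts down on its own after the step finishes, so the loop is only for observing progress, not for cleanup.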