Amazon Elastic MapReduce with Talend

Submit EMR Job from Talend

Prerequisites

  • An AWS account with an access key and secret key.
  • AWS SDK for Java.
  • Talend Studio for Big Data.
  • Basic knowledge of how to create a job in Talend.

Steps

  1. Create a new job in Talend.
  2. Add the ‘tLibraryLoad’ and ‘tJava’ components and connect them (tLibraryLoad → tJava).
  3. tLibraryLoad settings: add the AWS SDK jars using the component’s advanced settings.
  4. Use the tJava advanced settings to import all required dependencies.
  5. Add the following code to tJava to submit the MapReduce job to EMR:
    System.out.println("Starting tJava component for EMR job.");
    // Set credentials (fill in your access key and secret key)
    String accessKey = "";
    String secretKey = "";
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonElasticMapReduceClient emrClient = new AmazonElasticMapReduceClient(credentials);
    StepFactory stepFactory = new StepFactory();

    // Hadoop jar step
    HadoopJarStepConfig jarStepConfig = new HadoopJarStepConfig();
    jarStepConfig.setJar("s3://mddwordcount1/MR_JAR/hadoop-0.20.2-examples.jar");
    jarStepConfig.setMainClass("wordcount");
    ArrayList<String> args = new ArrayList<String>();
    args.add("s3://mddwordcount1/input/catalina.2012-06-12.log");
    args.add("s3://mddwordcount1/output/wordcount");
    jarStepConfig.setArgs(args);

    // Debug step config that will help in troubleshooting failures.
    StepConfig enableDebugging = new StepConfig();
    enableDebugging.withName("Enable Debugging");
    enableDebugging.withActionOnFailure("TERMINATE_JOB_FLOW");
    enableDebugging.withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    // hadoop step config
    StepConfig hadoopJarConf = new StepConfig();
    hadoopJarConf.withName("Jar Test");
    hadoopJarConf.withActionOnFailure("TERMINATE_JOB_FLOW");
    hadoopJarConf.withHadoopJarStep(jarStepConfig);

    // instance config
    JobFlowInstancesConfig instancesConfig = new JobFlowInstancesConfig();
    instancesConfig.setMasterInstanceType("m1.small");
    instancesConfig.setSlaveInstanceType("m1.small");
    instancesConfig.setHadoopVersion("0.20.205");
    instancesConfig.setInstanceCount(2);
    instancesConfig.setPlacement(new PlacementType("us-east-1c"));
    instancesConfig.withKeepJobFlowAliveWhenNoSteps(false);
    instancesConfig.setTerminationProtected(false);

    // Job request creation.
    RunJobFlowRequest request = new RunJobFlowRequest();
    request.withName("CustomJarStepConfigtest");
    request.withSteps(enableDebugging, hadoopJarConf);
    request.withLogUri("s3://mddwordcount1/log");
    request.withInstances(instancesConfig);
    request.withAmiVersion("latest");

    // Finally, submit the job.
    RunJobFlowResult result = emrClient.runJobFlow(request);
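    // Optional sketch, not part of the original job: you could poll the new
    // job flow's state from here using the SDK 1.x DescribeJobFlows call,
    // instead of watching the EMR console. Uncomment to try it:
    // DescribeJobFlowsResult status = emrClient.describeJobFlows(
    //         new DescribeJobFlowsRequest().withJobFlowIds(result.getJobFlowId()));
    // System.out.println(status.getJobFlows().get(0).getExecutionStatusDetail().getState());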

  6. Run the job and check the output in the Amazon EMR console.
  7. In case of failure, select the job in the console and click ‘Debug’ to view the error logs.
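Step 4 above imports the dependencies the tJava code relies on. A sketch of that import list follows; the package paths assume the AWS SDK for Java 1.x, so verify them against the SDK version you loaded in tLibraryLoad:

    // Imports for the tJava advanced-settings Import field (AWS SDK for Java 1.x).
    import java.util.ArrayList;

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.util.StepFactory;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.PlacementType;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;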