Getting Started with Spark-Submit and SDK

A tutorial to help you use Java SDK code to run a Spark application in Data Flow using spark-submit with the execute string.

Follow the existing tutorial, Getting Started with Oracle Cloud Infrastructure Data Flow, but use the Java SDK to run the spark-submit commands.

Before You Begin

Complete these prerequisites before you use spark-submit commands in Data Flow with the Java SDK.

Set up your tenancy.
Set up your keys.

Create a separate project and add the Oracle Cloud Infrastructure Java SDK dependency (shown here in Maven format):

<dependency>
  <groupId>com.oracle.oci.sdk</groupId>
  <artifactId>oci-java-sdk-dataflow</artifactId>
  <version>${oci-java-sdk-version}</version>
</dependency>
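
The examples in this tutorial authenticate with the DEFAULT profile stored in ~/.oci/config. Before running them, it can help to confirm that the profile carries the entries that API-key authentication usually needs. The following is a minimal sketch that uses only the JDK. The class OciConfigCheck and its methods are hypothetical helpers, not part of the OCI SDK, and the list of required keys (user, fingerprint, tenancy, region, key_file) is an assumption based on the standard API-key configuration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of the OCI SDK).
final class OciConfigCheck {

  // Keys assumed to be required for API-key authentication.
  static final List<String> REQUIRED_KEYS =
      Arrays.asList("user", "fingerprint", "tenancy", "region", "key_file");

  // Returns the required keys that are absent from the named profile.
  static List<String> missingKeys(List<String> lines, String profile) {
    Set<String> found = new HashSet<>();
    boolean inProfile = false;
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      if (line.startsWith("[") && line.endsWith("]")) {
        // A section header such as [DEFAULT] starts a new profile.
        inProfile = line.substring(1, line.length() - 1).trim().equals(profile);
        continue;
      }
      int eq = line.indexOf('=');
      if (inProfile && eq > 0) {
        found.add(line.substring(0, eq).trim());
      }
    }
    List<String> missing = new ArrayList<>(REQUIRED_KEYS);
    missing.removeAll(found);
    return missing;
  }

  public static void main(String[] args) throws IOException {
    Path config = Paths.get(System.getProperty("user.home"), ".oci", "config");
    if (!Files.isReadable(config)) {
      System.out.println("Config file not found or not readable: " + config);
      return;
    }
    List<String> missing =
        missingKeys(Files.readAllLines(config, StandardCharsets.UTF_8), "DEFAULT");
    System.out.println(missing.isEmpty()
        ? "DEFAULT profile looks complete."
        : "DEFAULT profile is missing: " + missing);
  }
}
```

The check only looks at key names. It does not verify that the key file exists or that the values are valid.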

1. ETL with Java

Use spark-submit and the Java SDK to perform ETL with Java.

Using spark-submit and the Java SDK, complete the exercise ETL with Java from the tutorial Getting Started with Oracle Cloud Infrastructure Data Flow.

Set up your tenancy.
If you do not have a bucket in Object Storage where you can save the input and the results, create one with a suitable folder structure. In this example, the folder structure is /output/.

Run this code:

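import java.io.IOException;

import com.oracle.bmc.ConfigFileReader;
import com.oracle.bmc.Region;
import com.oracle.bmc.auth.AuthenticationDetailsProvider;
import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
import com.oracle.bmc.dataflow.DataFlowClient;
import com.oracle.bmc.dataflow.model.CreateRunDetails;
import com.oracle.bmc.dataflow.requests.CreateRunRequest;
import com.oracle.bmc.dataflow.responses.CreateRunResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ETLWithJavaExample {
 
  private static final Logger logger = LoggerFactory.getLogger(ETLWithJavaExample.class);
  String compartmentId = "<compartment-id>"; // Replace with your compartment OCID
 
  public static void main(String[] args){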
    System.out.println("ETL with JAVA Tutorial");
    new ETLWithJavaExample().createRun();
  }
 
  public void createRun(){
 
    ConfigFileReader.ConfigFile configFile = null;
    // Authentication Using config from ~/.oci/config file
    try {
      configFile = ConfigFileReader.parseDefault();
    }catch (IOException ie){
      logger.error("Need to fix the config for Authentication ", ie);
      return;
    }
 
    try {
    AuthenticationDetailsProvider provider =
        new ConfigFileAuthenticationDetailsProvider(configFile);
 
    // Creating a Data Flow Client
    DataFlowClient client = new DataFlowClient(provider);
    client.setRegion(Region.US_PHOENIX_1);
 
    // creation of execute String
    String executeString = "--class convert.Convert "
        + "--files oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv "
        + "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar "
        + "kaggle_berlin_airbnb_listings_summary.csv oci://<bucket-name>@<namespace-name>/output/optimized_listings";
 
    // Create the run details and submit the run.
    CreateRunDetails runDetails = CreateRunDetails.builder()
        .compartmentId(compartmentId).displayName("Tutorial_1_ETL_with_JAVA").execute(executeString)
        .build();
 
    CreateRunRequest runRequest = CreateRunRequest.builder().createRunDetails(runDetails).build();
    CreateRunResponse response = client.createRun(runRequest);
 
    logger.info("Successful run creation for ETL_with_JAVA with OpcRequestID: "+response.getOpcRequestId()
        +" and Run ID: "+response.getRun().getId());
 
    }catch (Exception e){
      logger.error("Exception creating run for ETL_with_JAVA ", e);
    }
 
  }
}

If you have run this tutorial before, delete the contents of the output directory, oci://<bucket-name>@<namespace-name>/output/optimized_listings, so that the run does not fail.
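
The output location above has the form oci://<bucket-name>@<namespace-name>/<path>. As an illustration of how the pieces fit together, the sketch below splits such a URI into bucket, namespace, and object prefix, which is the information you need to find the objects to delete before a rerun. OciUri is a hypothetical helper written for this example, not an SDK class, and the bucket and namespace used in main are made-up values.

```java
// Hypothetical helper (not part of the OCI SDK).
final class OciUri {
  final String bucket;
  final String namespace;
  final String path;

  private OciUri(String bucket, String namespace, String path) {
    this.bucket = bucket;
    this.namespace = namespace;
    this.path = path;
  }

  // Splits oci://<bucket>@<namespace>/<path> into its three parts.
  static OciUri parse(String uri) {
    String scheme = "oci://";
    if (uri == null || !uri.startsWith(scheme)) {
      throw new IllegalArgumentException("Expected a URI starting with oci://");
    }
    String rest = uri.substring(scheme.length());
    int slash = rest.indexOf('/');
    String authority = slash < 0 ? rest : rest.substring(0, slash);
    String path = slash < 0 ? "" : rest.substring(slash + 1);
    int at = authority.indexOf('@');
    if (at <= 0 || at == authority.length() - 1) {
      throw new IllegalArgumentException("Expected <bucket>@<namespace> after oci://");
    }
    return new OciUri(authority.substring(0, at), authority.substring(at + 1), path);
  }

  public static void main(String[] args) {
    // Made-up bucket and namespace, for illustration only.
    OciUri out = parse("oci://my-bucket@my-namespace/output/optimized_listings");
    System.out.println("bucket=" + out.bucket
        + ", namespace=" + out.namespace
        + ", prefix=" + out.path);
  }
}
```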

Note

To find the compartment ID, open the navigation menu, select Identity, and then select Compartments. The available compartments are listed, including the OCID of each one.
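
If the <compartment-id> placeholder is left in the code, the mistake would only surface once the request is sent. As a small guard, the sketch below checks the value before any call is made. CompartmentIdCheck is a hypothetical helper, and the rule it applies (the value starts with ocid1.compartment. or, for the root compartment, ocid1.tenancy.) is an assumption about the usual OCID layout, not a full validation.

```java
// Hypothetical helper (not part of the OCI SDK).
final class CompartmentIdCheck {

  // Returns true if the value plausibly is a compartment OCID.
  // Assumption: compartment OCIDs start with "ocid1.compartment.", and the
  // root compartment uses the tenancy OCID ("ocid1.tenancy.").
  static boolean looksLikeCompartmentOcid(String id) {
    if (id == null) {
      return false;
    }
    String value = id.trim();
    if (value.isEmpty() || value.contains("<") || value.contains(">")) {
      return false; // empty, or still a <placeholder>
    }
    return value.startsWith("ocid1.compartment.") || value.startsWith("ocid1.tenancy.");
  }

  public static void main(String[] args) {
    String compartmentId = "<compartment-id>"; // placeholder from the tutorial code
    if (!looksLikeCompartmentOcid(compartmentId)) {
      System.out.println("Set compartmentId to a real compartment OCID before creating a run.");
    }
  }
}
```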

2. Machine Learning with PySpark

Use spark-submit and the Java SDK to perform machine learning with PySpark.

Using spark-submit and the Java SDK, complete 3. Machine Learning with PySpark from the tutorial Getting Started with Oracle Cloud Infrastructure Data Flow.

Complete exercise 1. ETL with Java before you attempt this exercise, because its results are used here.
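
Because this exercise reads the output of exercise 1, the first run needs to have finished successfully before the second one is created. The example code logs a message once a run has been created, not when it completes, so one option is to poll the run state until it reaches a final value. The sketch below shows only the polling loop. RunWaiter and its stateSupplier parameter are hypothetical, and the terminal state names (SUCCEEDED, FAILED, CANCELED, STOPPED) are assumed to match the Data Flow run lifecycle. In a real program, the supplier would typically wrap a call that fetches the run by its ID through the Data Flow client.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper (not part of the OCI SDK).
final class RunWaiter {

  // Assumed terminal lifecycle states of a Data Flow run.
  static final Set<String> TERMINAL_STATES =
      new HashSet<>(Arrays.asList("SUCCEEDED", "FAILED", "CANCELED", "STOPPED"));

  // Polls the supplier until it reports a terminal state or attempts run out.
  // Returns the last state seen (null if maxAttempts is less than 1).
  static String waitForTerminalState(
      Supplier<String> stateSupplier, int maxAttempts, long delayMillis) {
    String state = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      state = stateSupplier.get();
      if (TERMINAL_STATES.contains(state)) {
        return state;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
          return state;
        }
      }
    }
    return state;
  }

  public static void main(String[] args) {
    // Stub that replays a fixed sequence of states instead of calling the service.
    Iterator<String> states =
        Arrays.asList("ACCEPTED", "IN_PROGRESS", "SUCCEEDED").iterator();
    String finalState = waitForTerminalState(states::next, 10, 0L);
    System.out.println("Final state: " + finalState);
  }
}
```

Only a final state of SUCCEEDED indicates that the output of exercise 1 is ready to use. Any other result means the first run should be checked before continuing.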

Run the following code:

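import java.io.IOException;

import com.oracle.bmc.ConfigFileReader;
import com.oracle.bmc.Region;
import com.oracle.bmc.auth.AuthenticationDetailsProvider;
import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
import com.oracle.bmc.dataflow.DataFlowClient;
import com.oracle.bmc.dataflow.model.CreateRunDetails;
import com.oracle.bmc.dataflow.requests.CreateRunRequest;
import com.oracle.bmc.dataflow.responses.CreateRunResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PySParkMLExample {
 
  private static final Logger logger = LoggerFactory.getLogger(PySParkMLExample.class);
  String compartmentId = "<compartment-id>"; // Replace with your compartment OCID
 
  public static void main(String[] args){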
    System.out.println("ML_PySpark Tutorial");
    new PySParkMLExample().createRun();
  }
 
  public void createRun(){
 
    ConfigFileReader.ConfigFile configFile = null;
    // Authentication Using config from ~/.oci/config file
    try {
      configFile = ConfigFileReader.parseDefault();
    }catch (IOException ie){
      logger.error("Need to fix the config for Authentication ", ie);
      return;
    }
 
    try {
    AuthenticationDetailsProvider provider =
        new ConfigFileAuthenticationDetailsProvider(configFile);
 
    DataFlowClient client = new DataFlowClient(provider);
    client.setRegion(Region.US_PHOENIX_1);
 
    String executeString = "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_pyspark_ml.py oci://<bucket-name>@<namespace-name>/output/optimized_listings";
 
    // Create the run details and submit the run.
 
    CreateRunDetails runDetails = CreateRunDetails.builder()
        .compartmentId(compartmentId).displayName("Tutorial_3_ML_PySpark").execute(executeString)
        .build();
 
    CreateRunRequest runRequest = CreateRunRequest.builder().createRunDetails(runDetails).build();
    CreateRunResponse response = client.createRun(runRequest);
 
    logger.info("Successful run creation for ML_PySpark with OpcRequestID: "+response.getOpcRequestId()
        +" and Run ID: "+response.getRun().getId());
 
    }catch (Exception e){
      logger.error("Exception creating run for ML_PySpark ", e);
    }
 
 
  }
}
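
The two execute strings in this tutorial share the shape of a spark-submit command line: optional flags such as --class and --files, then the application file (a JAR or a Python script), then the application's own arguments. To make that structure explicit, the sketch below assembles the string from its parts. ExecuteStringBuilder is a hypothetical helper, not part of the SDK, and the paths in main are shortened placeholders instead of the tutorial's real locations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the OCI SDK).
final class ExecuteStringBuilder {
  private final List<String> options = new ArrayList<>();
  private final List<String> arguments = new ArrayList<>();
  private String application;

  // Adds a flag and its value, for example ("--class", "convert.Convert").
  ExecuteStringBuilder option(String flag, String value) {
    options.add(flag);
    options.add(value);
    return this;
  }

  // Sets the application file: a JAR or a Python script.
  ExecuteStringBuilder application(String uri) {
    this.application = uri;
    return this;
  }

  // Adds the arguments that are passed to the application itself.
  ExecuteStringBuilder arguments(String... args) {
    arguments.addAll(Arrays.asList(args));
    return this;
  }

  // Joins options, application, and arguments with single spaces.
  // Values that contain spaces are not quoted by this sketch.
  String build() {
    if (application == null) {
      throw new IllegalStateException("application is required");
    }
    List<String> parts = new ArrayList<>(options);
    parts.add(application);
    parts.addAll(arguments);
    return String.join(" ", parts);
  }

  public static void main(String[] args) {
    // Placeholder paths, for illustration only.
    String javaJob = new ExecuteStringBuilder()
        .option("--class", "convert.Convert")
        .option("--files", "oci://<bucket-name>@<namespace-name>/input.csv")
        .application("oci://<bucket-name>@<namespace-name>/app.jar")
        .arguments("input.csv", "oci://<bucket-name>@<namespace-name>/output/optimized_listings")
        .build();
    String pythonJob = new ExecuteStringBuilder()
        .application("oci://<bucket-name>@<namespace-name>/job.py")
        .arguments("oci://<bucket-name>@<namespace-name>/output/optimized_listings")
        .build();
    System.out.println(javaJob);
    System.out.println(pythonJob);
  }
}
```

The resulting string is what the examples pass to the execute method of CreateRunDetails.builder().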

What's Next

Use spark-submit and the Java SDK in other situations.

You can use spark-submit and the Java SDK to create and run Java, Python, or SQL applications with Data Flow, and to explore the results. Data Flow handles all the details of deployment, teardown, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.
