Migrate Apache Impala Scripts to Trino on Oracle Big Data Service

Introduction

Oracle Big Data Service (BDS) is a cloud-based service provided by Oracle that enables users to create and manage Hadoop clusters, Spark clusters, Trino, and other big data services. In the world of big data analytics, Trino stands out as a high-performance, distributed SQL query engine designed for running interactive analytic queries on large datasets. Trino can query data across multiple sources, including the Hadoop Distributed File System (HDFS), Oracle Cloud Infrastructure (OCI) Object Storage buckets, and traditional relational databases, all through a unified SQL interface. Its ability to handle large-scale data with low latency makes it a powerful tool for data engineers and analysts alike.

With the shift to modern data platforms, many organizations are migrating from Apache Impala to Trino on Oracle Big Data Service. This tutorial provides a step-by-step approach: an automated script performs the bulk of the conversion, followed by the manual adjustments needed for a seamless migration to Trino.

Objectives

  • Convert Apache Impala scripts into Trino-compatible scripts on Oracle Big Data Service using an automated conversion script, and apply the manual adjustments that automation cannot cover.

Prerequisites

Task 1: Automate Script Conversion

The automated script is designed to streamline the process of converting Apache Impala scripts to be compatible with Trino on Oracle Big Data Service. This script automates several key transformations, reducing the amount of manual effort required and minimizing the potential for human error during the migration process. Here is a breakdown of its functionality.

  1. File processing and initialization.

    • The script begins by reading an input .hql file (which is typically used for Apache Impala queries).

    • It creates a new output file where the converted, Trino-compatible query will be saved.

    • The script then prepends a line that sets the catalog and schema context, using USE catalog_name.schema_name, ensuring the queries run in the correct context in Trino.

  2. Automated query transformations.

    • Overwrite Table Handling: The script looks for any OVERWRITE TABLE statements in the Apache Impala script and inserts corresponding DELETE FROM statements just above them. This ensures that the table is cleared before inserting new data, aligning with how Trino handles data overwrites.

    • Replacement of Impala-Specific Functions: The script applies several transformations to replace Apache Impala-specific syntax and functions with Trino-compatible ones. For example:

      • invalidate metadata and REFRESH commands are commented out, as they are not needed in Trino.

      • Data type changes, such as string to varchar, are made to match Trino’s data types.

      • Impala’s TRUNCATE TABLE statements are replaced with DELETE FROM since Trino handles truncation differently.

      • COMPUTE STATS is replaced with ANALYZE, reflecting how Trino collects statistics.

      • Other transformations adjust functions such as from_utc_timestamp and now() to Trino’s equivalents.

      • Converting nvl and ifnull functions to equivalent COALESCE expressions.

      • Modifying regular expression matching (REGEXP) to Trino’s REGEXP_LIKE.

      • Adjusting date and timestamp functions to align with Trino’s syntax.
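
    The combined effect of these transformations can be seen on a small, hypothetical Impala fragment (the table and column names below are invented for illustration):

      invalidate metadata sales_db.orders_summary;
      INSERT OVERWRITE TABLE sales_db.orders_summary
      SELECT region, nvl(amount, 0)
      FROM sales_db.orders;

    After conversion, the script would produce:

      use catalog_name.schema_name;
      --invalidate metadata sales_db.orders_summary;
      DELETE FROM sales_db.orders_summary;
      INSERT INTO sales_db.orders_summary
      SELECT region, COALESCE(amount, 0)
      FROM sales_db.orders;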

    Script:

    #!/bin/bash
    # Function to apply transformations to an HQL file
    process_file() {
       local input_file="$1"
       local output_file="${input_file%.hql}_modified.hql"
       local diff_file="${input_file%.hql}_diff.sql"
    
       # Create the output file with the initial modification
       {
          echo "use catalog_name.schema_name;"
          cat "$input_file"
       } > "$output_file"
    
        # Insert DELETE FROM statements above OVERWRITE TABLE statements
        # (requires GNU awk for IGNORECASE and the three-argument match())
        awk '
        BEGIN { IGNORECASE = 1 }
        {
           if (match($0, /OVERWRITE[[:space:]]+TABLE[[:space:]]+([^[:space:]]+)/, arr)) {
              print "DELETE FROM " arr[1] ";"
           }
           else if (match($0, /OVERWRITE[[:space:]]+([^[:space:]]+)/, arr)) {
              print "DELETE FROM " arr[1] ";"
           }
           print
        }' "$output_file" > temp_file && mv temp_file "$output_file"
    
        # Define replacements as an array of "search replace" pairs
        # (each entry is split on whitespace below, so both parts must be single tokens)
        declare -a replacements=(
           "old_line1 new_line1"
           "old_line2 new_line2"
           # Add more replacements as needed
        )
    
       # Apply replacements
       for replacement in "${replacements[@]}"; do
          search_text=$(echo "$replacement" | awk '{print $1}')
          replace_text=$(echo "$replacement" | awk '{print $2}')
          sed -i "s/${search_text}/${replace_text}/Ig" "$output_file"
       done
    
       # Handle specific replacements
       sed -i 's/invalidate metadata/--invalidate metadata/Ig' "$output_file"
       sed -i 's/REFRESH/--REFRESH/Ig' "$output_file"
       sed -i 's/ string/ varchar/Ig' "$output_file"
       sed -i 's/TRUNCATE[[:space:]]\+TABLE[[:space:]]*/DELETE FROM /Ig' "$output_file"
       sed -i 's/TRUNCATE /DELETE FROM /Ig' "$output_file"
       sed -i 's/OVERWRITE[[:space:]]\+TABLE[[:space:]]*/INTO /Ig' "$output_file"
       sed -i 's/COMPUTE[[:space:]]\+stats/ANALYZE/Ig' "$output_file"
       sed -i 's/INSERT[[:space:]]\+INTO[[:space:]]\+TABLE/INSERT INTO/Ig' "$output_file"
       sed -i 's/from_utc_timestamp(now(),/current_timestamp AT TIME ZONE /Ig' "$output_file"
       sed -i "s/'America\/New_York')/'America\/New_York'/Ig" "$output_file"
       sed -i 's|/\* +NOSHUFFLE \*/| |Ig' "$output_file"
       sed -i 's/into[[:space:]]\+table/into/Ig' "$output_file"
       sed -i 's/STRAIGHT_JOIN/ /Ig' "$output_file"
       sed -i 's/OVERWRITE[[:space:]]\+/INTO /Ig' "$output_file"
       sed -i 's/TABLE[[:space:]]\+INTO/INTO/Ig' "$output_file"
       sed -i 's/SET[[:space:]]\+DECIMAL_V2/--SET DECIMAL_V2/Ig' "$output_file"
       sed -i 's/nvl(/COALESCE(/Ig' "$output_file"
       sed -i 's/ifnull(/COALESCE(/Ig' "$output_file"
       sed -i 's/CREATE[[:space:]]\+EXTERNAL/CREATE/Ig' "$output_file"
       sed -i -E 's/TRUNC[[:space:]]*\(/TRUNCATE(/Ig' "$output_file"
       sed -i -E 's/CAST[[:space:]]*\(/TRY_CAST(/Ig' "$output_file"
       sed -i -E 's/instr[[:space:]]*\(/strpos(/Ig' "$output_file"
       sed -i 's/""[[:space:]]\+as/null as/Ig' "$output_file"
       sed -i -E 's/\$\{var\:([^}]*)\}/\$\{\1\}/Ig' "$output_file"
       sed -i -E 's/([a-zA-Z_][a-zA-Z0-9_]*)[[:space:]]+REGEXP[[:space:]]+('\''[^'\'']*'\'')/REGEXP_LIKE(\1, \2)/Ig' "$output_file"
       sed -i -E 's/lower\(([a-zA-Z_][a-zA-Z0-9_]*\))[[:space:]]+REGEXP[[:space:]]+('\''[^'\'']*'\'')/REGEXP_LIKE(lower(\1, \2)/Ig' "$output_file"
       sed -i "s/)\s*STORED\s*AS\s*PARQUET\s*LOCATION\s*'\([^']*\)';/) WITH (\nexternal_location = '\1',\nformat = 'PARQUET'\n);/Ig" "$output_file"    
       sed -i -E "s/try_cast\(extract\(([a-zA-Z0-9_.]+), 'MONTH'\) AS varchar\)/try_cast(extract(MONTH from try_cast(\1 AS timestamp)) as VARCHAR)/Ig" "$output_file"
       sed -i -E "s/TRY_CAST\(extract\(add_months\(([^,]+),([^,]+)\),[[:space:]]*'MONTH'\)[[:space:]]*AS[[:space:]]*varchar\)/TRY_CAST\(extract\(MONTH from date_add\('month',\2,try_cast\(\1 as timestamp\)\)\) AS varchar\)/Ig" "$output_file"
       sed -Ei "s/try_cast\(ADD_MONTHS\(([^,]+),\s*-?\s*([0-9]+)\) AS varchar\)/try_cast\(date_add\('month', -\2, try_cast(\1 as timestamp)\) AS varchar\)/Ig" "$output_file"
        # NOTE: Impala's datediff(end, start) and Trino's date_diff('day', start, end)
        # take their arguments in opposite order, so review converted calls manually.
        sed -i -E 's/DATEDIFF\(/date_diff\('\''day'\'', /Ig' "$output_file"
       sed -i -E 's/TO_DATE\(/DATE\(/Ig' "$output_file"
       sed -i -E 's/now\(\)/current_timestamp/Ig' "$output_file"
    
       # Generate the diff file
       diff -u "$input_file" "$output_file" > "$diff_file"
    
       echo "Processed $input_file -> $output_file. Diff file created: $diff_file"
    }
    
    # Check arguments
    if [ "$#" -ne 1 ]; then
       echo "Usage: $0 <file_or_directory>"
       exit 1
    fi
    
    TARGET="$1"
    
    # Check if the target is a directory
    if [ -d "$TARGET" ]; then
       for file in "$TARGET"/*.hql; do
          [ -e "$file" ] || continue # Skip if no HQL files found
          process_file "$file"
       done
    elif [ -f "$TARGET" ]; then
       process_file "$TARGET"
    else
       echo "Invalid input: $TARGET is neither a file nor a directory."
       exit 1
    fi
    

     Note: This script provides a foundational approach for converting Apache Impala scripts to Trino, but it does not cover all possible Impala statements and scenarios. If you encounter additional Impala-specific syntax or functions not handled by this script, you can extend the script by adding new transformation rules to cover those cases. Regularly updating the script with these modifications will improve its effectiveness for future conversions.
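
     For example, to also map Impala's isnull function (which, like nvl, returns its second argument when the first is NULL) to COALESCE, you could append one more rule in the same style; this rule is an illustrative extension, not part of the script above:

        # Hypothetical extension: Impala's isnull(a, b) behaves like COALESCE(a, b)
        sed -i 's/isnull(/COALESCE(/Ig' "$output_file"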

Task 2: Run the Conversion Script

  1. Save the script to the Oracle Big Data Service edge node.

  2. Run the script with the target file or directory containing Apache Impala scripts (see the example session after this list).

  3. Review the generated output and difference files for changes.
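
    For example, assuming the script was saved as convert_impala_to_trino.sh (the file name is arbitrary), a typical session looks like this:

      # Make the conversion script executable
      chmod +x convert_impala_to_trino.sh

      # Convert a single Impala script
      ./convert_impala_to_trino.sh /path/to/daily_load.hql

      # Or convert every .hql file in a directory
      ./convert_impala_to_trino.sh /path/to/impala_scripts

      # Review the converted script and the generated diff
      cat /path/to/daily_load_modified.hql
      cat /path/to/daily_load_diff.sql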

Task 3: Make Manual Adjustments

  1. Rewrite LEFT ANTI JOIN statements.

    Certain Apache Impala SQL constructs, such as LEFT ANTI JOIN, require manual intervention. Identify LEFT ANTI JOIN statements in your Impala script and replace them with equivalent Trino SQL using a LEFT JOIN with a WHERE clause.

    Example Scripts:

    • Impala Script:

      SELECT t1.*
       FROM table1 t1
       LEFT ANTI JOIN table2 t2 ON t1.key = t2.key;
      
    • Trino Script:

      SELECT t1.*
       FROM table1 t1
       LEFT JOIN table2 t2 ON t1.key = t2.key
       WHERE t2.key IS NULL;
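
    Alternatively, the same anti-join semantics can be expressed with NOT EXISTS, which Trino also supports; either rewrite is valid:

      SELECT t1.*
       FROM table1 t1
       WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.key = t1.key);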
      
  2. Adjust GROUP BY clauses.

    Trino allows position-based grouping in the GROUP BY clause, which can simplify long groupings. Where necessary, convert column names to their positional equivalents, that is, the position of the corresponding expression in the SELECT list. Ensure all columns are correctly mapped.

    Example Scripts:

    • Impala Script:

      GROUP BY sales_data.region,
          sales_data.product_name,
          sales_data.transaction_id,
          transaction_date,
          last_updated,
          inventory.stock_quantity,
          sales_amount,
          discount_status,
          inventory_cost,
          total_revenue,
          revenue_status,
          reporting_period,
          report_generated_date;
      
    • Trino Script:

      GROUP BY sales_data.region,
          sales_data.product_name,
          sales_data.transaction_id,
          transaction_date,
          last_updated,
          inventory.stock_quantity,
          7,  -- sales_amount
          8,  -- discount_status
          9,  -- inventory_cost
          10, -- total_revenue
          11, -- revenue_status
          12, -- reporting_period
          13; -- report_generated_date
      

Troubleshooting and Tips

Next Steps

After completing the migration of Apache Impala scripts to Trino on Oracle Big Data Service, it is important to validate the converted scripts thoroughly. Start by running the converted Trino scripts in a test environment to ensure they produce the expected results. Pay close attention to performance differences, especially in complex queries involving joins or aggregations, as Trino’s execution engine might behave differently compared to Impala.

Additionally, consider optimizing the new Trino queries by analyzing the query plans and making necessary adjustments. Finally, document any specific challenges or lessons learned during the migration process, as this will be valuable for future migrations.
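
For example, Trino's EXPLAIN statement displays the query plan, and ANALYZE (supported by connectors such as Hive) collects table statistics for the cost-based optimizer; the table and column names below are placeholders:

  -- Inspect the distributed plan for a converted query
  EXPLAIN (TYPE DISTRIBUTED)
  SELECT region, COALESCE(amount, 0) FROM sales_db.orders;

  -- Refresh table statistics so the optimizer has accurate estimates
  ANALYZE sales_db.orders;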

Acknowledgements

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.