Migrate Apache Impala Scripts to Trino on Oracle Big Data Service

Introduction

Oracle Big Data Service (BDS) is a cloud-based service provided by Oracle that enables users to create and manage Hadoop clusters, Spark clusters, Trino, and other big data services. In the world of big data analytics, Trino stands out as a high-performance, distributed SQL query engine designed for running interactive analytic queries on large datasets. Trino can query data across multiple sources, including the Hadoop Distributed File System (HDFS), Oracle Cloud Infrastructure (OCI) Object Storage buckets, and traditional relational databases, all through a unified SQL interface. Its ability to handle large-scale data with low latency makes it a powerful tool for data engineers and analysts alike.

With the shift to modern data platforms, many organizations are migrating from Apache Impala to Trino on Oracle Big Data Service. This tutorial provides a step-by-step approach: an automated script performs the bulk of the conversion, followed by the manual adjustments needed for a seamless migration to Trino.

Objectives

  • Convert Apache Impala scripts into Trino-compatible scripts on Oracle Big Data Service using an automated conversion script, and apply the manual adjustments that automation cannot cover.

Prerequisites

Task 1: Automate Script Conversion

The automated script is designed to streamline the process of converting Apache Impala scripts to be compatible with Trino on Oracle Big Data Service. This script automates several key transformations, reducing the amount of manual effort required and minimizing the potential for human error during the migration process. Here is a breakdown of its functionality.

  1. File processing and initialization.

    • The script begins by reading an input .hql file (which is typically used for Apache Impala queries).

    • It creates a new output file where the converted, Trino-compatible query will be saved.

    • The script then prepends a line that sets the catalog and schema context, using USE catalog_name.schema_name, ensuring the queries run in the correct context in Trino.

  2. Automated query transformations.

    • Overwrite Table Handling: The script looks for any OVERWRITE TABLE statements in the Apache Impala script and inserts corresponding DELETE FROM statements just above them. This ensures that the table is cleared before inserting new data, aligning with how Trino handles data overwrites.

    • Replacement of Impala-Specific Functions: The script applies several transformations to replace Apache Impala-specific syntax and functions with Trino-compatible ones. For example:

      • invalidate metadata and REFRESH commands are commented out, as they are not needed in Trino.

      • Data type changes, such as string to varchar, are made to match Trino’s data types.

      • Impala’s TRUNCATE TABLE statements are replaced with DELETE FROM since Trino handles truncation differently.

      • COMPUTE STATS is replaced with ANALYZE, reflecting how Trino collects statistics.

      • Other transformations adjust functions such as from_utc_timestamp and now() to Trino’s equivalents.

      • Converting nvl and ifnull functions to equivalent COALESCE expressions.

      • Modifying regular expression matching (REGEXP) to Trino’s REGEXP_LIKE.

      • Adjusting date and timestamp functions to align with Trino’s syntax.
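
    The combined effect of these transformations can be seen on a small, hypothetical Impala fragment (the table and column names below are invented for illustration):

      invalidate metadata sales_db.orders_summary;
      INSERT OVERWRITE TABLE sales_db.orders_summary
      SELECT region, nvl(amount, 0)
      FROM sales_db.orders;

    After conversion, the script would produce:

      use catalog_name.schema_name;
      --invalidate metadata sales_db.orders_summary;
      DELETE FROM sales_db.orders_summary;
      INSERT INTO sales_db.orders_summary
      SELECT region, COALESCE(amount, 0)
      FROM sales_db.orders;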

    Script:

    #!/bin/bash
    # Function to apply transformations to an HQL file
    process_file() {
       local input_file="$1"
       local output_file="${input_file%.hql}_modified.hql"
       local diff_file="${input_file%.hql}_diff.sql"
    
       # Create the output file with the initial modification
       {
          echo "use catalog_name.schema_name;"
          cat "$input_file"
       } > "$output_file"
    
        # Insert DELETE FROM statements above OVERWRITE TABLE statements
        # (requires GNU awk for IGNORECASE and the three-argument match())
        awk '
        BEGIN { IGNORECASE = 1 }
        {
           if (match($0, /OVERWRITE[[:space:]]+TABLE[[:space:]]+([^[:space:]]+)/, arr)) {
              print "DELETE FROM " arr[1] ";"
           }
           else if (match($0, /OVERWRITE[[:space:]]+([^[:space:]]+)/, arr)) {
              print "DELETE FROM " arr[1] ";"
           }
           print
        }' "$output_file" > temp_file && mv temp_file "$output_file"
    
        # Define replacements as an array of "search replace" pairs
        # (each entry is split on whitespace below, so both parts must be single tokens)
        declare -a replacements=(
           "old_line1 new_line1"
           "old_line2 new_line2"
           # Add more replacements as needed
        )
    
       # Apply replacements
       for replacement in "${replacements[@]}"; do
          search_text=$(echo "$replacement" | awk '{print $1}')
          replace_text=$(echo "$replacement" | awk '{print $2}')
          sed -i "s/${search_text}/${replace_text}/Ig" "$output_file"
       done
    
       # Handle specific replacements
       sed -i 's/invalidate metadata/--invalidate metadata/Ig' "$output_file"
       sed -i 's/REFRESH/--REFRESH/Ig' "$output_file"
       sed -i 's/ string/ varchar/Ig' "$output_file"
       sed -i 's/TRUNCATE[[:space:]]\+TABLE[[:space:]]*/DELETE FROM /Ig' "$output_file"
       sed -i 's/TRUNCATE /DELETE FROM /Ig' "$output_file"
       sed -i 's/OVERWRITE[[:space:]]\+TABLE[[:space:]]*/INTO /Ig' "$output_file"
       sed -i 's/COMPUTE[[:space:]]\+stats/ANALYZE/Ig' "$output_file"
       sed -i 's/INSERT[[:space:]]\+INTO[[:space:]]\+TABLE/INSERT INTO/Ig' "$output_file"
       sed -i 's/from_utc_timestamp(now(),/current_timestamp AT TIME ZONE /Ig' "$output_file"
       sed -i "s/'America\/New_York')/'America\/New_York'/Ig" "$output_file"
       sed -i 's|/\* +NOSHUFFLE \*/| |Ig' "$output_file"
       sed -i 's/into[[:space:]]\+table/into/Ig' "$output_file"
       sed -i 's/STRAIGHT_JOIN/ /Ig' "$output_file"
       sed -i 's/OVERWRITE[[:space:]]\+/INTO /Ig' "$output_file"
       sed -i 's/TABLE[[:space:]]\+INTO/INTO/Ig' "$output_file"
       sed -i 's/SET[[:space:]]\+DECIMAL_V2/--SET DECIMAL_V2/Ig' "$output_file"
       sed -i 's/nvl(/COALESCE(/Ig' "$output_file"
       sed -i 's/ifnull(/COALESCE(/Ig' "$output_file"
       sed -i 's/CREATE[[:space:]]\+EXTERNAL/CREATE/Ig' "$output_file"
       sed -i -E 's/TRUNC[[:space:]]*\(/TRUNCATE(/Ig' "$output_file"
       sed -i -E 's/CAST[[:space:]]*\(/TRY_CAST(/Ig' "$output_file"
       sed -i -E 's/instr[[:space:]]*\(/strpos(/Ig' "$output_file"
       sed -i 's/""[[:space:]]\+as/null as/Ig' "$output_file"
       sed -i -E 's/\$\{var\:([^}]*)\}/\$\{\1\}/Ig' "$output_file"
       sed -i -E 's/([a-zA-Z_][a-zA-Z0-9_]*)[[:space:]]+REGEXP[[:space:]]+('\''[^'\'']*'\'')/REGEXP_LIKE(\1, \2)/Ig' "$output_file"
       sed -i -E 's/lower\(([a-zA-Z_][a-zA-Z0-9_]*\))[[:space:]]+REGEXP[[:space:]]+('\''[^'\'']*'\'')/REGEXP_LIKE(lower(\1, \2)/Ig' "$output_file"
       sed -i "s/)\s*STORED\s*AS\s*PARQUET\s*LOCATION\s*'\([^']*\)';/) WITH (\nexternal_location = '\1',\nformat = 'PARQUET'\n);/Ig" "$output_file"    
       sed -i -E "s/try_cast\(extract\(([a-zA-Z0-9_.]+), 'MONTH'\) AS varchar\)/try_cast(extract(MONTH from try_cast(\1 AS timestamp)) as VARCHAR)/Ig" "$output_file"
       sed -i -E "s/TRY_CAST\(extract\(add_months\(([^,]+),([^,]+)\),[[:space:]]*'MONTH'\)[[:space:]]*AS[[:space:]]*varchar\)/TRY_CAST\(extract\(MONTH from date_add\('month',\2,try_cast\(\1 as timestamp\)\)\) AS varchar\)/Ig" "$output_file"
       sed -Ei "s/try_cast\(ADD_MONTHS\(([^,]+),\s*-?\s*([0-9]+)\) AS varchar\)/try_cast\(date_add\('month', -\2, try_cast(\1 as timestamp)\) AS varchar\)/Ig" "$output_file"
        # NOTE: Impala's datediff(end, start) and Trino's date_diff('day', start, end)
        # take their arguments in opposite order, so review converted calls manually.
        sed -i -E 's/DATEDIFF\(/date_diff\('\''day'\'', /Ig' "$output_file"
       sed -i -E 's/TO_DATE\(/DATE\(/Ig' "$output_file"
       sed -i -E 's/now\(\)/current_timestamp/Ig' "$output_file"
    
       # Generate the diff file
       diff -u "$input_file" "$output_file" > "$diff_file"
    
       echo "Processed $input_file -> $output_file. Diff file created: $diff_file"
    }
    
    # Check arguments
    if [ "$#" -ne 1 ]; then
       echo "Usage: $0 <file_or_directory>"
       exit 1
    fi
    
    TARGET="$1"
    
    # Check if the target is a directory
    if [ -d "$TARGET" ]; then
       for file in "$TARGET"/*.hql; do
          [ -e "$file" ] || continue # Skip if no HQL files found
          process_file "$file"
       done
    elif [ -f "$TARGET" ]; then
       process_file "$TARGET"
    else
       echo "Invalid input: $TARGET is neither a file nor a directory."
       exit 1
    fi
    

     Note: This script provides a foundational approach for converting Apache Impala scripts to Trino, but it does not cover all possible Impala statements and scenarios. If you encounter additional Impala-specific syntax or functions not handled by this script, you can extend the script by adding new transformation rules to cover those cases. Regularly updating the script with these modifications will improve its effectiveness for future conversions.
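
     For example, to also map Impala's isnull function (which, like nvl, returns its second argument when the first is NULL) to COALESCE, you could append one more rule in the same style; this rule is an illustrative extension, not part of the script above:

        # Hypothetical extension: Impala's isnull(a, b) behaves like COALESCE(a, b)
        sed -i 's/isnull(/COALESCE(/Ig' "$output_file"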

Task 2: Run the Conversion Script

  1. Save the script to the Oracle Big Data Service edge node.

  2. Run the script with the target file or directory containing Apache Impala scripts (see the example session after this list).

  3. Review the generated output and difference files for changes.
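
    For example, assuming the script was saved as convert_impala_to_trino.sh (the file name is arbitrary), a typical session looks like this:

      # Make the conversion script executable
      chmod +x convert_impala_to_trino.sh

      # Convert a single Impala script
      ./convert_impala_to_trino.sh /path/to/daily_load.hql

      # Or convert every .hql file in a directory
      ./convert_impala_to_trino.sh /path/to/impala_scripts

      # Review the converted script and the generated diff
      cat /path/to/daily_load_modified.hql
      cat /path/to/daily_load_diff.sql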

Task 3: Make Manual Adjustments

  1. Rewrite LEFT ANTI JOIN statements.

    Certain Apache Impala SQL constructs, such as LEFT ANTI JOIN, require manual intervention. Identify LEFT ANTI JOIN statements in your Impala script and replace them with equivalent Trino SQL using a LEFT JOIN with a WHERE clause.

    Example Scripts:

    • Impala Script:

      SELECT t1.*
       FROM table1 t1
       LEFT ANTI JOIN table2 t2 ON t1.key = t2.key;
      
    • Trino Script:

      SELECT t1.*
       FROM table1 t1
       LEFT JOIN table2 t2 ON t1.key = t2.key
       WHERE t2.key IS NULL;
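
    Alternatively, the same anti-join semantics can be expressed with NOT EXISTS, which Trino also supports; either rewrite is valid:

      SELECT t1.*
       FROM table1 t1
       WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.key = t1.key);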
      
  2. Adjust GROUP BY clauses.

    Trino allows position-based grouping in the GROUP BY clause, which can simplify long groupings. Where necessary, convert column names to their positional equivalents, that is, the position of the corresponding expression in the SELECT list. Ensure all columns are correctly mapped.

    Example Scripts:

    • Impala Script:

      GROUP BY sales_data.region,
          sales_data.product_name,
          sales_data.transaction_id,
          transaction_date,
          last_updated,
          inventory.stock_quantity,
          sales_amount,
          discount_status,
          inventory_cost,
          total_revenue,
          revenue_status,
          reporting_period,
          report_generated_date;
      
    • Trino Script:

      GROUP BY sales_data.region,
          sales_data.product_name,
          sales_data.transaction_id,
          transaction_date,
          last_updated,
          inventory.stock_quantity,
          7,  -- sales_amount
          8,  -- discount_status
          9,  -- inventory_cost
          10, -- total_revenue
          11, -- revenue_status
          12, -- reporting_period
          13; -- report_generated_date
      

Troubleshooting and Tips

Next Steps

After completing the migration of Apache Impala scripts to Trino on Oracle Big Data Service, it is important to validate the converted scripts thoroughly. Start by running the converted Trino scripts in a test environment to ensure they produce the expected results. Pay close attention to performance differences, especially in complex queries involving joins or aggregations, as Trino’s execution engine might behave differently compared to Impala.

Additionally, consider optimizing the new Trino queries by analyzing the query plans and making necessary adjustments. Finally, document any specific challenges or lessons learned during the migration process, as this will be valuable for future migrations.
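
For example, Trino's EXPLAIN statement displays the query plan, and ANALYZE (supported by connectors such as Hive) collects table statistics for the cost-based optimizer; the table and column names below are placeholders:

  -- Inspect the distributed plan for a converted query
  EXPLAIN (TYPE DISTRIBUTED)
  SELECT region, COALESCE(amount, 0) FROM sales_db.orders;

  -- Refresh table statistics so the optimizer has accurate estimates
  ANALYZE sales_db.orders;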

Acknowledgements

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.