Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Migrate Apache Impala Scripts to Trino on Oracle Big Data Service
Introduction
Oracle Big Data Service (BDS) is a cloud-based service provided by Oracle that enables users to create and manage Hadoop clusters, Spark clusters, Trino, and other big data services. In the world of big data analytics, Trino stands out as a high-performance, distributed SQL query engine designed for running interactive analytic queries on large datasets. Trino enables querying data across multiple sources, including Hadoop Distributed File System (HDFS), Oracle Cloud Infrastructure (OCI) Object Storage buckets, and traditional relational databases, all within a unified SQL interface. Its ability to handle large-scale data with low latency makes it a powerful tool for data engineers and analysts alike.
With the shift to modern data platforms, many organizations are migrating from Apache Impala to Trino on Oracle Big Data Service. This tutorial provides a step-by-step approach to automating the conversion of Apache Impala scripts and making the manual adjustments required for a seamless migration to Trino.
Objectives
- Automate the conversion of Apache Impala scripts to Trino-compatible SQL.
- Highlight manual changes required for specific scenarios.
- Provide troubleshooting tips for common migration challenges.
Prerequisites
- Access to Oracle Big Data Service with Trino installed.
- Basic knowledge of Apache Impala and Trino SQL.
- Sufficient permissions to run scripts and modify configurations.
Task 1: Automate Script Conversion
The automated script is designed to streamline the process of converting Apache Impala scripts to be compatible with Trino on Oracle Big Data Service. This script automates several key transformations, reducing the amount of manual effort required and minimizing the potential for human error during the migration process. Here is a breakdown of its functionality.
- File processing and initialization:
  - The script begins by reading an input .hql file (the file type typically used for Apache Impala queries).
  - It creates a new output file where the converted Trino-compatible query will be saved.
  - The script adds a line at the top to set the catalog and database context, using USE catalog_name.schema_name, ensuring the queries run in the correct context in Trino.
- Automated query transformations:
  - Overwrite table handling: The script looks for any OVERWRITE TABLE statements in the Apache Impala script and inserts corresponding DELETE FROM statements just above them. This ensures that the table is cleared before inserting new data, aligning with how Trino handles data overwrites.
  - Replacement of Impala-specific functions: The script applies several transformations to replace Apache Impala-specific syntax and functions with Trino-compatible ones. For example:
    - invalidate metadata and REFRESH commands are commented out, as they are not needed in Trino.
    - Data type changes such as string to varchar are made to match Trino’s data types.
    - Impala’s TRUNCATE TABLE statements are replaced with DELETE FROM, since Trino handles truncation differently.
    - COMPUTE STATS is replaced with ANALYZE, reflecting how Trino collects statistics.
    - Other specific transformations include handling functions like nvl, ifnull, and from_utc_timestamp, adjusting them to Trino’s syntax and functions.
    - nvl2 functions are converted to equivalent COALESCE expressions.
    - Regular expression matching (REGEXP) is modified to Trino’s REGEXP_LIKE.
    - Date and timestamp functions are adjusted to align with Trino’s syntax.
Script:
#!/bin/bash

# Function to apply transformations to an HQL file
process_file() {
    local input_file="$1"
    local output_file="${input_file%.hql}_modified.hql"
    local diff_file="${input_file%.hql}_diff.sql"

    # Create the output file with the initial modification
    {
        echo "use catalog_name.schema_name;"
        cat "$input_file"
    } > "$output_file"

    # Insert DELETE FROM statements above OVERWRITE TABLE statements
    awk '
    BEGIN { IGNORECASE = 1 }
    {
        if (match($0, /OVERWRITE[[:space:]]+TABLE[[:space:]]+([^[:space:]]+)/, arr)) {
            print "DELETE FROM " arr[1] ";"
        } else if (match($0, /OVERWRITE[[:space:]]+([^[:space:]]+)/, arr)) {
            print "DELETE FROM " arr[1] ";"
        }
        print
    }' "$output_file" > temp_file && mv temp_file "$output_file"

    # Define replacements as array of tuples (search, replace)
    declare -a replacements=(
        "old_line1 new_line1"
        "old_line2 new_line2"
        # Add more replacements as needed
    )

    # Apply replacements
    for replacement in "${replacements[@]}"; do
        search_text=$(echo "$replacement" | awk '{print $1}')
        replace_text=$(echo "$replacement" | awk '{print $2}')
        sed -i "s/${search_text}/${replace_text}/Ig" "$output_file"
    done

    # Handle specific replacements
    sed -i 's/invalidate metadata/--invalidate metadata/Ig' "$output_file"
    sed -i 's/REFRESH/--REFRESH/Ig' "$output_file"
    sed -i 's/ string/ varchar/Ig' "$output_file"
    sed -i 's/TRUNCATE[[:space:]]\+TABLE[[:space:]]*/DELETE FROM /Ig' "$output_file"
    sed -i 's/TRUNCATE /DELETE FROM /Ig' "$output_file"
    sed -i 's/OVERWRITE[[:space:]]\+TABLE[[:space:]]*/INTO /Ig' "$output_file"
    sed -i 's/COMPUTE[[:space:]]\+stats/ANALYZE/Ig' "$output_file"
    sed -i 's/INSERT[[:space:]]\+INTO[[:space:]]\+TABLE/INSERT INTO/Ig' "$output_file"
    sed -i 's/from_utc_timestamp(now(),/current_timestamp AT TIME ZONE /Ig' "$output_file"
    sed -i "s/'America\/New_York')/'America\/New_York'/Ig" "$output_file"
    sed -i 's|/\* +NOSHUFFLE \*/| |Ig' "$output_file"
    sed -i 's/into[[:space:]]\+table/into/Ig' "$output_file"
    sed -i 's/STRAIGHT_JOIN/ /Ig' "$output_file"
    sed -i 's/OVERWRITE[[:space:]]\+/INTO /Ig' "$output_file"
    sed -i 's/TABLE[[:space:]]\+INTO/INTO/Ig' "$output_file"
    sed -i 's/SET[[:space:]]\+DECIMAL_V2/--SET DECIMAL_V2/Ig' "$output_file"
    sed -i 's/nvl(/COALESCE(/Ig' "$output_file"
    sed -i 's/ifnull(/COALESCE(/Ig' "$output_file"
    sed -i 's/CREATE[[:space:]]\+EXTERNAL/CREATE/Ig' "$output_file"
    sed -i -E 's/TRUNC[[:space:]]*\(/TRUNCATE(/Ig' "$output_file"
    sed -i -E 's/CAST[[:space:]]*\(/TRY_CAST(/Ig' "$output_file"
    sed -i -E 's/instr[[:space:]]*\(/strpos(/Ig' "$output_file"
    sed -i 's/""[[:space:]]\+as/null as/Ig' "$output_file"
    sed -i -E 's/\$\{var\:([^}]*)\}/\$\{\1\}/Ig' "$output_file"
    sed -i -E 's/([a-zA-Z_][a-zA-Z0-9_]*)[[:space:]]+REGEXP[[:space:]]+('\''[^'\'']*'\'')/REGEXP_LIKE(\1, \2)/Ig' "$output_file"
    sed -i -E 's/lower\(([a-zA-Z_][a-zA-Z0-9_]*\))[[:space:]]+REGEXP[[:space:]]+('\''[^'\'']*'\'')/REGEXP_LIKE(lower(\1, \2)/Ig' "$output_file"
    sed -i "s/)\s*STORED\s*AS\s*PARQUET\s*LOCATION\s*'\([^']*\)';/) WITH (\nexternal_location = '\1',\nformat = 'PARQUET'\n);/Ig" "$output_file"
    sed -i -E "s/try_cast\(extract\(([a-zA-Z0-9_.]+), 'MONTH'\) AS varchar\)/try_cast(extract(MONTH from try_cast(\1 AS timestamp)) as VARCHAR)/Ig" "$output_file"
    sed -i -E "s/TRY_CAST\(extract\(add_months\(([^,]+),([^,]+)\),[[:space:]]*'MONTH'\)[[:space:]]*AS[[:space:]]*varchar\)/TRY_CAST\(extract\(MONTH from date_add\('month',\2,try_cast\(\1 as timestamp\)\)\) AS varchar\)/Ig" "$output_file"
    sed -Ei "s/try_cast\(ADD_MONTHS\(([^,]+),\s*-?\s*([0-9]+)\) AS varchar\)/try_cast\(date_add\('month', -\2, try_cast(\1 as timestamp)\) AS varchar\)/Ig" "$output_file"
    sed -i -E 's/DATEDIFF\(/date_diff\('\''day'\'', /Ig' "$output_file"
    sed -i -E 's/TO_DATE\(/DATE\(/Ig' "$output_file"
    sed -i -E 's/now\(\)/current_timestamp/Ig' "$output_file"

    # Generate the diff file
    diff -u "$input_file" "$output_file" > "$diff_file"

    echo "Processed $input_file -> $output_file. Diff file created: $diff_file"
}

# Check arguments
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <file_or_directory>"
    exit 1
fi

TARGET="$1"

# Check if the target is a directory
if [ -d "$TARGET" ]; then
    for file in "$TARGET"/*.hql; do
        [ -e "$file" ] || continue # Skip if no HQL files found
        process_file "$file"
    done
elif [ -f "$TARGET" ]; then
    process_file "$TARGET"
else
    echo "Invalid input: $TARGET is neither a file nor a directory."
    exit 1
fi
Note: This script provides a foundational approach for converting Apache Impala scripts to Trino, but it does not cover all possible Impala statements and scenarios. If you encounter additional Impala-specific syntax or functions not handled by this script, you can extend it by adding new transformation rules to cover those cases; one such rule is sketched below. Regularly updating the script with these modifications will improve its effectiveness for future conversions.
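For instance, a new rule is just another sed substitution added inside process_file. The following sketch, which assumes your scripts use Impala's zeroifnull function and that its argument contains no nested parentheses, rewrites zeroifnull(expr) to Trino's COALESCE(expr, 0):

# Hypothetical extension rule: rewrite Impala's zeroifnull(expr) to COALESCE(expr, 0).
# The simple regex assumes the argument contains no nested parentheses.
sed -i -E 's/zeroifnull\(([^()]*)\)/COALESCE(\1, 0)/Ig' "$output_file"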
Task 2: Run the Conversion Script
- Save the script to the Oracle Big Data Service edge node.
- Run the script with the target file or directory containing the Apache Impala scripts.
- Review the generated output and diff files for changes; a sample run is shown after this list.
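The following is a minimal sketch of such a run. It assumes the script was saved as convert_impala_to_trino.sh (a hypothetical name) on the edge node; the sample database, table, and column names are also illustrative:

# Create a small Impala script to convert (all names are illustrative).
cat > sample.hql <<'EOF'
INVALIDATE METADATA sales_db.orders_summary;
INSERT OVERWRITE TABLE sales_db.orders_summary
SELECT region, NVL(total, 0) AS total FROM sales_db.orders;
EOF

# Run the converter on a single file (a directory of .hql files also works).
chmod +x convert_impala_to_trino.sh
./convert_impala_to_trino.sh sample.hql

# Inspect the converted script and the diff against the original. Expect a
# DELETE FROM inserted before the INSERT, OVERWRITE TABLE rewritten to INTO,
# NVL( rewritten to COALESCE(, and the INVALIDATE METADATA line commented out.
cat sample_modified.hql
cat sample_diff.sql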
Task 3: Make Manual Adjustments
- Use the LEFT ANTI JOIN statement.

  Certain Apache Impala SQL constructs, like LEFT ANTI JOIN, require manual intervention. Identify LEFT ANTI JOIN statements in your Impala script and replace them with equivalent Trino SQL using LEFT JOIN with a WHERE clause.

  Example Scripts:

  - Impala Script:

    SELECT t1.* FROM table1 t1 LEFT ANTI JOIN table2 t2 ON t1.key = t2.key;

  - Trino Script:

    SELECT t1.* FROM table1 t1 LEFT JOIN table2 t2 ON t1.key = t2.key WHERE t2.key IS NULL;
- Adjust GROUP BY clauses.

  Trino allows position-based grouping in the GROUP BY clause, which can simplify long groupings. Convert column names to their positional equivalents in the GROUP BY clause where necessary. Ensure all columns are correctly mapped.

  Example Scripts:

  - Impala Script:

    GROUP BY sales_data.region, sales_data.product_name, sales_data.transaction_id, transaction_date, last_updated, inventory.stock_quantity, sales_amount, discount_status, inventory_cost, total_revenue, revenue_status, reporting_period, report_generated_date;

  - Trino Script:

    GROUP BY
      sales_data.region,
      sales_data.product_name,
      sales_data.transaction_id,
      transaction_date,
      last_updated,
      inventory.stock_quantity,
      7,  -- sales_amount
      8,  -- discount_status
      9,  -- inventory_cost
      10, -- total_revenue
      11, -- revenue_status
      12, -- reporting_period
      13; -- report_generated_date
Troubleshooting and Tips
- Error Handling: If you encounter syntax errors after conversion, carefully review the script, especially around JOIN clauses and date functions.
- Testing: Validate the output of your queries in Trino to ensure they match the expected results from Apache Impala; a simple row-count check is sketched below.
- Documentation: Refer to the Trino documentation as needed.
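The following is a minimal sketch of such a check, assuming the impala-shell and trino command-line clients are available and that the host, port, catalog, and table names (all illustrative here) are replaced with values from your environment. Depending on client versions, the output may need trimming before comparison.

# Compare row counts for the same table from both engines (all names are illustrative).
impala_count=$(impala-shell -i impala-host:21000 -B -q 'SELECT COUNT(*) FROM sales_db.orders;')
trino_count=$(trino --server https://trino-host:8443 --catalog hive \
    --execute 'SELECT COUNT(*) FROM sales_db.orders' --output-format TSV)

echo "Impala: $impala_count, Trino: $trino_count"
if [ "$impala_count" = "$trino_count" ]; then
    echo "Row counts match."
else
    echo "Row counts differ; investigate the converted query."
fi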
Next Steps
After completing the migration of Apache Impala scripts to Trino on Oracle Big Data Service, it is important to validate the converted scripts thoroughly. Start by running the converted Trino scripts in a test environment to ensure they produce the expected results. Pay close attention to performance differences, especially in complex queries involving joins or aggregations, as Trino’s execution engine might behave differently compared to Impala.
Additionally, consider optimizing the new Trino queries by analyzing the query plans and making necessary adjustments. Finally, document any specific challenges or lessons learned during the migration process, as this will be valuable for future migrations.
Acknowledgements
- Authors - Pavan Upadhyay (Principal Cloud Engineer), Saket Bihari (Principal Cloud Engineer)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Migrate Apache Impala Scripts to Trino on Oracle Big Data Service
G13919-01
September 2024