The image shows a multimodal AI model. Data flows through the model's components as follows:

A user sends audio/video to the Speech to Text component in the model. The Speech to Text component passes text to the LLM on OCI.

The LLM on OCI passes text to the Text to Speech component, which then returns audio/video to the user.
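
The flow described above can be sketched as a three-stage pipeline. The following is a minimal illustration only, not an OCI API: the class name `MultimodalPipeline` and the three placeholder functions are hypothetical stand-ins for the Speech to Text, LLM, and Text to Speech components, and a real deployment would replace them with calls to the actual services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MultimodalPipeline:
    """Hypothetical wrapper that chains the three stages in the diagram."""

    speech_to_text: Callable[[bytes], str]   # audio/video in, text out
    llm: Callable[[str], str]                # text in, text out
    text_to_speech: Callable[[str], bytes]   # text in, audio/video out

    def respond(self, media: bytes) -> bytes:
        transcript = self.speech_to_text(media)  # user media -> text
        reply = self.llm(transcript)             # text -> LLM reply text
        return self.text_to_speech(reply)        # reply text -> media for the user


# Placeholder stages (assumptions for illustration; no real service is called).
def fake_speech_to_text(media: bytes) -> str:
    # Pretend the media bytes are already UTF-8 text.
    return media.decode("utf-8")


def fake_llm(prompt: str) -> str:
    # Echo the prompt in place of a real model response.
    return f"You said: {prompt}"


def fake_text_to_speech(text: str) -> bytes:
    # Pretend encoded text is the synthesized audio.
    return text.encode("utf-8")


pipeline = MultimodalPipeline(fake_speech_to_text, fake_llm, fake_text_to_speech)
print(pipeline.respond(b"hello"))  # b'You said: hello'
```

Passing each stage in as a callable keeps the data flow visible in `respond` and lets any one component be swapped without changing the others.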