Oracle Contextual Intelligence

Oracle Contextual Intelligence (Context) analyzes content at the massive scale and speeds required by automated advertising technology to determine the context and the central meaning of web content where ads typically appear. (Oracle Contextual Intelligence was previously called Grapeshot.)

You can use Context to include or exclude content from consideration for your automated advertising buys, based on how appropriate the content is for the brand being advertised.

Context is part of Oracle Data Cloud, which also includes Oracle Measurement — providing real-time analytics, fraud detection, and accreditation of more than 50 metrics for ad-impression viewability — and a proprietary DMP (data management platform) that has unparalleled access to industry data.

Contextual Intelligence Product Line

Primary users

Context users include:

Language support

Context is available for 31 of the most-used languages in the world. It can distinguish among important variations of languages, such as between British and American English.

How Context works

Context has two main processes: crawling and contextually analyzing pages, then matching them against carefully curated sets of words and phrases. These sets are called contextual segments.

Note: Context works in the pre-bid environment: it notifies ad systems to include or exclude pages before an advertising bid has been placed. This approach contrasts with systems that block an ad from appearing after a bid has been placed and won for a spot on a page. With pre-bid technology, advertisers never bid on excluded pages, so they are not billed for those placements.

Crawling and analyzing pages

Context crawls and contextually analyzes text in response to requests. The technology and architecture support loads exceeding 3 million queries per second (QPS), a level found in massive programmatic advertising implementations, and also offer sub-millisecond response times.

The request for a page at a given URL can originate from a number of sources but typically comes from a partner's technology platform or server that handles ad serving, measurement, and/or verification. Pages are re-crawled to check for changes at a rate that depends on how often the epicenter (the core textual content) of the page at that URL tends to change.

Some web pages are identical to one another but have subtly different URLs, for example because web-analytics software appends parameters toward the end of the URL. To account for this, such parameters are stripped out.
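The following is a minimal sketch of this kind of URL normalization, not Context's actual implementation. The list of parameters that Context strips is not stated here, so the names TRACKING_PREFIXES, TRACKING_NAMES, and normalize_url, and the specific parameters chosen, are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: illustrative web-analytics parameters only, not Context's real list.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}

def normalize_url(url: str) -> str:
    """Return the URL with the assumed web-analytics parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

tagged = normalize_url("https://example.com/story?id=7&utm_source=news&utm_medium=email")
plain = normalize_url("https://example.com/story?id=7")
print(tagged == plain)  # both URLs now refer to the same page record
```

With a normalization step like this, two requests that differ only in tracking parameters resolve to a single stored page rather than two.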

Sites that our systems attempt to crawl can exclude or block the crawlers by various means, such as by using a robots.txt file or by specifically excluding Context's crawler. If we are unable to crawl a page, we indicate this to partners or customers via a signal prefixed "gx_".

Concentrating on the epicenter

After our crawler has received a request to scan a web page, it finds and crawls that page. It then downloads the page's core textual content (we call this the epicenter) from the page's HTML. We do not download or analyze the CSS, JavaScript, images, navigation, footer, sidebars, or other areas tangential to the main textual content on the page. For example, on a typical news web page, our technology analyzes the central text of that page. It does not analyze the embedded or side elements, which may include related stories, additional linked headlines, images, videos, and so on.
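As a rough illustration of the idea, the sketch below keeps the text of a page while skipping scripts, styles, navigation, sidebars, and footers. It is a deliberately simple stand-in: the way Context actually locates the epicenter is not described here, and the set of skipped tags and the names EpicenterSketch and extract_text are assumptions.

```python
from html.parser import HTMLParser

# Assumption: elements treated as tangential to the main text in this sketch.
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer"}

class EpicenterSketch(HTMLParser):
    """Collects text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = EpicenterSketch()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = (
    "<html><head><style>p{}</style></head><body>"
    "<nav>Home</nav><article><p>Main story text.</p></article>"
    "<aside>Related stories</aside><footer>Contact</footer>"
    "</body></html>"
)
print(extract_text(sample))
```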

Re-crawling pages

Pages change, of course, and need to be re-crawled. The crawler maintains an estimate of how frequently a page changes. If a page has been modified since the last time it was crawled, the interval between crawls is halved, down to a minimum of four hours. If it hasn't been modified, the interval is doubled, up to a maximum of 30 days. In this way, the rate of re-crawl soon matches the modification rate, so crawling resources are apportioned efficiently.
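The adjustment rule can be sketched as follows, expressed as the waiting time between crawls. The four-hour and 30-day bounds come from the text; the function name next_interval and the one-day starting estimate are assumptions made for illustration.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(hours=4)   # crawl no more often than every four hours
MAX_INTERVAL = timedelta(days=30)   # crawl at least once every 30 days

def next_interval(current: timedelta, modified: bool) -> timedelta:
    """Halve the wait after a change, double it otherwise, within the bounds."""
    proposed = current / 2 if modified else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))

interval = timedelta(days=1)  # hypothetical starting estimate
for modified in (True, True, True, False):
    interval = next_interval(interval, modified)
    print(interval)
```

A page that keeps changing quickly settles at the four-hour floor, while a static page drifts out to the 30-day ceiling.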

Categorizing pages

Once a web page has been crawled, its information is kept in a document store — a centralized area for information from all crawled pages. Context's central data store holds information from more than 5 billion documents at a time and is an ever-growing, frequently updated record of all pages that have been crawled.

From this document store, a web page's record can be run against our proprietary algorithm to determine the weighted value of the language on the page and determine if there is a match to the contextual segments being used by a partner. If a web page is requested but not found in the document store, the URL is sent to the crawler layer to be crawled, processed and stored in the document store.
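The lookup-then-crawl flow might look like the sketch below. The document store is modeled as a plain dictionary and the crawler as a function passed in by the caller; both are stand-ins, and the synchronous crawl on a miss is an assumption made to keep the example short.

```python
from typing import Callable, Dict

def get_record(url: str, store: Dict[str, dict], crawl: Callable[[str], dict]) -> dict:
    """Return the stored record for a URL, crawling and storing it on a miss."""
    record = store.get(url)
    if record is None:
        record = crawl(url)   # hand the URL to the crawler layer
        store[url] = record   # keep the result for later requests
    return record

# Hypothetical stand-in for the crawler layer; it records which URLs it was given.
crawled_urls = []

def fake_crawl(url: str) -> dict:
    crawled_urls.append(url)
    return {"url": url, "text": "placeholder epicenter text"}

store: Dict[str, dict] = {}
get_record("https://example.com/a", store, fake_crawl)
get_record("https://example.com/a", store, fake_crawl)
print(len(crawled_urls))  # the second request is served from the store
```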

Contextual segments, matches, and signals

When a page is requested for analysis, its categorized data is then matched against the relevant contextual segments. Contextual segments are another elemental factor at the heart of Context's technological processes. 

Segments and sub-segments build upon one another in a tree structure. So, for example, the "sports" segment may include "gs-sports-football" and "gs-sports-basketball," and each of those may include sub-segments relevant to their individual sports.

A segment concerning "sports," for example, would contain multiple words and phrases that indicate, if they appear with enough weight on a page, that the page is about sports.

Each language we cover has hundreds of standard segments addressing an array of topics. The segments are constructed by our teams of editors trained in linguistics and in our processes. Partners can also create, or ask us to create, custom segments to cover topics or niches not sufficiently covered for their purposes by our standard segments or to avoid brand-unsuitable content, based on their own criteria.

Once a page has been crawled, indexed, and categorized, that information is compared to the relevant contextual segments to determine if there is a contextual match. Matches are scored to indicate how strong each match is compared with other matches.

Once a match is found and scored, a response is sent via API to our partners so they can determine how the page should be treated for advertising purposes.
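To make the idea of weighted matching and scoring concrete, here is a toy example. It is not Context's proprietary algorithm: the term weights, the scoring formula, the threshold, and the "gs_food" segment are all invented for illustration.

```python
import re
from typing import Dict

# Hypothetical segments and weights; real segments are curated by editors.
SEGMENTS = {
    "gs_sports": {"football": 1.0, "league": 0.5, "penalty kick": 1.5},
    "gs_food": {"recipe": 1.0, "oven": 0.5},
}
THRESHOLD = 0.05  # assumed cut-off for "enough weight"

def segment_score(text: str, terms: Dict[str, float]) -> float:
    """Sum of weight times occurrences, normalized by page length in words."""
    lowered = text.lower()
    total_words = len(re.findall(r"\w+", lowered)) or 1
    weighted_hits = sum(
        weight * len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term, weight in terms.items()
    )
    return weighted_hits / total_words

def match_segments(text: str) -> Dict[str, float]:
    """Return matching segment names with their scores, strongest first."""
    scores = {name: segment_score(text, terms) for name, terms in SEGMENTS.items()}
    matched = {name: score for name, score in scores.items() if score >= THRESHOLD}
    return dict(sorted(matched.items(), key=lambda item: item[1], reverse=True))

page = "The league confirmed the football final was decided by a penalty kick."
print(match_segments(page))
```

The returned scores only have meaning relative to one another, which mirrors the way match scores are described above.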

The cache

To keep responses as fast as possible, customers can elect to situate a cache close to the systems they are using, so that information on categorized pages can be retrieved and matched with minimal delay. The cache is updated every time a page is rescanned and categorized.

Signal nomenclature

Context provides signals in standardized alphanumeric formats that include a set prefix, descriptive word(s), and the "_" or "-" symbols. The nomenclature is constructed to be easily understandable and differentiated. As one example, a page found to match our "sports" segment sends a response of "gs_sports" as well as a "score" to indicate how strong the match is for that page to the segment compared with other matching segments. The "gs" prefix indicates "Grapeshot Standard," as described below. There may be sub-segments as well, such as "sports-football" or "sports-tennis," which can be used separately for matching or rolled up into an umbrella segment.
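A consumer of these signals could separate the prefix from the descriptive part along the lines of the sketch below. The prefix meanings are taken from this topic ("gs" for standard segments, "gv" for brand safety, "gx" for pages that could not be crawled); the parsing rule itself, splitting at the first "_" or "-", and the name describe_signal are assumptions.

```python
import re
from typing import Tuple

# Prefix meanings as described in this topic; anything else is reported as unknown.
PREFIX_MEANING = {
    "gs": "standard segment (Grapeshot Standard)",
    "gv": "brand safety signal",
    "gx": "page could not be crawled",
}

# Assumed shape: alphanumeric prefix, one "_" or "-", then the descriptive part.
SIGNAL_PATTERN = re.compile(r"^([A-Za-z0-9]+)[_-](.+)$")

def describe_signal(signal: str) -> Tuple[str, str, str]:
    """Return (prefix, name, meaning) for a signal string."""
    found = SIGNAL_PATTERN.match(signal)
    if found is None:
        return ("", signal, "unknown")
    prefix, name = found.groups()
    return (prefix, name, PREFIX_MEANING.get(prefix, "unknown"))

for signal in ("gs_sports", "gv_death_injury", "gs-sports-football"):
    print(describe_signal(signal))
```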

We have a range of overarching signal response types:

Brand safety options

Context provides two standard levels for identifying risks associated with the textual content of web pages, audio and video:

Content that has been successfully processed and does not match any of the standard unsafe (gv_) segments is identified as safe (gv_safe) and offered for targeting.

Custom "safe-from" segments can also be added. If a page matches one of these segments, the page is negatively targeted (avoided).
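The rules above can be sketched as a small decision function. The "gv_" prefix and "gv_safe" come from the text; treating unprocessed pages as pages to avoid, avoiding on any unsafe match, and the custom segment name used in the example are assumptions, since advertisers choose their own policies.

```python
from typing import Set

def safety_signals(processed: bool, matched: Set[str]) -> Set[str]:
    """Return the gv_ signals for a page; gv_safe when nothing unsafe matched."""
    unsafe = {signal for signal in matched if signal.startswith("gv_")}
    if processed and not unsafe:
        return {"gv_safe"}
    return unsafe

def should_avoid(processed: bool, matched: Set[str], safe_from: Set[str]) -> bool:
    """Avoid unless the page is gv_safe and matches no custom safe-from segment."""
    not_safe = "gv_safe" not in safety_signals(processed, matched)
    return not_safe or bool(matched & safe_from)

# "custom_rival_brand" is a hypothetical custom safe-from segment name.
print(should_avoid(True, {"gs_sports"}, {"custom_rival_brand"}))
print(should_avoid(True, {"gs_sports", "gv_arms"}, set()))
print(should_avoid(True, {"gs_sports", "custom_rival_brand"}, {"custom_rival_brand"}))
```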

Standard unsafe segments are ones that map to brand safety parameters that have been agreed upon by industry trade groups such as the 4A's. In September of 2018, the American Association of Advertising Agencies (4A's) Advertiser Protection Bureau (APB) introduced its Brand Safety framework. The framework lists 13 content categories that, in the words of a 4A's news release, "pose risk to advertisers, whereby advertisers might choose to adopt a 'never appropriate' position for their ad buys." These 13 categories and the 4A's definitions of them are identified in the table below, along with Context's corresponding avoidance categories where available and the ways in which such content is otherwise addressed for programmatic content.

 

| # | 4A's Framework category | 4A's Framework definition | Context avoidance category | Context avoidance definition |
|---|---|---|---|---|
| 1 | Adult and Explicit Sexual Content | Illegal sale, distribution, and consumption of child pornography. Explicit or gratuitous depiction of sexual acts, and/or display of genitals, real or animated. | gv_adult | Avoids mature and sexual content. |
| 2 | Arms and Ammunition | Promotion and advocacy of sales of illegal arms, rifles, and handguns. Instructive content on how to obtain, make, distribute, or use illegal arms. Glamorization of illegal arms for the purpose of harm to others. Use of illegal arms in unregulated environments. | gv_arms | Avoids content around guns and weapons. |
| 3 | Crime and Harmful acts to individuals and Society and Human Right Violations | Graphic promotion, advocacy, and depiction of willful harm and actual unlawful criminal activity, including murder, manslaughter and harm to others. Explicit violations of human rights, such as trafficking and slavery. | gv_crime | Segments include serious, sex and violent crimes. |
| 4 | Death or Injury | Promotion or advocacy of death or injury. Murder or willful bodily harm to others. Graphic depictions of willful harm to others. | gv_death_injury | Segments include air, fire, rail, road and sea. |
| 5 | Online piracy | Pirating, copyright infringement, and counterfeiting. | gv_download | Topics related to online piracy and spam. |
| 6 | Hate Speech and Acts of Aggression | Unlawful acts of aggression based on race, nationality, ethnicity, religious affiliation, gender, or sexual preference. Behavior or commentary that incites such hateful acts, including bullying. | gv_hatespeech | Avoids derogatory terms, including content around racism, homophobia, and inflammatory political terms. |
| 7 | Military Conflict | Incendiary content provoking, enticing, or evoking military aggression. Live-action footage/photos of military actions and genocide or other war crimes. | gv_military | Avoids conflict, war and negative foreign policy content. |
| 8 | Obscenity and Profanity | Excessive use of profane language or gestures and other repulsive actions with the intent to shock, offend, or insult. | gv_obscenity | Avoids content that includes offensive terms. |
| 9 | Illegal Drugs | Promotion or sale of illegal drug use, including abuse of prescription drugs. Federal jurisdiction applies, but allowable where legal local jurisdiction can be effectively managed. | gv_drugs | Avoids content related to consumption of drugs, including recreational and performance-enhancing use. |
| 10 | Spam or Harmful Content | Malware/Phishing | (none) | Does not map directly to a specific Context avoidance category, although certain URLs related to this definition may be covered in some part by Context's gv_download category. Additionally, Context, as part of our monitoring of keyword segments, manually adds Spam or Harmful Content sites to our internal block list of pages not to be crawled. If requested by Context clients, these sites return a gv_spam_or_harmful_site categorization. |
| 11 | Terrorism | Promotion and advocacy of graphic terrorist activity involving defamation, physical and/or emotional harm of individuals, communities, and society. | gv_terrorism | Avoids content around terrorist attacks. |
| 12 | Tobacco/eCigarettes/Vaping | Promotion and advocacy of tobacco and e-cigarette (vaping) and alcohol use to minors. | gv_tobacco | Avoids smoking content, including vaping and e-cigarettes. |
| 13 | Sensitive Social Issue/Violations of Human Rights | Disrespectful and harmful treatment of sensitive social topics such as abortion, extreme political positions, and so on. Acts, language and gestures deemed illegal, not otherwise outlined in this framework. Examples include harm to self or others and animal cruelty. Targeted harassment of individuals and groups. | (none) | Does not map directly to a specific Context avoidance category, although some text related to this definition may be covered in some part by Context's gv_hatespeech and gv_obscenity categories. Sites deemed inappropriate as noted in category 10 above may be removed from our crawl list and be designated with a harmful_site categorization. |

 

Video and Audio

For video and audio, we ingest and process the audio matched to the video, converting it to text. That text is then analyzed and matched in a fashion similar to the process described above for epicenter text.

Mobile and CTV apps

Context is able to evaluate applications built for mobile device and CTV environments through the analysis of descriptions in Google Play, iOS and other app stores. For mobile, we categorize apps with a PEGI rating of 3 (for Google) or an Age rating of 4+ (for iOS) as "safe for mobile."
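The "safe for mobile" rule can be written as a simple lookup. The two ratings come from the text above; the store labels "google_play" and "ios", the rating strings, and the function name are assumptions about how such data might be represented.

```python
# Ratings treated as "safe for mobile", keyed by a hypothetical store label.
SAFE_RATINGS = {"google_play": "PEGI 3", "ios": "4+"}

def is_safe_for_mobile(store: str, rating: str) -> bool:
    """True when the app's store rating is the one listed as safe for that store."""
    return SAFE_RATINGS.get(store) == rating

print(is_safe_for_mobile("google_play", "PEGI 3"))
print(is_safe_for_mobile("ios", "12+"))
```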