12.9.4 Full-Text Stopwords

The stopword list is loaded and searched for full-text queries using the server character set and collation (the values of the character_set_server and collation_server system variables). False hits or misses might occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from character_set_server or collation_server.

Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case insensitive if the collation is latin1_swedish_ci, whereas lookups are case sensitive if the collation is latin1_general_cs or latin1_bin.

Stopwords for InnoDB Search Indexes

InnoDB has a relatively short list of default stopwords, because documents from technical, literary, and other sources often use short words as keywords or in significant phrases. For example, you might search for to be or not to be and expect to get a sensible result, rather than having all those words ignored.

To see the default InnoDB stopword list, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)

To define your own stopword list for all InnoDB tables, define a table with the same structure as the INNODB_FT_DEFAULT_STOPWORD table, populate it with stopwords, and set the value of the innodb_ft_server_stopword_table option to a value in the form db_name/table_name before creating the full-text index. The stopword table must have a single VARCHAR column named value. The following example demonstrates creating and configuring a new global stopword table for InnoDB.

-- Create a new stopword table

mysql> CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
Query OK, 0 rows affected (0.01 sec)

-- Insert stopwords (for simplicity, a single stopword is used in this example)  

mysql> INSERT INTO my_stopwords(value) VALUES ('Ishmael');
Query OK, 1 row affected (0.00 sec)

-- Create the table

mysql> CREATE TABLE opening_lines (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
opening_line TEXT(500),
author VARCHAR(200),
title VARCHAR(200)
) ENGINE=InnoDB;
Query OK, 0 rows affected (0.01 sec)

-- Insert data into the table

mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
('Call me Ishmael.','Herman Melville','Moby-Dick'),
('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
('I am an invisible man.','Ralph Ellison','Invisible Man'),
('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
('It was love at first sight.','Joseph Heller','Catch-22'),
('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

-- Set the innodb_ft_server_stopword_table option to the new stopword table

mysql> SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
Query OK, 0 rows affected (0.00 sec)

-- Create the full-text index (which rebuilds the table if no FTS_DOC_ID column is defined)

mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (1.17 sec)
Records: 0  Duplicates: 0  Warnings: 1

Verify that the specified stopword ('Ishmael') does not appear by querying the words in INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE.

Note

By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index. Maximum and minimum word length values are configurable using the innodb_ft_max_token_size and innodb_ft_min_token_size variables.

mysql> SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 15;
+-----------+
| word      |
+-----------+
| across    |
| all       |
| burn      |
| buy       |
| call      |
| comes     |
| dalloway  |
| first     |
| flowers   |
| happened  |
| herself   |
| invisible |
| less      |
| love      |
| man       |
+-----------+
15 rows in set (0.00 sec)

To create stopword lists on a table-by-table basis, create other stopword tables and use the innodb_ft_user_stopword_table option to specify the stopword table that you want to use before you create the full-text index.

Stopwords for MyISAM Search Indexes

In MySQL 5.6, the stopword file is loaded and searched using latin1 if character_set_server is ucs2, utf16, utf16le, or utf32.

To override the default stopword list for MyISAM tables, set the ft_stopword_file system variable. (See Section 5.1.4, “Server System Variables”.) The variable value should be the path name of the file containing the stopword list, or the empty string to disable stopword filtering. The server looks for the file in the data directory unless an absolute path name is given to specify a different directory. After changing the value of this variable or the contents of the stopword file, restart the server and rebuild your FULLTEXT indexes.

The stopword list is free-form, separating stopwords with any nonalphanumeric character such as newline, space, or comma. Exceptions are the underscore character (_) and a single apostrophe (') which are treated as part of a word. The character set of the stopword list is the server's default character set; see Section 10.1.3.1, “Server Character Set and Collation”.

The following table shows the default list of stopwords for MyISAM search indexes. In a MySQL source distribution, you can find this list in the storage/myisam/ft_static.c file.

a'sableaboutaboveaccording
accordinglyacrossactuallyafterafterwards
againagainstain'tallallow
allowsalmostalonealongalready
alsoalthoughalwaysamamong
amongstanandanotherany
anybodyanyhowanyoneanythinganyway
anywaysanywhereapartappearappreciate
appropriatearearen'taroundas
asideaskaskingassociatedat
availableawayawfullybebecame
becausebecomebecomesbecomingbeen
beforebeforehandbehindbeingbelieve
belowbesidebesidesbestbetter
betweenbeyondbothbriefbut
byc'monc'scamecan
can'tcannotcantcausecauses
certaincertainlychangesclearlyco
comcomecomesconcerningconsequently
considerconsideringcontaincontainingcontains
correspondingcouldcouldn'tcoursecurrently
definitelydescribeddespitediddidn't
differentdodoesdoesn'tdoing
don'tdonedowndownwardsduring
eacheduegeighteither
elseelsewhereenoughentirelyespecially
etetceveneverevery
everybodyeveryoneeverythingeverywhereex
exactlyexampleexceptfarfew
fifthfirstfivefollowedfollowing
followsforformerformerlyforth
fourfromfurtherfurthermoreget
getsgettinggivengivesgo
goesgoinggonegotgotten
greetingshadhadn'thappenshardly
hashasn'thavehaven'thaving
hehe'shellohelphence
herherehere'shereafterhereby
hereinhereuponhersherselfhi
himhimselfhishitherhopefully
howhowbeithoweveri'di'll
i'mi'veieifignored
immediateininasmuchincindeed
indicateindicatedindicatesinnerinsofar
insteadintoinwardisisn't
itit'dit'llit'sits
itselfjustkeepkeepskept
knowknownknowslastlately
laterlatterlatterlyleastless
lestletlet'slikeliked
likelylittlelooklookinglooks
ltdmainlymanymaymaybe
memeanmeanwhilemerelymight
moremoreovermostmostlymuch
mustmymyselfnamenamely
ndnearnearlynecessaryneed
needsneitherneverneverthelessnew
nextninenonobodynon
nonenoonenornormallynot
nothingnovelnownowhereobviously
ofoffoftenohok
okayoldononceone
onesonlyontoorother
othersotherwiseoughtourours
ourselvesoutoutsideoveroverall
ownparticularparticularlyperperhaps
placedpleasepluspossiblepresumably
probablyprovidesquequiteqv
ratherrdrereallyreasonably
regardingregardlessregardsrelativelyrespectively
rightsaidsamesawsay
sayingsayssecondsecondlysee
seeingseemseemedseemingseems
seenselfselvessensiblesent
seriousseriouslysevenseveralshall
sheshouldshouldn'tsincesix
sosomesomebodysomehowsomeone
somethingsometimesometimessomewhatsomewhere
soonsorryspecifiedspecifyspecifying
stillsubsuchsupsure
t'staketakentelltends
ththanthankthanksthanx
thatthat'sthatsthetheir
theirsthemthemselvesthenthence
therethere'sthereaftertherebytherefore
thereintheresthereuponthesethey
they'dthey'llthey'rethey'vethink
thirdthisthoroughthoroughlythose
thoughthreethroughthroughoutthru
thustotogethertootook
towardtowardstriedtriestruly
trytryingtwicetwoun
underunfortunatelyunlessunlikelyuntil
untoupuponususe
usedusefulusesusingusually
valuevariousveryviaviz
vswantwantswaswasn't
waywewe'dwe'llwe're
we'vewelcomewellwentwere
weren'twhatwhat'swhateverwhen
whencewheneverwherewhere'swhereafter
whereaswherebywhereinwhereuponwherever
whetherwhichwhilewhitherwho
who'swhoeverwholewhomwhose
whywillwillingwishwith
withinwithoutwon'twonderwould
wouldn'tyesyetyouyou'd
you'llyou'reyou'veyouryours
yourselfyourselveszero