If you use Postgres, the most straightforward option is acts-as-tsearch . Tsearch2 is a Postgres contrib module that lets you offload the heavy lifting to low-level C code.
——
Here’s proposed solution. This is work in progress for now. Please comment. Will produce a proper patch, when done.
GlebArshinov
# this goes into base.rb
# Works like find_all, but takes a plain text search string
# rather than conditions. Current implementation splits the
# text into keywords and applies +find_by_keywords+, which then
# uses SQL LIKE operator too look for matches.
# +find_by_keywords+ sanitizes keywords to avoid SQL injection.
#
# +orderings+ and +limit+ arguments are supported as in +find_all+.
#
# * TODO: current tokenizer just splits the text on spaces. I'd
# like to be able to consider quoted strings as single token
# (standard search engine behavior).
# Example:
# User.find_by_text "artist David"
def find_by_text(text, orderings = nil, limit = nil, joins = nil)
keywords = text.split(' ')
find_by_keywords(keywords)
end
# Works like find_all, but takes a collection of keywords rather
# than conditions. Returns an array of objects where all the
# keywords are found in the attributes. Keywords are sanitized
# to avoid SQL injection.
#
# +orderings+ and +limit+ arguments are supported as in +find_all+.
#
# Implementation details:
# * Current implementation builds a set of conditions using SQL
# LIKE '%keyword%' operator. For each keyword a match is
# attempted on all the columns. If any column matches, then
# keyword matches. A record matches when all the keywords
# match. Example:
# Given User(name, description) search for ["artist" "David"]
# will produce condition:
# (name LIKE '%artist%' OR description LIKE '%artist%') AND
# (name LIKE '%David%' OR description LIKE '%David%')
# * TODO: need to deal with specifics of LIKE operator. Case
# sensitivity is different for different databases. Also need
# to escape special LIKE character (e.g. '%') in provided
# strings. Also, not sure if applying LIKE to non-character
# columns is generally supported, though it does works in
# <a href="http://wiki.rubyonrails.com/rails/pages/MySQL" class="existingWikiWord">MySQL</a>.
# * Additional method (this do not exist right now) integrating
# with <a href="http://wiki.rubyonrails.com/rails/pages/MySQL" class="existingWikiWord">MySQL</a> full text search would be useful:
# <a href="http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html">http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html</a>
# * Additional method (this do not exist right now) integrating
# with "PostgreSQL":http://www.moonation.com/3.html full text search would be useful:
# <a href="http://www.sai.msu.su/">http://www.sai.msu.su/</a>~megera/postgres/gist/tsearch/V2/
#
# Limitations:
# * +conditions+ argument is not supported. Supporting its
# array form would require some refactoring of existing code,
# since we need to avoid further sanitizing of conditions we
# create, as they use '%' characters.
#
# Example:
# User.find_by_keywords ["artist" "David"]
def find_by_keywords(keywords, orderings = nil, limit = nil, joins = nil)
# TODO: should do less conservative sanitizing, since we know
# that these will go inside SQL string, so characters like ':'
# or ';' are OK. Also, need to escape \%_ characters (these
# have special meaning to LIKE)
#
# LIKE documentation:
# <a href="http://www.postgresql.org/docs/7.1/interactive/functions-matching.html">http://www.postgresql.org/docs/7.1/interactive/functions-matching.html</a>
# <a href="http://dev.mysql.com/doc/mysql/en/Pattern_matching.html">http://dev.mysql.com/doc/mysql/en/Pattern_matching.html</a>
# <a href="http://dev.mysql.com/doc/mysql/en/Case_sensitivity.html">http://dev.mysql.com/doc/mysql/en/Case_sensitivity.html</a>
keywords.collect! { |keyword| sanitize(keyword) }
# build condition in the form of
# (name LIKE '%artist%' OR description LIKE '%artist%')
# AND (name LIKE '%David%' OR description LIKE '%David%')
condition = keywords.collect { |keyword|
'(' + columns.collect { |column|
"#{column.name} LIKE '%#{keyword}%'"
}.join(' OR ') + ')'
}.join(' AND ')
find_all(condition, orderings, limit, joins)
end
# this goes into finder_test.rb
def test_find_by_text
# simple match
topics = Topic.find_by_text "Mary"
assert_equal(1, topics.size)
assert_equal(@topic_fixtures["second"]["title"], topics.first.title)
# case insensitive match
topics = Topic.find_by_text "mary"
assert_equal(1, topics.size)
assert_equal(@topic_fixtures["second"]["title"], topics.first.title)
# substring
topics = Topic.find_by_text "ary"
assert_equal(1, topics.size)
assert_equal(@topic_fixtures["second"]["title"], topics.first.title)
# simple negative test
topics = Topic.find_by_text "does_not_exist"
assert_equal(0, topics.size)
# negative test across columns
topics = Topic.find_by_text "Mary does_not_exist"
assert_equal(0, topics.size)
# different order in the same column
topics = Topic.find_by_text "day have"
assert_equal(2, topics.size)
# match across 2 columns
topics = Topic.find_by_text "day David have"
assert_equal(@topic_fixtures["first"]["title"], topics.first.title)
assert_equal(1, topics.size)
end
—-
The sanitize line:
keywords.collect! { |keyword| sanitize(keyword) } is putting single quotes around each keyword so the search ends up like …LIKE ‘%’key’%’… and fails. Is there another way to escape the keywords?
—-
I believe that you can use sanitize("%#{term}%") instead, unless sanitize escapes %.
—-
Instead
keywords = text.split(' ')
you can use
keywords = text.scan(/[^"() ]*["(][^")]*[")]|[^"() ]+/)
and you will get Google like tokenizer
If you use Postgres, the most straightforward option is acts-as-tsearch . Tsearch2 is a Postgres contrib module that lets you offload the heavy lifting to low-level C code.
——
Here’s proposed solution. This is work in progress for now. Please comment. Will produce a proper patch, when done.
GlebArshinov
# this goes into base.rb
# Works like find_all, but takes a plain text search string
# rather than conditions. Current implementation splits the
# text into keywords and applies +find_by_keywords+, which then
# uses SQL LIKE operator too look for matches.
# +find_by_keywords+ sanitizes keywords to avoid SQL injection.
#
# +orderings+ and +limit+ arguments are supported as in +find_all+.
#
# * TODO: current tokenizer just splits the text on spaces. I'd
# like to be able to consider quoted strings as single token
# (standard search engine behavior).
# Example:
# User.find_by_text "artist David"
def find_by_text(text, orderings = nil, limit = nil, joins = nil)
keywords = text.split(' ')
find_by_keywords(keywords)
end
# Works like find_all, but takes a collection of keywords rather
# than conditions. Returns an array of objects where all the
# keywords are found in the attributes. Keywords are sanitized
# to avoid SQL injection.
#
# +orderings+ and +limit+ arguments are supported as in +find_all+.
#
# Implementation details:
# * Current implementation builds a set of conditions using SQL
# LIKE '%keyword%' operator. For each keyword a match is
# attempted on all the columns. If any column matches, then
# keyword matches. A record matches when all the keywords
# match. Example:
# Given User(name, description) search for ["artist" "David"]
# will produce condition:
# (name LIKE '%artist%' OR description LIKE '%artist%') AND
# (name LIKE '%David%' OR description LIKE '%David%')
# * TODO: need to deal with specifics of LIKE operator. Case
# sensitivity is different for different databases. Also need
# to escape special LIKE character (e.g. '%') in provided
# strings. Also, not sure if applying LIKE to non-character
# columns is generally supported, though it does works in
# <a href="http://wiki.rubyonrails.com/rails/pages/MySQL" class="existingWikiWord">MySQL</a>.
# * Additional method (this do not exist right now) integrating
# with <a href="http://wiki.rubyonrails.com/rails/pages/MySQL" class="existingWikiWord">MySQL</a> full text search would be useful:
# <a href="http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html">http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html</a>
# * Additional method (this do not exist right now) integrating
# with "PostgreSQL":http://www.moonation.com/3.html full text search would be useful:
# <a href="http://www.sai.msu.su/">http://www.sai.msu.su/</a>~megera/postgres/gist/tsearch/V2/
#
# Limitations:
# * +conditions+ argument is not supported. Supporting its
# array form would require some refactoring of existing code,
# since we need to avoid further sanitizing of conditions we
# create, as they use '%' characters.
#
# Example:
# User.find_by_keywords ["artist" "David"]
def find_by_keywords(keywords, orderings = nil, limit = nil, joins = nil)
# TODO: should do less conservative sanitizing, since we know
# that these will go inside SQL string, so characters like ':'
# or ';' are OK. Also, need to escape \%_ characters (these
# have special meaning to LIKE)
#
# LIKE documentation:
# <a href="http://www.postgresql.org/docs/7.1/interactive/functions-matching.html">http://www.postgresql.org/docs/7.1/interactive/functions-matching.html</a>
# <a href="http://dev.mysql.com/doc/mysql/en/Pattern_matching.html">http://dev.mysql.com/doc/mysql/en/Pattern_matching.html</a>
# <a href="http://dev.mysql.com/doc/mysql/en/Case_sensitivity.html">http://dev.mysql.com/doc/mysql/en/Case_sensitivity.html</a>
keywords.collect! { |keyword| sanitize(keyword) }
# build condition in the form of
# (name LIKE '%artist%' OR description LIKE '%artist%')
# AND (name LIKE '%David%' OR description LIKE '%David%')
condition = keywords.collect { |keyword|
'(' + columns.collect { |column|
"#{column.name} LIKE '%#{keyword}%'"
}.join(' OR ') + ')'
}.join(' AND ')
find_all(condition, orderings, limit, joins)
end
# this goes into finder_test.rb
def test_find_by_text
# simple match
topics = Topic.find_by_text "Mary"
assert_equal(1, topics.size)
assert_equal(@topic_fixtures["second"]["title"], topics.first.title)
# case insensitive match
topics = Topic.find_by_text "mary"
assert_equal(1, topics.size)
assert_equal(@topic_fixtures["second"]["title"], topics.first.title)
# substring
topics = Topic.find_by_text "ary"
assert_equal(1, topics.size)
assert_equal(@topic_fixtures["second"]["title"], topics.first.title)
# simple negative test
topics = Topic.find_by_text "does_not_exist"
assert_equal(0, topics.size)
# negative test across columns
topics = Topic.find_by_text "Mary does_not_exist"
assert_equal(0, topics.size)
# different order in the same column
topics = Topic.find_by_text "day have"
assert_equal(2, topics.size)
# match across 2 columns
topics = Topic.find_by_text "day David have"
assert_equal(@topic_fixtures["first"]["title"], topics.first.title)
assert_equal(1, topics.size)
end
—-
The sanitize line:
keywords.collect! { |keyword| sanitize(keyword) } is putting single quotes around each keyword so the search ends up like …LIKE ‘%’key’%’… and fails. Is there another way to escape the keywords?
—-
I believe that you can use sanitize("%#{term}%") instead, unless sanitize escapes %.
—-
Instead
keywords = text.split(' ')
you can use
keywords = text.scan(/[^"() ]*["(][^")]*[")]|[^"() ]+/)
and you will get Google like tokenizer