Use a weighted search #122

Open
wants to merge 27 commits into
base: master
be3ee4f
#37 Adds a test that searches Solr using the eDisMax query parser (WIP)
extracts Nov 16, 2023
214c501
#37 Fix "call to undefined method" error
extracts Nov 16, 2023
0f387b9
#37 Keep Solr-specific terminology & syntax out of Opus\Search\Query
extracts Nov 17, 2023
105ca56
#37 Completes search test which verifies that a weighted Solr search …
extracts Nov 17, 2023
b0b22a4
#37 Default values for getters that return a boolean value must be tr…
extracts Nov 22, 2023
b290e62
#37 The weighted search test now shows that searching with boosted fi…
extracts Nov 22, 2023
abe0045
#37 Adapt tests to Opus\Search\Query->getUnion() returning either tru…
extracts Nov 22, 2023
527e42c
#37 The weighted search test now also verifies that swapping field we…
extracts Nov 23, 2023
541bc43
#37 Default to the "search.weightedSearch" & "search.simple" configur…
extracts Nov 23, 2023
c78afc7
#37 Now checks the sort order of weighted search results; moves testi…
extracts Nov 23, 2023
66c06a1
#37 Reuse test documents between weighted search tests
extracts Nov 23, 2023
c622821
#37 Fixes a namespace conflict
extracts Nov 23, 2023
42b9eb0
#37 Verify the sort order of weighted search results via the document…
extracts Nov 24, 2023
f6d7469
#37 Implements explicit getters getWeightedSearch() & getWeightedFiel…
extracts Nov 24, 2023
ad067cb
#37 Removes the weightedsearch key from the initial data array so tha…
extracts Nov 29, 2023
e0593c0
#37 Fix missing return statement in setWeightedSearch() which uses a …
extracts Nov 29, 2023
357b053
#37 Adds a weight multiplier to generate a value for the Solr "pf" re…
extracts Dec 1, 2023
54d1760
#37 Adds a test that compares the search behaviour of the standard & …
extracts Dec 1, 2023
a47e997
#37 Fix coding style
extracts Dec 1, 2023
efbba3b
#37 Replaces redundant boiler plate code with separate helper methods
extracts Dec 1, 2023
e635d6f
#37 More (and more granular) tests that test weighted search behavior
extracts Dec 1, 2023
0d51ed8
#37 Removes the catchall fields "text" & "simple" from the Solr schem…
extracts Dec 3, 2023
4577fcf
#37 When searching Solr, matches with a score of 0 are now ignored by…
extracts Dec 4, 2023
9e165b1
#37 Adapts a test that searches the author field so that it uses a we…
extracts Dec 4, 2023
8a08dff
Merge branch '4.8.1' into weightedSearch37
j3nsch May 17, 2024
6066e6a
Merge pull request #130 from OPUS4/weightedSearch37tmp
j3nsch May 17, 2024
cf8cb36
#131 Added test for advanced search
j3nsch May 23, 2024
48 changes: 0 additions & 48 deletions conf/schema.xml
@@ -122,59 +122,11 @@
<field name="fulltext_id_success" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="fulltext_id_failure" type="string" indexed="true" stored="true" multiValued="true"/>

<!--
Catchall field, containing all other searchable text fields (implemented
via copyField further on in this schema)
indexes tokens both normally and in reverse for efficient
leading wildcard queries.
-->
<field name="text" type="text" indexed="true" stored="false" multiValued="true" omitNorms="false"/>

<!-- Catchall field for simple search without fulltext. -->
<field name="simple" type="text" indexed="true" stored="false" multiValued="true" omitNorms="false"/>

<dynamicField name="enrichment_*" multiValued="true" stored="true" indexed="true" type="string" />
</fields>

<uniqueKey>id</uniqueKey>

<!--
Copy all searchable text fields into catchall field for keyword searches across everything.

TODO Are all titles and abstracts indexed, or only those in the document's language?
-->
<copyField source="abstract" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="author" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="title_parent" dest="text"/>
<copyField source="title_additional" dest="text"/>
<copyField source="title_sub" dest="text"/>
<copyField source="creating_corporation" dest="text"/>
<copyField source="contributing_corporation" dest="text"/>
<copyField source="publisher_name" dest="text"/>
<copyField source="publisher_place" dest="text"/>
<copyField source="identifier" dest="text"/>
<copyField source="persons" dest="text"/>
<copyField source="enrichment_*" dest="text"/>
<!-- Add fulltext to text field for simple search with fulltext. -->
<copyField source="fulltext" dest="text"/>

<!-- Duplicate content of text in simple field for simple search without fulltext. -->
<copyField source="abstract" dest="simple"/>
<copyField source="title" dest="simple"/>
<copyField source="author" dest="simple"/>
<copyField source="subject" dest="simple"/>
<copyField source="title_parent" dest="simple"/>
<copyField source="title_additional" dest="simple"/>
<copyField source="title_sub" dest="simple"/>
<copyField source="creating_corporation" dest="simple"/>
<copyField source="contributing_corporation" dest="simple"/>
<copyField source="publisher_name" dest="simple"/>
<copyField source="publisher_place" dest="simple"/>
<copyField source="identifier" dest="simple"/>
<copyField source="persons" dest="simple"/>

<!-- TODO why -->
<copyField source="author" dest="author_facet"/>

2 changes: 1 addition & 1 deletion conf/solrconfig.xml
@@ -61,7 +61,7 @@
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
- <str name="df">text</str>
+ <str name="df">title</str>
<str name="q.op">AND</str>
</lst>
</requestHandler>
117 changes: 107 additions & 10 deletions src/Query.php
@@ -33,6 +33,8 @@
namespace Opus\Search;

use InvalidArgumentException;
use Opus\Common\Config;
use Opus\Search\Config as SearchConfig;
use Opus\Search\Facet\Set;
use Opus\Search\Filter\AbstractFilterBase;
use RuntimeException;
@@ -42,6 +44,7 @@
use function array_merge;
use function array_shift;
use function array_unique;
use function boolval;
use function count;
use function ctype_digit;
use function intval;
@@ -74,7 +77,7 @@
* @method int getRows( int $default = null )
* @method string[] getFields( array $default = null )
* @method array getSort( array $default = null )
- * @method bool getUnion( bool $default = null )
+ * @method bool getUnion( bool $default = false )
* @method AbstractFilterBase getFilter(AbstractFilterBase $default = null ) retrieves condition to be met by resulting documents
* @method Set getFacet( Set $default = null )
* @method $this setStart( int $offset )
@@ -86,6 +89,8 @@
* @method $this setFacet( Set $facet )
* @method $this addFields( string $fields )
* @method $this addSort( $sorting )
* @method $this setWeightedFields( array $weightedFields ) assigns boost factors to fields (e.g. [ 'title' => 10, 'abstract' => 0.5 ])
* @method $this setWeightMultiplier( int $multiplier ) multiplier to further increase boost factors when matching phrases
*/
class Query
{
@@ -95,14 +100,16 @@ class Query
public function reset()
{
$this->data = [
- 'start' => null,
- 'rows' => null,
- 'fields' => null,
- 'sort' => null,
- 'union' => null,
- 'filter' => null,
- 'facet' => null,
- 'subfilters' => null,
+ 'start' => null,
+ 'rows' => null,
+ 'fields' => null,
+ 'sort' => null,
+ 'union' => false,
+ 'filter' => null,
+ 'facet' => null,
+ 'subfilters' => null,
+ 'weightedfields' => null,
+ 'weightmultiplier' => null,
];
}

@@ -184,6 +191,83 @@ protected function normalizeDirection($ascending)
return $ascending;
}

/**
* Returns true if a weighted search shall be used, otherwise returns false.
*
* @return bool
*/
public function getWeightedSearch()
{
if (! isset($this->data['weightedsearch'])) {
$config = Config::get();

if (isset($config->search->weightedSearch)) {
$this->data['weightedsearch'] = boolval($config->search->weightedSearch);
} else {
$this->data['weightedsearch'] = false;
}
}

return $this->data['weightedsearch'];
}

/**
* Set to true if a weighted search shall be used, otherwise set to false.
*
* @param bool $value
* @return $this fluent interface
*/
public function setWeightedSearch($value)
{
$this->data['weightedsearch'] = ! ! $value;

return $this;
}

/**
* Returns boost factors keyed by field (e.g. [ 'title' => 10, 'abstract' => 0.5 ]).
*
* @return (int|float)[]
*/
public function getWeightedFields()
{
if ($this->data['weightedfields'] === null) {
$config = Config::get();

if (isset($config->search->simple)) {
$this->data['weightedfields'] = $config->search->simple->toArray();
} else {
$this->data['weightedfields'] = [];
}
}

return $this->data['weightedfields'];
}

/**
* Returns a positive integer used as a multiplier to further increase field-specific boost factors when
* matching phrases (i.e., in cases where all query terms appear in close proximity).
*
* For example, with a weight multiplier of 5, the weightedfields array [ 'title' => 10, 'abstract' => 0.5 ]
* would be translated to [ 'title' => 50, 'abstract' => 2.5 ] when matching phrases.
*
* @return int
*/
public function getWeightMultiplier()
{
if ($this->data['weightmultiplier'] === null) {
$config = Config::get();

if (isset($config->search->weightMultiplier)) {
$this->data['weightmultiplier'] = $config->search->weightMultiplier;
} else {
$this->data['weightmultiplier'] = 1;
}
}

return $this->data['weightmultiplier'];
}
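The three getters above share one pattern: use an explicitly set value if there is one, otherwise fall back to the application config, otherwise to a hard-coded default. A minimal standalone sketch of that fallback follows, assuming a plain nested array as a hypothetical stand-in for the config object. The function name `resolveWeightedSearchSettings` is illustrative and not part of this PR.

```php
<?php

/**
 * Resolves the three weighted-search settings from a nested config array,
 * using the same defaults as the getters above (false, [], 1).
 * The array layout is a hypothetical stand-in for the real config object.
 */
function resolveWeightedSearchSettings(array $config): array
{
    $search = $config['search'] ?? [];

    return [
        'weightedsearch'   => isset($search['weightedSearch']) ? boolval($search['weightedSearch']) : false,
        'weightedfields'   => $search['simple'] ?? [],
        'weightmultiplier' => $search['weightMultiplier'] ?? 1,
    ];
}

$settings = resolveWeightedSearchSettings([
    'search' => [
        'weightedSearch' => '1',
        'simple'         => ['title' => 10, 'abstract' => 0.5],
    ],
]);

var_dump($settings['weightedsearch']);   // bool(true)
var_dump($settings['weightmultiplier']); // int(1), the default

// Caveat: boolval() treats every non-empty string except "0" as true.
var_dump(boolval('false'));              // bool(true)
```

The last line is the reason a literal string "false" reaching `boolval()` would switch weighted search on. Whether that can happen depends on how the config loader parses boolean values, which is not visible in this diff.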

/**
* Retrieves value of selected query parameter.
*
@@ -214,6 +298,7 @@ public function set($name, $value, $adding = false)
switch ($name) {
case 'start':
case 'rows':
case 'weightmultiplier':
if ($adding) {
throw new InvalidArgumentException('invalid parameter access on ' . $name);
}
@@ -300,6 +385,18 @@ public function set($name, $value, $adding = false)

case 'subfilters':
throw new RuntimeException('invalid access on sub filters');

case 'weightedfields':
if ($adding) {
throw new InvalidArgumentException('invalid parameter access on ' . $name);
}

if (! is_array($value)) {
throw new InvalidArgumentException('invalid query fields option');
}

$this->data[$name] = $value;
break;
}

return $this;
@@ -469,7 +566,7 @@ public function getSubFilters()
*/
public static function getParameterDefault($name, $fallbackIfMissing, $oldName = null)
{
- $config = Config::getDomainConfiguration();
+ $config = SearchConfig::getDomainConfiguration();
$defaults = $config->parameterDefaults;

if ($defaults instanceof Zend_Config) {
25 changes: 18 additions & 7 deletions src/Result/Base.php
@@ -38,8 +38,10 @@
use Opus\Search\Log;
use RuntimeException;

use function array_filter;
use function array_key_exists;
use function array_map;
use function array_values;
use function count;
use function ctype_digit;
use function intval;
@@ -192,9 +194,11 @@ public function getFacet($fieldName)
* Retrieves set of matching and locally existing documents returned in
* response to some search query.
*
* @param bool $ignoreZeroScoreMatches ignore any matches with score 0.0
* (true) or not (false); defaults to true
* @return ResultMatch[]
*/
- public function getReturnedMatches()
+ public function getReturnedMatches($ignoreZeroScoreMatches = true)
{
if ($this->data['matches'] === null) {
return [];
@@ -208,7 +212,10 @@ public function getReturnedMatches()
foreach ($this->data['matches'] as $match) {
try {
$match->getDocument();
- $matches[] = $match;
+ $ignoreMatch = $ignoreZeroScoreMatches === true && $match->getScore() === 0.0;
+ if ($ignoreMatch !== true) {
+ $matches[] = $match;
+ }
} catch (DocumentException $e) {
Log::get()->warn('skipping matching but locally missing document #' . $match->getId());
}
@@ -223,18 +230,22 @@ public function getReturnedMatches()
*
* @note If query was requesting to retrieve non-qualified matches this set
* might include IDs of documents that don't exist locally anymore.
* @param bool $ignoreZeroScoreMatches ignore any matches with score 0.0
* (true) or not (false); defaults to true
* @return int[]
*/
- public function getReturnedMatchingIds()
+ public function getReturnedMatchingIds($ignoreZeroScoreMatches = true)
{
if ($this->data['matches'] === null) {
return [];
}

- return array_map(function ($match) {
- /** @var ResultMatch $match */
- return $match->getId();
+ $matchingIds = array_map(function (ResultMatch $match) use ($ignoreZeroScoreMatches) {
+ $ignoreMatch = $ignoreZeroScoreMatches === true && $match->getScore() === 0.0;
+ return $ignoreMatch !== true ? $match->getId() : null;
}, $this->data['matches']);
+
+ return array_values(array_filter($matchingIds));
}
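The zero-score filtering above can be sketched in isolation. This version filters the matches first and extracts the IDs afterwards, so an ID is never tested for truthiness. By contrast, `array_filter()` without a callback drops every falsy entry, which would include an ID of 0; that should not matter as long as document IDs are always positive. The plain arrays and the function name `collectMatchingIds` are hypothetical stand-ins, not part of this PR.

```php
<?php

/**
 * Returns the IDs of all matches, optionally skipping those scored exactly 0.0.
 * Each match is a plain array here, a hypothetical stand-in for ResultMatch.
 */
function collectMatchingIds(array $matches, bool $ignoreZeroScoreMatches = true): array
{
    $kept = array_filter($matches, function (array $match) use ($ignoreZeroScoreMatches): bool {
        return ! ($ignoreZeroScoreMatches && $match['score'] === 0.0);
    });

    // array_column() re-indexes the result, so no array_values() call is needed.
    return array_column($kept, 'id');
}

$matches = [
    ['id' => 12, 'score' => 1.7],
    ['id' => 15, 'score' => 0.0],
    ['id' => 18, 'score' => 0.3],
];

print_r(collectMatchingIds($matches));        // IDs 12 and 18
print_r(collectMatchingIds($matches, false)); // IDs 12, 15 and 18
```

Like the PR code, the sketch compares with `=== 0.0`, so an integer `0` or a `null` score would not count as a zero-score match.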

/**
@@ -247,7 +258,7 @@ public function getReturnedMatchingIds()
* has changed in that it's returning set of Opus_Document instances
* rather than set of Opus_Search_Util_Result instances.
* @note The wording is less specific in that all information in response to
- * search query may considered results of search. Thus this new API
+ * search query may be considered results of search. Thus this new API
* prefers "matches" over "results".
*/
public function getResults()
54 changes: 54 additions & 0 deletions src/Solr/Solarium/Adapter.php
@@ -77,6 +77,7 @@
use function file_exists;
use function filesize;
use function filter_var;
use function implode;
use function in_array;
use function intval;
use function is_array;
@@ -615,6 +616,25 @@ protected function applyParametersOnQuery(
$query->setSorts($sortings);
}

$isWeightedSearch = $parameters->getWeightedSearch();
if ($isWeightedSearch === true) {
// get the edismax component
$edismax = $query->getEDisMax();

// NOTE: query is now an edismax query
$weightedFields = $parameters->getWeightedFields();
if (! empty($weightedFields)) {
$queryFields = $this->getQueryFieldsString($weightedFields);
$edismax->setQueryFields($queryFields);

$weightMultiplier = $parameters->getWeightMultiplier();
if ($weightMultiplier !== null) {
$phraseFields = $this->getPhraseFieldsString($weightedFields, $weightMultiplier);
$edismax->setPhraseFields($phraseFields);
}
}
}
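For orientation, the weighted branch above should reach Solr as the standard request parameters `defType`, `qf` and `pf`, assuming Solarium serializes the eDisMax component in the usual way. The sketch below assembles such a parameter set by hand in plain PHP without using Solarium. The query text and boost values are hypothetical and chosen only for illustration.

```php
<?php

// Standard Solr parameters for an eDisMax request (illustrative values only).
$params = [
    'q'       => 'weighted search',
    'defType' => 'edismax',
    'qf'      => 'title^10 abstract^0.5', // per-field boosts for term matches
    'pf'      => 'title^50 abstract^2.5', // higher boosts when the terms match as a phrase
];

// URL-encoded query string as it could be appended to a select request.
$queryString = http_build_query($params, '', '&', PHP_QUERY_RFC3986);
echo $queryString, "\n";
```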

$facet = $parameters->getFacet();
if ($facet !== null) {
$facetSet = $query->getFacetSet();
@@ -880,4 +900,38 @@ public function setTimeout($timeout)
$this->client->setOptions($options, true);
}
}

/**
* Converts an array containing boost factors keyed by field into a query fields string that can be used
* as input for the Solr `qf` request parameter.
*
* @param (int|float)[] $weightedFields assigns boost factors to fields, e.g.: [ 'title' => 10, 'abstract' => 0.5 ]
* @return string query fields string, e.g.: "title^10 abstract^0.5"
*/
protected function getQueryFieldsString($weightedFields)
{
$queryFields = [];
foreach ($weightedFields as $field => $boostFactor) {
$queryFields[] = "$field^$boostFactor";
}

return implode(' ', $queryFields);
}

/**
* Generates a phrase fields string that can be used as input for the Solr `pf` request parameter.
*
* @param (int|float)[] $weightedFields assigns boost factors to fields, e.g.: [ 'title' => 10, 'abstract' => 0.5 ]
* @param int $weightMultiplier factor by which each boost factor will be multiplied when matching phrases, e.g.: 5
* @return string phrase fields string, e.g.: "title^50 abstract^2.5"
*/
protected function getPhraseFieldsString($weightedFields, $weightMultiplier)
{
$phraseFields = [];
foreach ($weightedFields as $field => $boostFactor) {
$phraseFields[] = "$field^" . $boostFactor * $weightMultiplier;
}

return implode(' ', $phraseFields);
}
}
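The two helper methods above differ only in the multiplier, so their behaviour can be checked with a single standalone function. This is a minimal sketch, not the PR's implementation; `buildBoostString` is a hypothetical name, and the weights are the example values from the docblocks.

```php
<?php

/**
 * Builds a Solr boost string ("field^boost field^boost ...") from a
 * field => boost map. With the default multiplier of 1 the result suits
 * the qf parameter; with a larger multiplier it suits pf.
 */
function buildBoostString(array $weightedFields, $multiplier = 1): string
{
    $parts = [];
    foreach ($weightedFields as $field => $boost) {
        $parts[] = $field . '^' . ($boost * $multiplier);
    }

    return implode(' ', $parts);
}

$weights = ['title' => 10, 'abstract' => 0.5];

echo buildBoostString($weights), "\n";    // title^10 abstract^0.5
echo buildBoostString($weights, 5), "\n"; // title^50 abstract^2.5
```

An empty map yields an empty string, which is why the adapter only sets `qf` and `pf` when the weighted fields are non-empty.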