============================================
Sphider - a lightweight search engine in PHP
Version 1.3.10, PDO version
Installation and usage instructions
Ando Saabas 2005-2007
(PDO port by Thiadmer Riemersma, 2012-2016)
============================================
------------
Documentation
------------
1. Installation
2. Indexing options
3. Customizing
4. Using the indexer from the command line
5. Keeping pages from being indexed
* Robots.txt
* Must include / must not include string list
* Ignoring links
* Ignoring parts of a page
1. Installation
===============
1. Unpack the files somewhere on your drive. On Microsoft Windows, you can
create a folder "sphider" in your "My Documents" folder. On Linux, make
a subdirectory in your home directory.
Alternatively, you can unpack the files directly on the server, if that is
convenient for you.
2. On the server, create a database to hold Sphider data (in MySQL, SQLite or
another supported format).
If you have direct access to the server (e.g. with a secure shell login),
the steps are:
a) Type: mysql -u <your username> -p
(enter your password when prompted)
b) Type: create database sphider_db;
(of course you can use some other name for database instead of sphider_db)
c) Type: exit
When you use DirectAdmin:
a) Log in to your DirectAdmin account
b) Click on "MySQL Management"
c) In the next page, click on "Create new Database"
d) Fill in the database name (e.g. "sphider_db"), the user name and the
password. Note that DirectAdmin may prefix the account name to the
database name (and/or the user name).
When using SQLite, make sure that the folder in which the database is
stored, as well as the file itself, have read/write permissions for the web
server pseudo-user.
3. In the "settings" directory, edit the "database.php" file and change the
definition of the database (in variable $db); see the PDO documentation for
the syntax.
For MySQL, constant DATABASE_NAME also needs to be set, and possibly
TABLE_PREFIX as well.
For SQLite, constant AUTOINCREMENT has to be defined as an empty string, for
example:
define("AUTOINCREMENT", "");
The full text of each page is stored in the database. By default, the field
for the full text has the type TEXT. The maximum text size that a field with
this type can hold is database-dependent: for SQLite it is about 1 GB by
default, but for MySQL it is 64 KB. If you have pages larger than that (in
raw text, after removing all HTML tags), you should change the field type,
for example to MEDIUMTEXT.
4. In the "admin" directory, edit "auth.php" to change the administrator user
name and password (default values are "admin" and "admin"). In this file,
you can also disable the web interface for changing the configuration.
5. Upload everything to your web server (not needed if you unpacked all
files on the server directly).
6. Open "install.php" script ("admin" directory) in your browser, which will
create the tables necessary for Sphider to operate.
So, when your web server is at "www.mydomain.com" and you uploaded Sphider
in directory "search" at that domain, you need to type the following
address in your browser:
http://www.mydomain.com/search/admin/install.php
The install page just lists a number of names for tables (25 in total) and
it should end with a line that all is fine. If there is an error message, or
if the table names do not appear, the installation failed.
7. Still in the web browser, open admin/admin.php. Again, the full address
would be:
http://www.mydomain.com/search/admin/admin.php
You need to fill in your user name and password (the ones in "auth.php").
Note that these are both case-sensitive.
8. Now you can add your site. The site refers to the first page where the
spidering starts. For technical reasons, this must be a true file; it may
be a PHP file, but not a redirect. So, continuing the example that your
site is at "www.mydomain.com", you could add:
http://www.mydomain.com/index.html
(This assumes that there is a file "index.html" on the site.)
9. After adding the site, click on the button "options" next to the site and
then click on "index". The default spidering depth is 2, which is fine for
a test, but see "Indexing options" (section 2) below.
10. "search.php" is the default search page. You can now test the search engine.
Assuming that you uploaded Sphider in the directory "search" at the root of
the domain, the full address would be:
http://www.mydomain.com/search/search.php
11. If the test succeeds, it is recommended to delete the file install.php from
the "admin" directory, so that nobody can (accidentally) recreate the
database.
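For step 3, a rough way to judge whether a page fits in a MySQL TEXT field
(65,535 bytes) is to strip the tags and measure what remains. The sketch
below is not part of Sphider. It uses Python's standard html.parser, and its
tag stripping only approximates what Sphider actually stores. The names
TextExtractor, raw_text_size and fits_in_text_field are made up for this
illustration.

```python
from html.parser import HTMLParser

MYSQL_TEXT_MAX_BYTES = 65535  # capacity of a MySQL TEXT column


class TextExtractor(HTMLParser):
    """Collects the text content of a page, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def raw_text_size(html):
    """Return the UTF-8 byte size of the page text without HTML tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return len("".join(parser.chunks).encode("utf-8"))


def fits_in_text_field(html):
    """True if the stripped text fits in a MySQL TEXT column."""
    return raw_text_size(html) <= MYSQL_TEXT_MAX_BYTES


# Hypothetical sample pages, only for illustration.
small_page = "<html><body><p>Hello</p></body></html>"
large_page = "<html><body><p>" + "x" * 70000 + "</p></body></html>"
print(fits_in_text_field(small_page))
print(fits_in_text_field(large_page))
```

If pages on your site fail this check, changing the field type (for example
to MEDIUMTEXT) before indexing is the safer choice.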
2. Indexing options
===================
Full: Indexing continues until there are no further (permitted) links to
follow.
To depth: Indexes to a given depth, where depth means how many "clicks" away
the page can be from the starting page. Depth 0 means that only the
starting page is indexed, depth 1 indexes the starting page and all
the pages linked from it etc.
Reindex: By checking this checkbox, indexing is forced even if the page
already has been indexed.
Spider can leave domain:
By default, Sphider never leaves a given domain, so that links from
domain.com pointing to domain2.com are not followed. By checking
this option, Sphider can leave the domain; in this case, however, it is
highly advisable to define proper must include / must not include
string lists to prevent the spider from going too far.
Must include / must not include:
See section 5 of this document ("Keeping pages from being indexed")
for an explanation.
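The meaning of "depth" in the "To depth" option can be illustrated with a
small breadth-first walk over a link graph. This is a sketch in Python, not
Sphider's own code; the site map and the function name pages_within_depth
are hypothetical.

```python
from collections import deque


def pages_within_depth(links, start, max_depth=None):
    """Return the pages reachable from start in at most max_depth clicks.

    links maps a page to the pages it links to.
    max_depth=None corresponds to the "Full" option.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: index links to a and b, a links to c, c links to d.
site = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "c.html": ["d.html"],
}

print(pages_within_depth(site, "index.html", 0))  # starting page only
print(pages_within_depth(site, "index.html", 1))  # plus pages linked from it
print(pages_within_depth(site, "index.html"))     # full indexing
```

With this site map, depth 2 reaches c.html but not d.html, which is three
clicks away from the starting page.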
3. Customizing
==============
If you want to change the default behaviour of Sphider, you can do this either
through the admin interface, or by directly editing conf.php in the settings
directory. Note that the web interface can be disabled in the auth.php file.
To change the look of the search page to fit your site, modify or add a template
in the templates directory. It is often enough to modify the search.css file and
header and footer templates (header.html and footer.html). Heavier modifications
can be made by editing the rest of the template files.
The list of file types that are not checked for indexing is given in
admin/ext.txt. The list of common words that are not indexed is given in
include/common.txt.
4. Using the indexer from the command line
==========================================
It is possible to spider webpages from the command line, using the syntax:
php spider.php <options>
where <options> are
-all Reindex everything in the database
-u <url> Set the url to index
-f Set indexing depth to full (unlimited depth)
-d <num> Set indexing depth to <num>
-l Allow spider to leave the initial domain
-r Set spider to reindex a site
-m <string> Set the string(s) that an url must include (use \n as a
delimiter between multiple strings)
-n <string> Set the string(s) that an url must not include (use \n as a
delimiter between multiple strings)
For example, for spidering and indexing http://www.domain.com/test.html to
depth 2, use
php spider.php -u http://www.domain.com/test.html -d 2
If you want to reindex the same url, use
php spider.php -u http://www.domain.com/test.html -r
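When the indexer is started from a script (for example a scheduled job), the
command line can be assembled from the options listed above. The following
Python sketch is one way to do that; the function build_spider_command is
hypothetical. It assumes that "\n" in the -m and -n options means the two
characters backslash and n, which is how the option text reads, but this has
not been verified against spider.php.

```python
import shlex


def build_spider_command(url, depth=None, full=False, leave_domain=False,
                         reindex=False, must_include=(), must_not_include=()):
    """Build the argument list for 'php spider.php' from the documented options."""
    args = ["php", "spider.php", "-u", url]
    if full:
        args.append("-f")                # unlimited depth
    elif depth is not None:
        args += ["-d", str(depth)]       # limited depth
    if leave_domain:
        args.append("-l")
    if reindex:
        args.append("-r")
    # Assumption: multiple strings are joined with a literal backslash + "n".
    if must_include:
        args += ["-m", "\\n".join(must_include)]
    if must_not_include:
        args += ["-n", "\\n".join(must_not_include)]
    return args


cmd = build_spider_command("http://www.domain.com/test.html", depth=2)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell
quoting problems with the string lists.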
5. Keeping pages from being indexed
===================================
* Robots.txt
The most common way to prevent pages from being indexed is using the
robots.txt standard, by either putting a robots.txt file into the root
directory of the server, or adding the necessary meta tags into the page
headers (for more information on how to do this, see the documentation of
the robots exclusion standard).
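To check what a robots.txt file permits before indexing, you can test it
with Python's standard urllib.robotparser module. This sketch is independent
of Sphider; the rules and URLs are examples, and the user agent "*" is used
because the agent name that Sphider sends is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that keeps all spiders out of /forum/.
robots_txt = """\
User-agent: *
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ("http://www.mydomain.com/index.html",
            "http://www.mydomain.com/forum/topic1.html"):
    allowed = parser.can_fetch("*", url)
    print(url, "allowed" if allowed else "blocked")
```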
* Must include / must not include string list
A powerful option Sphider supports is defining a must include / must not
include string list for a site (click on Advanced options in Index screen
for this). Any url containing a string in the 'must not include' list is
ignored. Any url that does not contain any string in the 'must include'
list is likewise ignored.
All strings in the string list should be separated by a newline (enter). For
example, to prevent a forum in your site from being indexed, you might add
www.yoursite.com/forum to the "must not include" list. This means that all
URLs containing the string will be ignored and will not be indexed. Perl-style
regular expressions are also supported instead of literal strings: every
string starting with a '*' is treated as a regular expression, so that
'*/[a]+/' matches any URL with one or more a's in it.
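The two rules can be summarized in a short Python sketch. It is a model of
the behaviour described above, not Sphider's implementation. It assumes that
an empty "must include" list places no restriction on URLs, that matching is
case-sensitive, and that the slashes around a regular expression are
delimiters to be stripped. The function names are hypothetical.

```python
import re


def _matches(url, entry):
    """True if the URL matches one entry of a string list."""
    if entry.startswith("*"):
        pattern = entry[1:]
        # Assumption: strip Perl-style /.../ delimiters before matching.
        if len(pattern) >= 2 and pattern[0] == "/" and pattern[-1] == "/":
            pattern = pattern[1:-1]
        return re.search(pattern, url) is not None
    return entry in url  # literal substring


def url_is_indexed(url, must_include=(), must_not_include=()):
    """Apply the must not include list first, then the must include list."""
    if any(_matches(url, entry) for entry in must_not_include):
        return False
    if must_include and not any(_matches(url, entry) for entry in must_include):
        return False
    return True


blocked = ["www.yoursite.com/forum"]
print(url_is_indexed("http://www.yoursite.com/forum/topic.html",
                     must_not_include=blocked))
print(url_is_indexed("http://www.yoursite.com/index.html",
                     must_not_include=blocked))
```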
* Ignoring links
Sphider respects the rel="nofollow" attribute in <a href..> tags, so for
example the link foo.html in <a href="foo.html" rel="nofollow"> is ignored.
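To see which links on a page would still be followed, the rule can be
modelled with Python's standard html.parser. This is an illustration only;
the class LinkCollector and the function followed_links are hypothetical and
do not reflect how Sphider parses pages internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated values, or be absent.
        rel = (attributes.get("rel") or "").lower().split()
        if href and "nofollow" not in rel:
            self.links.append(href)


def followed_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


page = '<a href="foo.html" rel="nofollow">Foo</a> <a href="bar.html">Bar</a>'
print(followed_links(page))  # prints ['bar.html']
```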
* Ignoring parts of a page
Sphider includes an option to exclude parts of pages from being indexed.
This can, for example, be used to prevent search result flooding when certain
keywords appear in a certain part of most pages (like a header, footer or a
menu). Any part of a page between
<!--sphider_noindex-->
and
<!--/sphider_noindex-->
tags is not indexed; links in it are still followed.
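The effect of the tags can be sketched in Python with a regular expression
that drops the marked regions before the text is indexed. This is a
simplified model, not Sphider's code; the function name indexable_html and
the sample page are hypothetical. To mirror the behaviour described above,
links would be extracted from the original page, not from the stripped copy.

```python
import re

# Non-greedy match, so that several marked regions on one page are
# removed separately; DOTALL lets a region span multiple lines.
NOINDEX_BLOCK = re.compile(
    r"<!--sphider_noindex-->.*?<!--/sphider_noindex-->", re.DOTALL)


def indexable_html(html):
    """Return the page with all sphider_noindex regions replaced by a space."""
    return NOINDEX_BLOCK.sub(" ", html)


page = (
    "<p>Article text</p>"
    "<!--sphider_noindex-->"
    '<ul><li><a href="menu.html">Menu</a></li></ul>'
    "<!--/sphider_noindex-->"
    "<p>More text</p>"
)
print(indexable_html(page))
```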