install.txt

============================================
Sphider - a lightweight search engine in PHP
Version 1.3.10, PDO version

Installation and usage instructions
Ando Saabas 2005-2007
(PDO port by Thiadmer Riemersma, 2012-2016)
============================================

------------
Documentation
------------

   1. Installation
   2. Indexing options
   3. Customizing
   4. Indexing from command-line
   5. Keeping pages from being indexed
      * Robots.txt
      * Must include / must not include string list
      * Ignoring links
      * Ignoring parts of a page


1. Installation
===============
1.  Unpack the files somewhere on your drive. On Microsoft Windows, you can
    create a folder "sphider" in your "My Documents" folder. On Linux, make
    a subdirectory in your home directory.

    Alternatively, you can unpack the files directly on the server, if that is
    convenient to you.

2.  In the server, create a database to hold Sphider data (in MySQL, SQLite or
    another supported format).

    If you have direct access to the server (e.g. with a secure shell login),
    the steps are:
    a) Type: mysql -u <your username> -p
       (enter your password when prompted)
    b) Type: create database sphider_db;
       (of course you can use some other name for database instead of sphider_db)
    c) Type: exit

    When you use DirectAdmin:
    a) Log-on to your DirectAdmin account
    b) Click on "MySQL Management"
    c) In the next page, click on "Create new Database"
    d) Fill in the database name (e.g. "sphider_db"), the user name and the
       password. Note that DirectAdmin may prefix the account name to the
       database name (and/or the user name).

    When using SQLite, make sure that the folder in which the database is
    stored, as well as the file itself, have read/write permissions for the web
    server pseudo-user.

3.  In the "settings" directory, edit the "database.php" file and change the
    definition of the database (in variable $db); See the PDO documentation for
    the syntax.

    For MySQL, constant DATABASE_NAME also needs to be set, and possibly
    TABLE_PREFIX as well.

    For SQLite, constant AUTOINCREMENT has to be defined as an empty string, for
    example:
        define("AUTOINCREMENT", "");

    The full text of each page is stored in the database. By default, the field
    for the full text has the type TEXT. The maximum text size that a field with
    this type can hold is database-dependend: for SQLite it is 1G by default,
    but for MySQL it is 64k. If you have pages larger than that (in raw text,
    after removing all HTML tags), you should change this to MEDIUMTEXT for
    example.
    
4.  In the "admin" directory, edit "auth.php" to change the administrator user
    name and password (default values are "admin" and "admin"). In this file,
    you can also disable the web interface for changing the configuration.

5.  Upload everything to your web server (not needed if you had unpacked all
    files on the server directly).

6.  Open "install.php" script ("admin" directory) in your browser, which will
    create the tables necessary for Sphider to operate.

    So, when your web server is at "www.mydomain.com" and you uploaded Sphider
    in directory "search" at that domain, you need to type the following
    address in your browser:

        http://www.mydomain.com/search/admin/install.php

    The install page just lists a number of names for tables (25 in total) and
    it should end with a line that all is fine. If there is an error message, or
    if the table names do not appear, the installation failed.

7.  Still in the web browser, open admin/admin.php in browser. Again, the full
    address would be:

        http://www.mydomain.com/search/admin/admin.php

    You need to fill in your name and password (the ones in "auth.php"). Note
    that these are both case-sensitive.

8.  Now you can add your site. The site refers to the first page where the
    spidering starts. For technical reasons, this must be a true file; it may
    be a PHP file, but not a redirect. So, continuing the example that your
    site is at "www.mydomain.com", you could add:

        http://www.mydomain.com/index.html

    (This assumes that there is a file "index.html" on the site.)

9.  After adding the site, click on the button "options" next to the site and
    then click on "index". The default spidering depth is 2, which is fine for
    a test, but see "Indexing options" (section 2) below.

10. "search.php" is the default search page. You can how test the search engine.
    Assuming that you uploaded Sphider in the directory "search" at the root of
    the domain, the full address would be:

        http://www.mydomain.com/search/search.php

11. If the test succeeds, it is recommended to delete the file install.php from
    the "admin" directory, so that nobody can (accidentally) recreate the
    database.


2. Indexing options
===================
Full:       Indexing continues until there are no further (permitted) links to
            follow.
To depth:   Indexes to a given depth, where depth means how many "clicks" away
            the page can be from the starting page. Depth 0 means that only the
            starting page is indexed, depth 1 indexes the starting page and all
            the pages linked from it etc.
Reindex:    By checking this checkbox, indexing is forced even if the page
            already has been indexed.
Spider can leave domain :
            By default, Sphider never leaves a given domain, so that links from
            domain.com pointing to domain2.com are not followed. By checking
            this option Sphider can leave the domain, however in this case its
            highly advisable to define proper must include / must not include
            string lists to prevent the spider from going too far.
Must include / must not include:
            See section 5 of this document ("Keeping pages from being indexed")
            for an explanation.


3. Customizing
==============
If you want to change the default behaviour of Sphider, you can do this either
through the admin interface, or by directly editing conf.php in settings
directory. Note that the web interface can be disabled in the auth.php file.

To change the look of the search page to fit your site, modify or add a template
in the templates directory. It is often enough to modify the search.css file and
header and footer templates (header.html and footer.html). Heavier modifications
can be made through editing the rest of template files.

The list of file types that are not checked for indexing are given in
admin/ext.txt. The list of common words that are not indexed are given in
include/common.txt.


4. Using the indexer from commandline
=====================================
It is possible to spider webpages from the command line, using the syntax:

php spider.php <options>

   where <options> are

-all            Reindex everything in the database
-u <url>        Set the url to index
-f              Set indexing depth to full (unlimited depth)
-d <num>        Set indexing depth to <num>
-l              Allow spider to leave the initial domain
-r              Set spider to reindex a site
-m <string>     Set the string(s) that an url must include (use \n as a
                delimiter between multiple strings)
-n <string>     Set the string(s) that an url must not include (use \n as a
                delimiter between multiple strings)

For example, for spidering and indexing http://www.domain.com/test.html to
depth 2, use
    php spider.php -u http://www.domain.com/test.html -d 2

If you want to reindex the same url, use
    php spider.php -u http://www.domain.com/test.html -r


5. Keeping pages from being indexed
===================================
* Robots.txt
    The most common way to prevent pages from being indexed is using the
    robots.txt standard, by either putting a robots.txt file into the root
    directory of the server, or adding the necessary meta tags into the page
    headers (for more information on how to do this, see here).

* Must include / must not include string list
    A powerful option Sphider supports is defining a must include / must not
    include string list for a site (click on Advanced options in Index screen
    for this). Any url containing a string in the 'must not include' list is
    ignored. Any url that does not contain any string in the 'must include'
    list is likewise ignored.
    All strings in the string list should be separated by a newline (enter). For
    example, to prevent a forum in your site from being indexed, you might add
    www.yoursite.com/forum to the "must not include" list. This means that all
    URLs containing the string will be ignored and wont be indexed. Using Perl
    style regular expressions instead of literal strings is also supported.
    Every string starting with a '*' in front is considered as a regular
    expression, so that '*/[a]+/' denotes a string with one or more a's in it.

* Ignoring links
    Sphider respect rel="nofollow" attribute in <a href..> tags, so for example
    the link foo.html in <a href="foo.html" rel="nofollow> is ignored.

* Ignoring parts of a page
    Sphider includes an option to exclude parts of pages from being indexed.
    This can for example be used to prevent search result flooding when certain
    keywords appear on certain part in most pages (like a header, footer or a
    menu). Any part of a page between
        <!--sphider_noindex-->
    and
        <!--/sphider_noindex-->
    tags is not indexed, however links in it are followed.