This repository contain source code for Sinhala song search engine created using Python and Elasticsearch
The important files and directories of the repository is shown below
├── song-corpus : data scraped from the [website](http://sinhalasongbook.com/)
├── songs.json : contain the final song set
├── songs_01.json to songs_010.json : original songs scraped form the website
├── songs_meta_all.json : contain all meta date related to the songs
└── song_link.csv : contain links to the songs url
├── templates : UI related files
├── app.py : Backend of the web app created using Flask
├── data_upload.py : File to upload data to elasticsearch cluster
├── scraper.py : Source code for the data scraper
├── search.py : Search functions used to classify user search phrases and elasticsearch queries
├── queries.txt : Example queries
You can install elasticsearch locally or otherwise and spin up the elasticsearch cluster For more details visit website
Once elasticsearch is install, start elasticsearch cluster on port 9200
git clone https://github.com/iTharindu/SinhalaSongSearch.git
cd Sinhala Songs Search
virtualenv -p python3 envname
source env/bin/activate
pip3 install -r requirements.txt
python app.py
Follow the above steps but instead of python app.py
run scraper.py
Each song contain subset of following data fields
- Title (both in Sinhala and English letters)
- Artist - English
- Artist - Sinhala
- Composer - English
- Composer - Sinhala
- Lyricist - English
- Lyricist - Sinhala
- Genre - English
- Genre - Sinhala
- Number of views
- Guitar keys
- Name of the movie (if the song is based on a movie)
- Lyrics
The process with scraping data from the site, the HTML/XML parsing library BeautifulSoup was used for scraping the web pages. Then the text extracted was passed through the text processing unit. Both simple techniques like replacing complex methods like regex are used for this process. This text processing unit generates cleaned text data which is then passed to the translator to translate to Sinhala, here both translation and transliteration takes place. Then the translated data is sent for post processing and the final data set with an aggregated dataset containing information about fields is generated
For indexing the data and querying the Elasticsearch is used and I have used the standard indexing methods,mapping and the analyzer provided in the Elasticsearch. When a user enters a query the related intent is identified and the search query is related to the intent is executed. The searching can be done related to any field in the index and predefined size is used which the user can override using his search query. Also filtered queries are provided where users can filter the search result.
- Text mining and text preprocessing
- Search queries are processed before intent classification, here spelling errors are corrected and the query is cleaned. Also data extracted is also cleaned and processed before displaying on the web application.
- Intent Classification
- Once the query is added, intent behind the query is found by intent classification. The intent could be simple text search or a select top type, etc. The intent classifier used word tokenization and text vectorization and cosine distance to classify intentens
- Faceted Search
- The search engine supported faceted search related to Genre, Artist, Composer and Lyricist.
- Bilingual support
- The search engine supports queries in both Sinhala and English. User can type queries like top 10 songs or හොඳම ගීත 10 and expect to yield the same result.
- Synonyms support
- The search engine also support synonyms and that can be either in Sinhala or in English, user can type best, popular, good instead of top or හොඳ, ජනප්රිය, ප්රසිද්ධ instead of හොඳම.
- Resistant to simple spelling errors
- Due to the use of vectorization and distance calculation the search engine is resistant to small spelling errors and these are automatically corrected and related search results are generated.