  • 2024_main_document_lvl

  • 2024_main_paragraph_lvl

  • 2023_main_document_lvl

  • 2023_main_paragraph_lvl

  • Convert PDFs to embeddings

    • Paragraph level
    • Document level
  • HNSW / K-means for fast search

  • K-means graphs based on topics

  • Check for similarity within our own database

    • Paragraph level
    • Document level
  • Get the most important ones

  • Get the unique sentences, such as the title and other content (open question: could an LLM work this out?)

  • Search Google using the unique sentences, take the top 3 results, and run the same check on them to get the result

1. Data Input:

  • Input Data: Collect a diverse dataset of academic papers, articles, or textual content from various sources.
  • Format: Ensure the data is in a consistent and machine-readable format, such as plain text or a format compatible with your chosen NLP library.

2. Data Cleaning:

  • Text Cleaning:

    • Remove metadata, formatting, and irrelevant details.
    • Handle special characters, punctuation, and stopwords.
  • Normalization:

    • Convert text to lowercase to ensure uniformity.
  • Tokenization:

    • Tokenize the text into words or subword tokens.
    • Libraries:
      • For Python, you can use NLTK or spaCy for tokenization.
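
The cleaning, normalization, and tokenization steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `clean_and_tokenize` and the tiny `STOPWORDS` set are assumptions made for the example, and NLTK or spaCy would supply fuller stopword lists and better tokenizers.

```python
import re

# Illustrative stopword list (an assumption; NLTK and spaCy ship fuller lists).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def clean_and_tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation and special characters, drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]


print(clean_and_tokenize("The Clustering of Documents: A Survey (2024)!"))
```

Lowercasing before the regex keeps the character class simple; note that this sketch discards non-ASCII letters, which may not be acceptable for multilingual input.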

3. Embedding Generation:

  • Word Level Embeddings:

    • Utilize pre-trained word embeddings like Word2Vec or GloVe.
    • Libraries:
      • For Word2Vec: Gensim library.
      • For GloVe: spaCy or Gensim.
  • Paragraph Level Embeddings:

    • Aggregate word embeddings using techniques like averaging or using Doc2Vec.
    • Libraries:
      • Gensim for Doc2Vec.
  • Document Level Embeddings:

    • Consider using the average of paragraph embeddings or more advanced models.
    • Libraries:
      • spaCy or the Hugging Face Transformers library for more advanced models.
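
The averaging strategy described above (word vectors to paragraph vectors to a document vector) might look like the following sketch. The `WORD_VECTORS` table holds made-up 3-dimensional values purely for illustration; in practice the vectors would come from a pre-trained Word2Vec or GloVe model. The function names are likewise hypothetical.

```python
# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "graph": [1.0, 0.0, 0.0],
    "search": [0.0, 1.0, 0.0],
    "index": [0.0, 1.0, 1.0],
}


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_paragraph(tokens: list[str]) -> list[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return mean_vector(known)


def embed_document(paragraphs: list[list[str]]) -> list[float]:
    """Average the paragraph embeddings to get one document vector."""
    return mean_vector([embed_paragraph(p) for p in paragraphs])


doc = [["graph", "search"], ["index", "unknown"]]
print(embed_document(doc))
```

Plain averaging ignores word order, which is one reason the outline also lists Doc2Vec and transformer models as alternatives.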

4. Pairwise Comparison:

  • Similarity Measures:
    • Calculate cosine similarity (on embedding vectors), Jaccard similarity (on token sets), or other relevant measures.
    • Libraries:
      • scikit-learn for cosine similarity.
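
As a rough illustration of the two measures named above, here is a dependency-free sketch. scikit-learn's `cosine_similarity` would be the usual choice for batches of vectors; the hand-written versions below only show what is being computed, and returning 0.0 for empty or zero-length inputs is a convention chosen for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))
```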

5. Clustering:

  • K-Means Clustering:

    • Partition documents into K clusters.
    • Libraries:
      • scikit-learn for K-Means.
  • Hierarchical Clustering:

    • Build a hierarchy of clusters.
    • Libraries:
      • scipy.cluster.hierarchy for hierarchical clustering.
  • DBSCAN:

    • Density-based clustering.
    • Libraries:
      • scikit-learn for DBSCAN.
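
To make the K-Means step concrete, the sketch below implements Lloyd's algorithm in plain Python. It is a teaching aid under stated assumptions, not a replacement for scikit-learn's `KMeans`: initialisation simply takes the first `k` points (scikit-learn defaults to k-means++), and the loop runs a fixed number of iterations rather than testing for convergence.

```python
def squared_distance(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: list[list[float]], k: int, iterations: int = 20):
    """Lloyd's algorithm; returns (centroids, labels).

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic but is not a robust initialisation.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

In the pipeline, `points` would be the document or paragraph embeddings, and the resulting labels would feed into the scoring step.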

6. Scoring System:

  • Threshold Setting:

    • Establish a threshold for similarity scores to classify documents.
    • Determine the threshold through experimentation.
  • Scoring Logic:

    • Develop a scoring system based on the results of pairwise comparison and clustering.
    • Decide on the scoring weights for each component.
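
One possible shape for the scoring logic is a weighted sum of the pairwise similarity and a bonus for sharing a cluster, followed by a threshold check. Everything numeric here is a placeholder: the weights (0.7 and 0.3), the threshold (0.6), and the function names are assumptions for illustration, and the real values would come out of the experimentation mentioned above.

```python
def combined_score(
    similarity: float,
    same_cluster: bool,
    w_similarity: float = 0.7,  # hypothetical weight
    w_cluster: float = 0.3,     # hypothetical weight
) -> float:
    """Weighted sum of pairwise similarity and a shared-cluster indicator."""
    cluster_term = 1.0 if same_cluster else 0.0
    return w_similarity * similarity + w_cluster * cluster_term


def classify(score: float, threshold: float = 0.6) -> str:
    """Label a pair 'similar' when its score reaches the threshold."""
    return "similar" if score >= threshold else "distinct"


score = combined_score(similarity=0.8, same_cluster=True)
print(score, classify(score))
```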

7. Hybrid Approach:

  • Traditional Models:

    • Use traditional similarity measures for efficiency.
    • Implement efficient algorithms for quick pairwise comparisons.
  • Large Language Models:

    • Fine-tune or use pre-trained models for enhanced context understanding.
    • Use the Hugging Face Transformers library to access pre-trained models.
  • Fingerprinting Concept
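
The outline does not spell out the fingerprinting idea, so the sketch below is one plausible reading rather than the intended design: each document is reduced to a set of hashed word k-grams (shingles), the cheap overlap of those sets acts as a prefilter, and the costly model runs only on pairs that pass. The names `fingerprint`, `overlap`, `hybrid_score`, the shingle size `k=3`, and the `prefilter=0.2` cutoff are all assumptions. `expensive_model` stands in for any callable, such as a transformer-based scorer.

```python
import hashlib
from typing import Callable


def fingerprint(tokens: list[str], k: int = 3) -> set[str]:
    """Set of hashed word k-grams (shingles) for one document."""
    shingles = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    # Truncated SHA-256 digests keep the fingerprint compact and stable.
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in shingles}


def overlap(fp_a: set[str], fp_b: set[str]) -> float:
    """Jaccard overlap of two fingerprints."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


def hybrid_score(
    tokens_a: list[str],
    tokens_b: list[str],
    expensive_model: Callable[[list[str], list[str]], float],
    prefilter: float = 0.2,  # hypothetical cutoff
) -> float:
    """Call the costly model only when the cheap overlap is high enough."""
    cheap = overlap(fingerprint(tokens_a), fingerprint(tokens_b))
    if cheap < prefilter:
        return cheap
    return expensive_model(tokens_a, tokens_b)


a = "the quick brown fox jumps".split()
b = "the quick brown fox sleeps".split()
print(overlap(fingerprint(a), fingerprint(b)))
```

The design choice is to spend the expensive model's time only where a cheap signal already hints at similarity, which matters when comparing every pair in a large collection.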