The web page shown in the figure below is used for topic research. Users can enter keywords or natural-language sentences. The backend searches the provided literature for relevant content and then uses ERNIE to generate a research report.
The download link for the generated report: Report download
ERNIEBot Researcher is an autonomous agent designed to conduct comprehensive online research for a variety of tasks. It compiles detailed, factual, and unbiased Chinese research reports, and it can be customized as needed to focus on specific resources, structured outlines, and lessons learned. Drawing on the recent Plan-and-Solve technique and combining it with the strengths of retrieval-augmented generation (RAG), ERNIEBot Researcher addresses challenges in speed, decision certainty, and reliability of results through multi-agent collaboration and efficient parallel processing.
- Forming objective conclusions through manual research tasks can be time-consuming, sometimes taking weeks to find the correct resources and information.
- Current LLMs (Large Language Models) are trained on past, often outdated information, which increases the risk of hallucination and can make the generated reports largely irrelevant to the research task.
- Reports generated by LLMs generally do not include paragraph-level or sentence-level citations of literature sources, making the generated content difficult to trace and verify.
- On April 4, 2024, ERNIEBot Researcher was released, supporting ERNIEBot and ChatGPT in completing research tasks, as well as supporting OpenAI Embedding and ERNIE-Embedding.
The main idea is to operate "planner" and "execution" agents. The planner generates questions for research, and the execution agents seek the most relevant information for each generated research question. Finally, the planner filters and aggregates all relevant information and creates a research report.
Agents utilize ERNIE-4.0 and ERNIE-LongText to complete research tasks. ERNIE-4.0 is primarily used for decision-making and planning, while ERNIE-LongText is mainly used for writing reports.
- Create domain-specific agents based on research queries or tasks.
- Generate a diverse set of research questions based on the content of the existing knowledge base, which collectively form an objective opinion on any given task.
- For each research question, select information from the knowledge base that is relevant to the given question.
- Filter and aggregate all information sources and generate the final research report.
- Multiple report agents generate reports in parallel while maintaining a certain level of diversity.
- Use chain-of-thought techniques to evaluate and rank multiple reports, overcoming pseudo-randomness, and selecting the optimal report.
- Revise and refine the report using a reflection mechanism.
- Verify facts using retrieval-augmented techniques and chain of verification.
- Enhance the overall readability of the report using a polishing mechanism, integrating more detailed descriptions.
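The planner/execution workflow listed above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: every function name here (`plan_questions`, `retrieve`, `write_report`, `rank_reports`, `run_research`) is hypothetical, the LLM calls are replaced by simple templates and keyword matching, and the ranking step is a placeholder for the chain-of-thought evaluation described above. Reflection, fact verification, and polishing are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def plan_questions(task: str) -> List[str]:
    # Planner: a real system would ask an LLM; here the questions are templated.
    return [
        f"What is the background of {task}?",
        f"What are the limitations of {task}?",
    ]


def _terms(text: str) -> set:
    # Lowercased words longer than three characters, with punctuation stripped.
    words = (w.lower().strip("?.,") for w in text.split())
    return {w for w in words if len(w) > 3}


def retrieve(question: str, knowledge_base: Dict[str, str]) -> List[str]:
    # Execution agent: keep documents that share at least one term with the question.
    wanted = _terms(question)
    return sorted(name for name, text in knowledge_base.items() if wanted & _terms(text))


def write_report(task: str, agent_id: int, knowledge_base: Dict[str, str]) -> Dict:
    # Report agent: collect sources per question so each section stays traceable.
    sections = [
        {"question": q, "sources": retrieve(q, knowledge_base)}
        for q in plan_questions(task)
    ]
    return {"agent": agent_id, "task": task, "sections": sections}


def rank_reports(reports: List[Dict]) -> Dict:
    # Placeholder ranking (assumption): prefer the report that cites the most sources.
    return max(reports, key=lambda r: sum(len(s["sources"]) for s in r["sections"]))


def run_research(task: str, knowledge_base: Dict[str, str], num_research_agent: int = 2) -> Dict:
    # Several report agents run in parallel; the best report is then selected.
    with ThreadPoolExecutor(max_workers=num_research_agent) as pool:
        futures = [
            pool.submit(write_report, task, i, knowledge_base)
            for i in range(num_research_agent)
        ]
        reports = [f.result() for f in futures]
    return rank_reports(reports)
```

Calling `run_research("agents", {"a.md": "Agents have limitations in planning."})` returns one report dictionary whose sections each list the matching source files, which is the property that makes the final text traceable to the knowledge base.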
Note
- Generating a report takes more than 10 minutes. The more research agents are configured, the longer it takes and the more tokens are consumed.
- The quality of the generated report depends on the quality of the documents supplied to the application. It is suitable for sources such as web pages, journals, and corporate office documents, but not for documents that contain little text or a large amount of irrelevant information.
Step 1: Download the project source code
git clone https://github.com/PaddlePaddle/ERNIE-SDK.git
cd ernie-agent/applications/erniebot_researcher
Step 2: Install dependencies
pip install -r requirements.txt
If the above command fails, please run the following command:
conda create -n researcher39 -y python=3.9 && conda activate researcher39
pip install -r requirements.txt
Install ernie-agent from source code:
cd ernie-agent
pip install -e .
Step 3: Download Chinese fonts
wget https://paddlenlp.bj.bcebos.com/pipelines/fonts/SimSun.ttf
Step 4: Build the document index
Two embedding types are supported: azure openai_embedding and ernie_embedding. To use ernie_embedding, register and log in to an account on the AI Studio Galaxy Community, obtain an access token from the Access Token page on AI Studio, and then set the following environment variables:
export EB_AGENT_ACCESS_TOKEN=<aistudio-access-token>
export AISTUDIO_ACCESS_TOKEN=<aistudio-access-token>
export EB_AGENT_LOGGING_LEVEL=INFO
To set up Azure OpenAI embedding, you need to configure the relevant OpenAI environment variables.
export AZURE_OPENAI_ENDPOINT="<your azure openai endpoint>"
export AZURE_OPENAI_API_KEY="<your azure openai api key>"
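Before building the index, it can help to confirm that the variables for the chosen embedding type are actually set. The sketch below is an illustration only: the helper `missing_env` is hypothetical, and the grouping of variables per embedding type is an assumption inferred from the instructions above rather than something the project checks in this way.

```python
import os
from typing import List, Mapping, Optional

# Assumed grouping, based on the export commands above.
REQUIRED_ENV = {
    "ernie_embedding": ["EB_AGENT_ACCESS_TOKEN", "AISTUDIO_ACCESS_TOKEN"],
    "openai_embedding": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
}


def missing_env(embedding_type: str, environ: Optional[Mapping[str, str]] = None) -> List[str]:
    # Return the names of required variables that are unset or empty.
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[embedding_type] if not env.get(name)]


if __name__ == "__main__":
    for kind in REQUIRED_ENV:
        print(kind, "missing:", missing_env(kind))
```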
We support file formats such as docx, pdf, and txt. Users can place these files in the same folder and then run the following command to create an index. Subsequent reports will be generated based on these files.
For convenience in testing, we provide sample data:
wget https://paddlenlp.bj.bcebos.com/pipelines/erniebot_researcher_example.tar.gz
tar xvf erniebot_researcher_example.tar.gz
URL Data: If users have URLs corresponding to their files, they can provide a txt file containing these URLs. In the txt file, each line should store the URL link and the corresponding file path, for example:
https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md
If the user does not provide a URL file, the default file path will be used as the URL link.
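The URL file format and its fallback rule can be illustrated with a short sketch. It assumes the URL and the file path on each line are separated by whitespace, as in the example above; the helper names `load_url_mapping` and `url_for` are hypothetical and are not part of the project's code.

```python
from typing import Dict, Iterable, Optional


def load_url_mapping(lines: Iterable[str]) -> Dict[str, str]:
    # Build a {file_path: url} mapping from lines of the form "<url> <file_path>".
    mapping: Dict[str, str] = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            # Skip blank or malformed lines instead of failing.
            continue
        url, file_path = parts
        mapping[file_path] = url
    return mapping


def url_for(file_path: str, mapping: Optional[Dict[str, str]] = None) -> str:
    # Fall back to the file path itself when no URL is known for it.
    return (mapping or {}).get(file_path, file_path)
```

With a file on disk, the mapping could be built with `load_url_mapping(open(path, encoding="utf-8"))`, since a file object iterates over its lines.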
Abstract Data: Users can use the path_abstract parameter to provide the storage path of the abstracts corresponding to their files. The abstracts need to be stored in a JSON file. The JSON file contains multiple dictionaries, and each dictionary has three key-value pairs.
- page_content: str, the file abstract.
- url: str, the file URL link.
- name: str, the file name.
For example,
[{"page_content":"文件摘要","url":"https://zhuanlan.zhihu.com/p/659457816","name":"Ai_Agent的起源"},
...]
If the user does not provide an abstract path, there is no need to change the default value of path_abstract. We will use ernie-4.0 to automatically generate the abstracts, and the generated abstracts will be stored in abstract.json.
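A hand-written abstract file is easy to get wrong (for example, an unquoted string value), so it may be worth validating it before running preprocessing. The sketch below is a hypothetical helper, not part of the project: it only checks the structure described above, namely a JSON list of dictionaries with the string fields page_content, url, and name.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"page_content", "url", "name"}


def validate_abstracts(raw: str) -> List[Dict[str, str]]:
    # Parse the JSON text; invalid JSON raises json.JSONDecodeError (a ValueError).
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of dictionaries")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"entry {i} is not a dictionary")
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        for key in REQUIRED_KEYS:
            if not isinstance(record[key], str):
                raise ValueError(f"entry {i}: '{key}' must be a string")
    return records
```

To check a file, pass its contents, for example `validate_abstracts(open("abstract.json", encoding="utf-8").read())`.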
Next, run:
python ./tools/preprocessing.py \
--index_name_full_text <the index name of your full text> \
--index_name_abstract <the index name of your abstract text> \
--path_full_text <the folder path of your full text> \
--url_path <the path of your url text> \
--path_abstract <the json path of your abstract text>
Step 5: Run
python demo.py --num_research_agent 1 \
--index_name_full_text <your full text> \
--index_name_abstract <your abstract text>
- index_name_full_text: Path to the full-text knowledge base index
- index_name_abstract: Path to the abstract knowledge base index
- index_name_citation: Path to the citation index
- num_research_agent: Number of agents generating the report
- iterations: Number of reflection iterations
- chatbot: Type of LLM, currently supports erniebot and chatgpt
- report_type: Type of report, currently supports research_report
- embedding_type: Type of embedding used, currently supports ernie_embedding and openai_embedding (azure)
- save_path: Path to save the report
- server_name: IP address of the web UI
- server_port: Port number of the web UI
- log_path: Path to save the logs
- use_ui: Whether to use the web UI
- use_reflection: Whether to use the reflection process
- fact_checking: Whether to use the fact-checking process
- framework: Underlying framework, currently supports langchain
[1] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, Ee-Peng Lim: Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. ACL (1) 2023: 2609-2634
[2] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, Zhaochun Ren: Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. EMNLP 2023: 14918-14937
We learned from the excellent framework design of Assaf Elovic's GPT Researcher, and we would like to thank the authors of GPT Researcher and their open source community.


