{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **RAG Using Weaviate**\n", "\n", "## **Roman Empire Question Answering System**\n", "### **Overview**\n", "This project is a question answering system for questions about the Roman Empire. It uses Weaviate for document storage and similarity search, OpenAI for language models and embeddings, and LangChain for building the question answering pipeline.\n", "\n", "**Setup Instructions:**\n", "- Install all the required libraries and dependencies.\n", "- Define your Weaviate and OpenAI credentials.\n", "- Load the documents using `DirectoryLoader` (or `PyPDFLoader`).\n", "- Split the documents into small chunks using `RecursiveCharacterTextSplitter`.\n", "- Load the embedding model (`OpenAIEmbeddings`, with `HuggingFaceEmbeddings` as an alternative).\n", "- Use the selected embedding model to create the embeddings.\n", "- Create a vector store using Weaviate.\n", "- Store the data in the Weaviate vector store.\n", "- Finally, create a question answering pipeline to interact with the data." 
], "metadata": { "id": "mZLVPpSPDMD8" } }, { "cell_type": "markdown", "source": [ "### Creating Vectorstore" ], "metadata": { "id": "xvm6BXxUiK2F" } }, { "cell_type": "code", "source": [ "# install dependencies\n", "!pip install weaviate-client\n", "!pip install langchain\n", "!pip install openai\n", "!pip install pypdf\n", "!pip install -U langchain-community\n", "!pip install sentence_transformers\n", "!pip install unstructured\n", "!pip install \"unstructured[pdf]\"" ], "metadata": { "id": "cg6XDu3VYNW3", "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "outputId": "3f6b845a-adf2-4f7b-b4ae-432daa1e3ed4" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# define environment variables\n", "OPENAI_API_KEY = \"Your_OpenAI_Key\"\n", "WEAVIATE_API_KEY = \"Your_Weaviate_Key\"\n", "WEAVIATE_CLUSTER = \"Your_Weaviate_Cluster\"" ], "metadata": { "id": "L7_7b6QfG-D9" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# load pdf\n", "from langchain.document_loaders import DirectoryLoader\n", "loader = DirectoryLoader(\".\",glob = \"**/*.pdf\")\n", "documents = loader.load()\n", "\n", "# OR Following method\n", "\n", "# from langchain.document_loaders import PyPDFLoader\n", "# loader = PyPDFLoader(\"/content/RomanEmpire.pdf\")\n", "# documents = loader.load()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "l_vcYSPUKFxr", "outputId": "0ae27db9-99f7-4449-abd8-d656dec7f9ff" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n", "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /root/nltk_data...\n", "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n" ] } ] }, { "cell_type": "code", "source": [ "len(documents)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": 
"86KbBQvUPAJ2", "outputId": "693ee80e-7f8f-4bcd-c34e-0c294ef01ddd" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1" ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "code", "source": [ "# split document content\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)\n", "text = text_splitter.split_documents(documents)" ], "metadata": { "id": "lIJTH0BwKWy5" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "len(text)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qJAgT58UO-M_", "outputId": "813ddcc7-6e59-4994-b23f-e08db22dc4a6" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "145" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "Prefer a top-ranked embedding model from the Hugging Face MTEB embedding leaderboard: https://huggingface.co/spaces/mteb/leaderboard" ], "metadata": { "id": "2itMqkMTOeFX" } }, { "cell_type": "code", "source": [ "# load embedding model\n", "from langchain.embeddings.openai import OpenAIEmbeddings\n", "embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n", "\n", "# Alternatively, use a Hugging Face embedding model\n", "\n", "# from langchain.embeddings import HuggingFaceEmbeddings\n", "# embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-small-en-v1.5\", encode_kwargs = {\"normalize_embeddings\": True})" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "CcQZcRVoMZQp", "outputId": "7489577a-4dde-4d29-82de-8d0ba12fc4f8" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `OpenAIEmbeddings` was deprecated in LangChain 0.0.9 and will be 
removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.\n", " warn_deprecated(\n" ] } ] }, { "cell_type": "code", "source": [ "# # check embeddings\n", "# check = embeddings.embed_query(\"Trial sentence for check embeddings\")\n", "# check[0:10]" ], "metadata": { "id": "3sX5TbXjTrEw" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "embeddings" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "G_IiyVMAPn-j", "outputId": "86ed90e5-bc8c-49f7-b46a-d18194adbe12" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "HuggingFaceEmbeddings(client=SentenceTransformer(\n", " (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel \n", " (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})\n", " (2): Normalize()\n", "), model_name='BAAI/bge-small-en-v1.5', cache_folder=None, model_kwargs={}, encode_kwargs={'normalize_embeddings': True}, multi_process=False, show_progress=False)" ] }, "metadata": {}, "execution_count": 26 } ] }, { "cell_type": "code", "source": [ "!pip install -U langchain-weaviate" ], "metadata": { "id": "bHJgKgMHTWoU" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# create vectorstore using weaviate\n", "import weaviate\n", "from langchain.vectorstores import Weaviate\n", "\n", "#Connect to weaviate Cluster\n", "auth_config = weaviate.auth.AuthApiKey(api_key = WEAVIATE_API_KEY)\n", "WEAVIATE_URL = WEAVIATE_CLUSTER\n", "\n", "client = weaviate.Client(\n", 
"    url = WEAVIATE_URL,\n", "    additional_headers = {\"X-OpenAI-Api-key\": OPENAI_API_KEY},\n", "    auth_client_secret = auth_config,\n", "    startup_period = 10\n", ")" ], "metadata": { "id": "IxQOVNUgPwXu" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# define input structure\n", "# start from a clean schema (this deletes ALL existing classes and their data)\n", "client.schema.delete_all()\n", "client.schema.get()\n", "schema = {\n", "    \"classes\": [\n", "        {\n", "            \"class\": \"RomanEmpire\",\n", "            \"description\": \"Documents for Roman Empire\",\n", "            \"vectorizer\": \"text2vec-openai\",\n", "            \"moduleConfig\": {\"text2vec-openai\": {\"model\": \"ada\", \"type\": \"text\"}},\n", "            \"properties\": [\n", "                {\n", "                    \"dataType\": [\"text\"],\n", "                    \"description\": \"The content of the paragraph\",\n", "                    \"moduleConfig\": {\n", "                        \"text2vec-openai\": {\n", "                            \"skip\": False,\n", "                            \"vectorizePropertyName\": False,\n", "                        }\n", "                    },\n", "                    \"name\": \"content\",\n", "                },\n", "            ],\n", "        },\n", "    ]\n", "}\n", "\n", "client.schema.create(schema)\n", "vectorstore = Weaviate(client, \"RomanEmpire\", \"content\", attributes=[\"source\"])" ], "metadata": { "id": "BwFyjfSEhyDR" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# load text into the vectorstore\n", "text_meta_pair = [(doc.page_content, doc.metadata) for doc in text]\n", "texts, meta = list(zip(*text_meta_pair))\n", "vectorstore.add_texts(texts, meta)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "EEZxLBYhimeX", "outputId": "2c7df17c-1c3f-4236-ded2-8869cd245af8" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['d8addbe6-87c2-450f-8ab5-cf4bce43f91d',\n", " 'b778f89f-1d63-4dde-b81d-bbecbf50f938',\n", " '5d1054db-3444-49da-b0ed-a1152013c1b7',\n", " 'e288b636-b2d1-42fb-9893-42aef886657c',\n", " '3750701e-58bb-44a0-aa3f-fd7217847e28',\n", " '6477e409-084c-4808-8144-1df91fa9ce65',\n", " '83dd61dd-151c-428b-b61c-8d20b1548014',\n", 
'db7f2c51-d553-497a-b6fc-ab6fe8339f24',\n", " '52cf52e8-7563-461d-991a-91b6c7c5909b',\n", " '08c82c58-eedc-4d78-b5d9-74bce2078cf2',\n", " 'b48981ca-70e3-4a17-804b-eebc77383080',\n", " 'ca8f54e3-5d0c-431d-b491-8a65a7cb959d',\n", " 'bdeb944e-a35e-4f73-8932-78b8dac1e081',\n", " '9a9d1826-3433-4704-a0a9-25b067a1da3c',\n", " 'e147cb38-c299-4625-ad6c-fec0ecc48aa0',\n", " '435c05e4-330f-4e2f-905b-db1798b90d1f',\n", " '4b033538-91f0-4034-9b6b-0b148268b6c0',\n", " 'a7bf1677-6026-401e-8e12-1bf2cce201b7',\n", " '5af18673-0087-4640-b7c6-8a633ff9cae3',\n", " '7a2bfc14-cccc-4a5b-bb00-2d2f27a36096',\n", " '35825bef-0d46-40a5-90c9-ba711d2c787c',\n", " 'de9db13f-f709-424b-945d-b1b556bcdf20',\n", " 'e86092c5-5c53-481f-a9f6-b33e489a68f6',\n", " 'e2416c74-7fac-4730-afdb-34fb6222ae2f',\n", " '8e43e712-63a9-4dd4-9dab-7158be4d38e3',\n", " '0ff817f6-2b92-483e-82fb-585fc7d10279',\n", " '7c1f99be-d49b-49bb-bed5-466a26c48a46',\n", " '40c0637c-8da6-4769-b2f2-093960e55d34',\n", " 'b62512fa-12d4-4773-aad2-de608ab5b424',\n", " 'ef9ca77e-a9d2-4f49-827d-8acdf8ba0a77',\n", " '5138fa0c-eaaa-4c01-a6df-310e9e36001a',\n", " 'e8e28b86-0f02-4fe0-bbff-32474c82639c',\n", " 'cebf98e2-486f-419c-966a-f94b30a6ec96',\n", " '034601eb-5015-4373-8bb7-6c211473b7dc',\n", " '597edbb0-a6aa-4b5c-86c4-aa6a2b158882',\n", " '5fffd143-0fea-432d-8c3d-23c812244caf',\n", " '92b057ce-1b50-49ab-96f5-5cb0e35b4f21',\n", " '13f82083-9e49-4c82-9782-7f91287966ca',\n", " 'bc018292-189c-40f3-b0b5-5df81343b420',\n", " '2f6a6f8a-c8b2-4ec8-8091-c267fc71f359',\n", " '59b7044c-9656-4770-af85-6c4394b4432d',\n", " '363e25a2-0ed3-4e13-bcb0-0b72b6656397',\n", " 'e788f941-4c03-4b4d-9726-70960a6de4af',\n", " '29ec59b9-140d-4b25-b95f-b0294a65452d',\n", " 'bcbf06e0-66ca-401e-b189-0dccc4917780',\n", " 'f5b3d362-c028-455c-a680-d62bc30914da',\n", " 'e5dcacb7-e873-42ca-857c-11ad2da2302f',\n", " 'b783e65c-d77b-4172-936b-c3aa462494cf',\n", " '10c8ca8f-09e9-4201-bb6a-2a6bb6b8b695',\n", " 'dd6cd501-66b0-4745-bbf0-1b0ee47dc73b',\n", " 
'e3c17dd5-6af2-4350-8ef3-c36e262100ab',\n", " '0d745d0d-b126-4e77-a20a-2f7ce0738f50',\n", " '090db073-e52b-4d14-83a3-2264516e7422',\n", " '6d8ca3ee-34b6-4498-8cbb-2f8e834f62db',\n", " '2fdaf0e5-6c64-4ffa-b40c-4b45f014e8fd',\n", " '29ecec3b-04c1-44c3-9c2a-ee3e2f686ad7',\n", " '6bf40ee5-9580-493b-a71c-b0a9dde2ed70',\n", " '182fefb9-dcf7-4fe6-9a85-24fbf2b4beb5',\n", " '2f1725a1-bb97-437f-907c-fa3051dbe669',\n", " '43293184-ac35-4c8c-b578-ec1e77185a1e',\n", " '16fb0541-502d-4a07-b89e-2e816848cfb1',\n", " 'b84f6235-5b87-4707-b3d7-bb3d1b48f669',\n", " '43ccd433-d846-4472-b2e8-613cac141eba',\n", " '03e9c5a0-ee94-47f8-b915-bd42fb69096b',\n", " '15c4265b-a5cf-4ba3-ba4f-ba6e134f3c99',\n", " '60ec49ae-12c1-4602-855b-7825b4559840',\n", " '57eed7e8-af4c-42da-83ef-ee4add39ef5e',\n", " '34a58e08-e259-4849-8d41-1fc572b48d1e',\n", " '0689ddfe-7bd8-407c-85b8-4a0b58f2ac49',\n", " '71bd60e6-c01a-4796-9463-e6a140792935',\n", " 'a1e8879a-4bfa-4375-956d-55cfee6a0961',\n", " 'e0635016-2dfa-49e7-8013-048ac8f460ea',\n", " 'd02a48ef-b242-4e43-b6a3-803d857e63e4',\n", " '9c1df575-9cc4-43fc-8de7-d0c1d98e499c',\n", " '60e8babb-b004-4963-bd0a-f362c3f28430',\n", " '0f1a75e8-c6d8-4210-98a1-ae5fd0010902',\n", " 'b32e91e5-eb65-40fb-88d1-05b90324a952',\n", " '8802de47-8d60-4288-9679-2243cc0d59c2',\n", " '9230cfba-3d4d-45b6-bb12-6c299b6e3784',\n", " 'dddf10ac-f34b-44ff-9827-b979a7b488cc',\n", " '0f4ec260-e808-49c9-ada0-f89e9ddafe0e',\n", " 'dffc1a39-b955-41d2-ae94-1e98cb5c7e0a',\n", " 'f674beea-0374-4e74-ba2b-984a52f19262',\n", " '0981f5b8-d33c-471e-9143-2ced895d2b1f',\n", " 'fbecee1f-ec14-41c0-9f8b-28d88385eb11',\n", " 'a06562f7-5afe-4e43-aede-d0138dec2da3',\n", " 'bcbd29c4-bad2-47b8-9b5b-08223a923c07',\n", " 'd66ba276-8286-4ee2-9cec-7908a63e08bd',\n", " '7ac1122c-310a-4cdc-9f84-44205b53a123',\n", " 'b8a5652e-d1ff-4b31-a378-fdd8b093457f',\n", " '9b45118a-f827-47da-b1f6-28aeabb51de8',\n", " 'e4268973-95c3-45b8-9696-58ad20b7eb0d',\n", " 'c162c174-6be7-43a9-86e5-54e262b603dd',\n", " 
'6b0e4edd-ec3f-46ca-90dd-dcc215276e2f',\n", " '9932b0cb-01ca-4e5a-893e-e4f4c603adf3',\n", " '4d3fc7d8-ce5e-4b1d-9713-47c456c3fd00',\n", " 'ef278108-5854-464b-9268-262838a6ae42',\n", " '6be41c75-c68b-49b7-a459-1d67cff13dfd',\n", " '2e3f9970-8782-4727-aaad-930b5f7a4dc0',\n", " 'f5f7b034-161c-4c2d-9f54-31c08ca0c18c',\n", " 'bb93ff74-9cca-4cff-a8fe-eb87f8eaff6a',\n", " '676f09ea-94bf-40a1-a2aa-34e229c229f3',\n", " '9a52013f-c51b-4944-b4b8-fe44e6a904fb',\n", " 'a50a756f-5157-4662-bfce-5e17d4a4dae5',\n", " '70446e6c-f095-456d-9b66-933f698a30fb',\n", " '75398fe1-1be3-43b9-9a03-173ec8dad48c',\n", " '2f723688-eecf-4cea-badb-c66e07872c9d',\n", " '47758d3c-4328-45fa-a74c-04796b53dc66',\n", " '1c38193c-cea5-4b38-b475-8ff769c6ef18',\n", " '5eea9511-a911-4a27-b651-8f6fffa0b1c5',\n", " 'd9fb78ad-3645-445a-81ec-f89751bf2bb6',\n", " '4bf596a7-74f0-4666-a75b-543e442e646d',\n", " 'a183b788-5934-41cb-8413-85f77adf2376',\n", " '4ee30bcf-ea40-4ef4-9a4d-b9562734c3ce',\n", " '443d46c2-f52d-43b4-8c36-4790d2ba48b3',\n", " 'b98d86fe-1e26-43d2-aa1e-8bb5288a9a84',\n", " 'd42b41c5-ec4e-4610-8e0d-9f067a7049b1',\n", " '5f9c4026-446a-4d4f-8274-bafb875f8cf7',\n", " '65082cd5-6cda-481a-af8a-ed7dc301e74e',\n", " '97e24cb9-618b-4816-9868-b793db0458be',\n", " '16708fde-e915-413a-9b8d-9a652056de4e',\n", " 'f2d91e12-4ae1-490f-8d01-b42ddc2169fe',\n", " 'd538dc42-f265-47d3-98ba-8fd0b08538d6',\n", " '278807b5-b162-4189-88da-d2a6c94de714',\n", " 'b17e6c7d-0e56-4785-b409-7efd542c219d',\n", " 'f68c6abf-a4ee-4ec5-a8bd-c5eebcd5af90',\n", " '6ad65aa1-2d83-4d78-b159-e596b9f2e906',\n", " '600cf2b7-b23b-42cb-84e7-815721ce7cb4',\n", " 'f2648c57-4820-4cd1-85b1-c0e16689e190',\n", " 'b5c4b9ab-0a24-499a-924b-e290cb06defa',\n", " 'e02ff7ba-4940-4b18-8809-7ffd395b5fc6',\n", " 'b48819d1-39c4-49fb-a3e8-37a56752c120',\n", " '4a46237b-7ee8-4603-b7d2-b80aeb73d3d9',\n", " 'c4c5611b-b980-4ad1-a068-dc6b45265920',\n", " '000ea3e7-29ce-401d-921c-ec117d126459',\n", " 'a950086e-c3b7-46f6-a94a-03019764d658',\n", " 
'4b79ed51-9484-482c-aee9-fcad3d0b94c6',\n", " 'd5e1caaf-392d-4d24-bd07-f43bcf2e3e17',\n", " '0daf8e09-3fcc-490f-b44d-36470a128f6f',\n", " '59059d09-0430-4f8a-951b-47f342e4fa45',\n", " '70efb192-c2ae-4423-9d08-16fe182ad961',\n", " '63dc9a1b-78e3-4d8b-9148-4567c000265d',\n", " 'de3d0c88-13ba-41ab-b983-d7a473080308',\n", " 'b9166c8f-82ed-4a1a-8dc3-9c7b96bccd19',\n", " '58369474-0691-4714-9896-83697d870331']" ] }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "code", "source": [ "# # load the vectorstore and check similarity\n", "# query = \"tell me interesting facts about Roman Empire\"\n", "# results = vectorstore.similarity_search(query) # k=5" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "IdBYcn3SjY1l", "outputId": "b1f75039-465a-47d8-827c-74b388387d80" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[Document(page_content='Main articles: Demography of the Roman Empire and Borders of the Roman Empire\\n\\nFurther information: Classical demography\\n\\nThe Roman Empire was one of the largest in history, with contiguous territories throughout Europe, North Africa, and the Middle East. The Latin phrase imperium sine fine (\"empire without end\") expressed the ideology that neither time nor space limited the Empire. In Virgil\\'s Aeneid, limitless empire is said to be granted to the Romans by Jupiter. This claim of universal dominion was renewed when the Empire came under Christian rule in the 4th century. 
In addition to annexing large regions, the Romans directly altered their geography, for example cutting down entire forests.', metadata={'source': 'RomanEmpire.pdf'}),\n", " Document(page_content='Most of the cultural appurtenances popularly associated with imperial culture---public cult and its games and civic banquets, competitions for artists, speakers, and athletes, as well as the funding of the great majority of public buildings and public display of art---were financed by private individuals, whose expenditures in this regard helped to justify their economic power and legal and provincial privileges. In the city of Rome, most people lived in multistory apartment buildings (insulae) that were often squalid firetraps. Public facilities---such as baths (thermae), toilets with running\\n\\nwater (latrinae), basins or elaborate fountains _(nymphea)_ delivering fresh water, and large- scale entertainments such as chariot races and gladiator combat---were aimed primarily at the common people. Similar facilities were constructed in cities throughout the Empire, and some of the best-preserved Roman structures are in Spain, southern France, and northern Africa.', metadata={'source': 'RomanEmpire.pdf'}),\n", " Document(page_content='Like gladiators, entertainers were legally infames, technically free but little better than slaves. \"Stars\", however, could enjoy considerable wealth and celebrity, and mingled socially and often sexually with the elite. Performers supported each other by forming guilds, and several memorials for theatre members survive. Theatre and dance were often condemned by Christian polemicists in the later Empire.\\n\\nEstimates of the average literacy rate range from 5 to over 30%. The Roman obsession with documents and inscriptions indicates the value placed on the written word. Laws and edicts were posted as well as read out. Illiterate Roman subjects could have a government scribe _(scriba)* read or write their official documents for them. 
The military produced extensive written records. The Babylonian Talmud declared \"if all seas were ink, all reeds were pen, all skies parchment, and all men scribes, they would be unable to set down the full scope of the Roman government\\'s concerns.\"', metadata={'source': 'RomanEmpire.pdf'}),\n", " Document(page_content=\"first time\\n\\nThe Empire reached its largest expanse under Trajan, encompassing 5 million square kilometres. The traditional population estimate of [55--60 million] inhabitants accounted for between one-sixth and one-fourth of the world's total population and made it the most populous unified political entity in the West until the mid-19th century. Recent demographic studies have argued for a population peak from [70 million] to more than [100 million]. Each of the three largest cities in the Empire -- Rome, Alexandria, and Antioch -- was almost twice the size of any European city at the beginning of the 17th century.\\n\\nAs the historian Christopher Kelly described it:\", metadata={'source': 'RomanEmpire.pdf'})]" ] }, "metadata": {}, "execution_count": 27 } ] }, { "cell_type": "markdown", "source": [ "### Retriever Code" ], "metadata": { "id": "gz3S_Fp-iDGo" } }, { "cell_type": "code", "source": [ "# create retriever\n", "retriever = vectorstore.as_retriever()" ], "metadata": { "id": "nTYyIwlBqjJb" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# define the prompt template\n", "from langchain_core.prompts import ChatPromptTemplate\n", "\n", "template = \"\"\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.\n", "If the answer is not in the context, just say that you don't know. 
Use five sentences maximum and keep the answer concise.\n", "Question: {question}\n", "Context: {context}\n", "Answer:\n", "\"\"\"\n", "prompt = ChatPromptTemplate.from_template(template)" ], "metadata": { "id": "di2Hs7GEqoxm" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# load llm for chain\n", "from langchain_community.chat_models import ChatOpenAI\n", "llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0, openai_api_key = OPENAI_API_KEY)" ], "metadata": { "id": "phhsoITkqtuY" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# create a chain with a retriever, prompt, and llm\n", "from langchain_core.output_parsers import StrOutputParser\n", "from langchain_core.runnables import RunnablePassthrough\n", "\n", "rag_chain = (\n", " {\"context\": retriever, \"question\": RunnablePassthrough()}\n", " | prompt\n", " | llm\n", " | StrOutputParser()\n", ")" ], "metadata": { "id": "IGbeSUOSq0fI" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Asking question from your knowledge base\n", "rag_chain.invoke(\"tell me interesting facts about Roman Empire\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 70 }, "id": "QFDBntQireKp", "outputId": "d10364b2-5d79-4a0e-d3fc-52dab17f58e2" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'The Roman Empire was one of the largest in history, spanning across Europe, North Africa, and the Middle East. The ideology of the empire was expressed through the Latin phrase imperium sine fine, meaning \"empire without end.\" The Romans directly altered their geography by annexing large regions and cutting down entire forests. The Empire reached its largest expanse under Trajan, covering 5 million square kilometers. 
Estimates of the average literacy rate in the Roman Empire range from 5 to over 30%.'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 54 } ] }, { "cell_type": "code", "source": [ "# Asking question which is not in the knowledge base\n", "rag_chain.invoke(\"what is the capital of United States?\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "RcKZB3qHrfk2", "outputId": "aa513b19-65e3-45b7-df08-678b53516ac3" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "\"I don't know.\"" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 55 } ] } ] }