Datasources

Private project - Datasources menu

Datasources play a pivotal role in broadening and enriching the functionalities of ELITEA and AI technologies. They enable the extension of LLMs by integrating user-specific or project-specific data, which is not initially part of the model’s training set, thereby enhancing the LLM's context with tailored information.
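
Conceptually, this is retrieval-augmented generation (RAG): your documents are embedded into vectors at indexing time, the pieces most relevant to a query are retrieved, and those pieces are added to the LLM's context. A minimal sketch of the idea, where embed and ask_llm are hypothetical stand-ins for an embedding model and a chat model (ELITEA performs these steps internally):

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def answer(question, documents, embed, ask_llm, top_k=3):
    # 1. Embed every document once (done when the dataset is indexed).
    doc_vectors = [(doc, embed(doc)) for doc in documents]
    # 2. Embed the question and retrieve the most similar documents.
    q_vec = embed(question)
    ranked = sorted(doc_vectors, key=lambda dv: cosine(q_vec, dv[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    # 3. Let the LLM answer using the retrieved context.
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}")
```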

Datasources-Menu_Private

Creating a Datasource

To set up a new datasource and augment your model's capabilities:

  1. Click the + Datasource button located at the top right corner.
  2. Fill out the Name and Description fields.
  3. Choose an Embedding model from the dropdown list provided.
  4. Select the desired Storage type from another dropdown menu.
  5. Optionally, add tags by typing a tag name or selecting from pre-existing tags in the Tags input box.
  6. Click Save to finalize the creation.

Datasources-Create_New_Datasource

Exploring Datasources

Exploring a datasource is simple: click the card or the name of a datasource to open its configuration, which gives a detailed overview of its setup and usage.

Connecting a Dataset to the Datasource

The initial step involves linking your dataset to the desired datasource:

  1. Press the + icon to start adding a new dataset.
  2. Enter a name for your dataset.
  3. From the dropdown list, select the source type of your dataset. Available options include:
    • File: Any supported file type for upload.
    • Table: Supported file types with a table structure, such as CSV or XLSX.
    • GIT: Any accessible Git repository.
    • Confluence: Any Confluence page accessible to you.
    • QTest: Any QTest project accessible to you.

Datasources-Dataset

Depending on the selected source type, various configurations may be necessary to access the dataset source, primarily involving authentication and authorization parameters. This step is not required for the File and Table options, since those files are uploaded directly.

Source type - File

ELITEA supports a variety of file types and offers flexible settings to handle your documents effectively. Below is an easy-to-understand breakdown of the options, settings, and parameters available for configuration.
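
Under the hood, uploaded files are typically split into smaller, overlapping chunks before being embedded and indexed; chunk size and overlap control how much context each indexed piece carries. A rough illustration of the idea (the parameter values are illustrative, not ELITEA's defaults):

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into overlapping chunks, e.g. before embedding and indexing.

    Assumes overlap < chunk_size so the loop always advances.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks
```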

Datasources-Dataset_File

Source type - Table

This functionality is crucial for users who work with structured data in formats such as spreadsheets or structured text files. The aim is to make the process straightforward for users without requiring deep technical knowledge. Here, we outline the options, settings, and parameters available for your table data sources.
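
With table sources, each row is typically treated as a small, self-contained record. A hedged sketch of how a CSV might be turned into per-row text records (using pandas; the rendering format is illustrative):

```python
import pandas as pd

def rows_as_records(path: str):
    """Read a CSV and render each row as a 'column: value' text record."""
    df = pd.read_csv(path)  # pd.read_excel(path) would cover XLSX files
    for _, row in df.iterrows():
        yield "; ".join(f"{col}: {row[col]}" for col in df.columns)
```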

Datasources-Dataset_Table

Source type - GIT

For users who rely on Git repositories to manage their code, documents, or other types of projects, this source type streamlines the process of linking and extracting data from those repositories. Here, we outline the options, settings, and parameters available for the GIT source type.

Important Note: To ensure a successful connection, provide the repository's clone link (the URL you would use with git clone). Simply copying the repository address from your browser's address bar is not sufficient. The clone link ensures that ELITEA has a valid, accessible URL it can connect to without issues.
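
If you are unsure whether a link is a valid clone URL, you can sanity-check it locally before adding it. The sketch below shells out to git ls-remote (a standard Git command), which succeeds only for a URL that Git can actually reach; the example URLs are hypothetical:

```python
import subprocess

def is_cloneable(git_url: str) -> bool:
    """Return True if `git ls-remote` can reach the repository."""
    result = subprocess.run(
        ["git", "ls-remote", git_url],
        capture_output=True,
        timeout=30,
    )
    return result.returncode == 0

# A clone link such as "https://example.com/team/repo.git" should pass,
# while a page address copied from the browser (e.g. ".../repo/tree/main")
# typically will not.
```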

Datasources-Dataset_GIT

Source type - Confluence

For users who rely on Confluence pages to manage their information, documents, or other types of projects, this source type streamlines the process of linking and extracting data from those knowledge pages. Here, we outline the options, settings, and parameters available for the Confluence source type.

Important Note: To establish a successful connection to Confluence from ELITEA, you must select one of the available filters and provide the corresponding value for it. This step is crucial because it defines the scope of content that ELITEA will access and import from Confluence, aligning the integration with your project's specific requirements.

Datasources-Dataset_Confluence

Source type - QTest

Integrating QTest with ELITEA enhances your test management by connecting directly to the QTest Test Case Management System (TCMS). This integration allows you to use test cases for duplication checks and search, and to leverage manual test cases for future automation with Generative AI. Below, we detail the configuration options, settings, and parameters available for the QTest source type.

TRANSFORMERS

Transformers enhance your documents by extracting significant keywords, summarizing content, and improving searchability. Note: If you don't clearly understand the purpose of the parameters and options available here, leave them at their default values and don't make any changes.

Datasources-Dataset_Transformers

SUMMARIZATION

Summarization uses LLMs to condense documents into their core messages. Because of the high computational demand, using this feature incurs additional costs. Note: If you don't clearly understand the purpose of the parameters and options available here, leave them at their default values and don't make any changes.

    • Summarization model - select from the available LLMs based on your document’s complexity.
    • Document summarization - enables summarization of the entire document.
    • Chunk summarization - applies summarization to individual sections (chunks) of the document.

Finally, click Create to index the dataset for use. Note that processing can take up to 10 minutes, depending on the source type and size.

Datasources-Dataset_Summarization
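
To picture the difference between the two modes: document summarization condenses the whole text in one pass, while chunk summarization condenses each chunk separately (and the per-chunk summaries can later be combined). A rough sketch, with summarize_with_llm standing in for whichever summarization model you selected:

```python
def summarize_document(text, summarize_with_llm):
    # Document summarization: one pass over the entire text.
    return summarize_with_llm(text)

def summarize_chunks(chunks, summarize_with_llm):
    # Chunk summarization: each chunk is condensed on its own, which keeps
    # every call small but multiplies the number of LLM calls -- one reason
    # this feature incurs extra cost.
    return [summarize_with_llm(chunk) for chunk in chunks]
```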

Note: Multiple datasets can be utilized within the same datasource, enhancing versatility and depth of analysis.

CONTEXT

The Context input field is a designated area for providing instructions (a prompt) that guide how the LLM uses the information from your configured datasets. This prompt tells the LLM how to interpret and analyze the dataset, ensuring that the generated output aligns with your specific objectives. Note: By providing detailed and clear instructions in the Context field, you effectively guide the processing and analysis of your datasets, leveraging the capabilities of LLMs for tailored insights and actions.
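
For example, a Context instruction might read: "You are an assistant for our QA team. Answer questions using only the information retrieved from the connected datasets, and mention which document each answer comes from." The exact wording is up to you; the key is to state the role, the scope, and the expected output.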

Datasources-Context

WELCOME MESSAGE

The Welcome Message feature allows you to provide additional context for prompts, datasources, and agents. Currently, the Welcome Message is sent to the LLM along with the other instructions.

How to Add the Welcome Message:

  1. Add the Welcome Message: Type the welcome message text in the input field.
  2. Save the Configuration: After entering the desired text, save the changes to the datasource. This makes the configured welcome message available to users in the Chat section.

Using the Welcome Message:

Go to the Chat section of the datasource, where you will see the configured Welcome Message. It provides additional notifications or instructions to the user.

Examples of Welcome Messages:

Datasources-Welcome_Message

CONVERSATION STARTERS

The Conversation Starter feature enables you to configure and add predefined text that can be used to initiate a conversation when executing the datasource. This feature is particularly useful for setting a consistent starting point for interactions facilitated by the datasource.

How to Add a Conversation Starter:

  1. Access the Configuration Panel: Navigate to the Conversation Starter section.
  2. Add a Conversation Starter: Click the + icon to open the text input field where you can type the text you wish to use as a conversation starter.
  3. Save the Configuration: After entering the desired text, save the changes to the datasource. This makes the configured conversation starter available for use.

Using a Conversation Starter:

Initiate a Conversation: Go to the Chat section of the datasource. Here, you will find the saved conversation starters listed. Click on the desired starter to automatically populate the chat input and execute the datasource.

Examples of Conversation Starters:

Datasources-Conversation_Starters

By setting up conversation starters, you streamline the process of initiating specific tasks or queries, making your interactions with the datasource more efficient and standardized.

Working with Your Dataset

After you've successfully created your dataset(s), a variety of features become available for you to explore and utilize. These features are designed to help you interact with your dataset in a more intuitive and productive manner. Here's a brief overview of what you can do:

Chat

The Chat feature is tailored for conversational AI models, enabling you to engage in dialogues or interactions akin to conversing with a human. Whether you're asking a question, making a statement, or giving a command, this feature is designed to generate responses that mimic human conversation.

To use the Chat and query info:

  1. Select the Embedding model from the dropdown list. Note: It must be the same model that was used when creating the datasource; embeddings produced by different models are not comparable.
  2. Choose a Chat model (e.g., gpt-4-0125-preview, gpt-35-turbo, etc.) suited to your conversation needs.
  3. Optionally, you can configure Advanced Settings for more tailored outputs by clicking the Settings icon. Note: Please exercise caution with these settings; if unsure about their functions, it's advisable to leave them at their default values. The following settings are available:
    • Initial Lookup Result (1 – 50) - specifies the number of initial results retrieved from the indexed dataset(s) for further processing.
      • Higher values: More initial results are retrieved, which can increase the chances of finding relevant information but may slow down processing.
      • Lower values: Fewer initial results are retrieved, which can speed up processing but might miss some relevant information.
    • Pages Per Document (1 – 30) - defines the number of pages to be considered per document during the retrieval or processing phase.
      • Higher values: More pages per document are considered, which can provide more comprehensive information but may slow down processing.
      • Lower values: Fewer pages per document are considered, which can speed up processing but might miss some important details.
    • Expected Search Results (1 – 40) - sets the anticipated number of search results to be returned, guiding the system's retrieval scope.
      • Higher values: More search results are returned, which can provide a broader range of information but may include less relevant results.
      • Lower values: Fewer search results are returned, which can provide more focused and relevant information but might miss some useful results.
    • Temperature (0.1-1.0) - adjusts the level of creativity or unpredictability in responses.
      • Higher values: Responses are more creative and varied, but may be less consistent and more unpredictable.
      • Lower values: Responses are more consistent and predictable, but may be less creative and varied.
    • Top P (0-1) - determines the cumulative probability threshold for selecting words, balancing between creativity and consistency.
      • Higher values: A wider range of words is considered, leading to more creative and diverse responses.
      • Lower values: A narrower range of words is considered, leading to more consistent and predictable responses.
    • Top K (1-40) - limits the choice of words to the K most probable, affecting the response's diversity and predictability.
      • Higher values: More words are considered, leading to more diverse and potentially creative responses.
      • Lower values: Fewer words are considered, leading to more predictable and focused responses.
    • Maximum length - sets the cap on the response length, helping tailor responses to be as concise or detailed as desired.
      • Higher values: Responses can be longer and more detailed.
      • Lower values: Responses are shorter and more concise.
  4. Type your text in the chat box and click the Send icon to initiate the dialogue.

Datasources-Chat
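
For intuition about Temperature, Top P, and Top K, the sketch below shows how such sampling settings are conventionally applied when choosing the next token from a model's output distribution. This is a generic illustration, not ELITEA's internal code:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Pick the next token id from raw model scores (logits)."""
    logits = np.asarray(logits, dtype=float)
    # Temperature: lower values sharpen the distribution (more predictable),
    # higher values flatten it (more varied).
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Top K: keep only the K most probable tokens.
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]
    # Top P: within those, keep the smallest set whose cumulative
    # probability reaches the threshold.
    cumulative = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))
```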

Search

The Search feature allows you to quickly locate specific information within your indexed dataset.

How to Conduct a Search:

  1. Select the Embedding model from the dropdown list. Note: It must be the same model that was used when creating the datasource.
  2. Optionally, you can configure Advanced Settings for more tailored outputs by clicking the Settings icon. Note: Please exercise caution with these settings; if unsure about their functions, it's advisable to leave them at their default values. The following settings are available:
    • Initial Lookup Result (1 – 50) - specifies the number of initial results retrieved from the indexed dataset(s) for further processing.
      • Higher values: More initial results are retrieved, which can increase the chances of finding relevant information but may slow down processing.
      • Lower values: Fewer initial results are retrieved, which can speed up processing but might miss some relevant information.
    • Pages Per Document (1 – 30) - defines the number of pages to be considered per document during the retrieval or processing phase.
      • Higher values: More pages per document are considered, which can provide more comprehensive information but may slow down processing.
      • Lower values: Fewer pages per document are considered, which can speed up processing but might miss some important details.
    • Expected Search Results (1 – 40) - sets the anticipated number of search results to be returned, guiding the system's retrieval scope.
      • Higher values: More search results are returned, which can provide a broader range of information but may include less relevant results.
      • Lower values: Fewer search results are returned, which can provide more focused and relevant information but might miss some useful results.
    • String content - determines whether the system should include or consider specific text data in its processing or generation.
  3. Type your query into the input field and hit the Send icon.

Datasources-Search
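
One way to picture how Initial Lookup Result and Expected Search Results interact is as a two-stage funnel: a broad first retrieval from the index, then a narrower final selection. The sketch below is illustrative; index.nearest and rerank are hypothetical helpers, not ELITEA's API:

```python
def search(query_vec, index, initial_lookup=10, expected_results=5, rerank=None):
    """Two-stage retrieval: broad lookup, then narrow to the final results."""
    # Stage 1: pull a generous candidate set from the index.
    candidates = index.nearest(query_vec, n=initial_lookup)
    # Stage 2: optionally re-rank, then keep only the expected number.
    if rerank is not None:
        candidates = rerank(candidates)
    return candidates[:expected_results]
```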

Deduplicate

Deduplication is a handy feature for identifying duplicate information within your indexed dataset.

How to Run Deduplication:

  1. Select the Embedding model from the dropdown list. Note: It must be the same model that was used when creating the datasource.
  2. Configure the following settings:
    • Cut-off Score (0.1-1.0) - the similarity threshold at which content is treated as duplicate.
      • Higher values: Only very similar items are flagged as duplicates, so fewer items are reported; useful when only near-identical content should count as a duplicate.
      • Lower values: Less similar items are also flagged as duplicates, so more items are reported; useful when you want to catch looser variations of the same content.
  3. Optionally, you can configure Advanced Settings for more tailored outputs by clicking the Settings icon. Note: Please exercise caution with these settings; if unsure about their functions, it's advisable to leave them at their default values. The following settings are available:
    • Show Additional Metadata - a checkbox option that determines whether to display extra information about the content.
      • Selected: Additional metadata will be shown, providing more context and details about the content.
      • Not Selected: Additional metadata will not be shown, resulting in a cleaner and simpler display of the content.
    • Exclude Fields - a comma-separated list of field names that should be excluded from the deduplication process.
      • Specified fields: The fields listed will be ignored during deduplication, which means differences in these fields will not affect whether content is considered a duplicate. This can be useful if certain fields (like timestamps or IDs) are not relevant to the deduplication criteria.
      • No fields specified: All fields will be considered during deduplication, which means any differences in any field can affect whether content is considered a duplicate.
  4. Click the Run button to start the process.

Note: Deduplication results depend on the parameters you've set for the dataset.
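
To make the Cut-off Score and Exclude Fields options concrete, the sketch below flags pairs of records whose embedding similarity meets the threshold after dropping the excluded fields. The embed function is a hypothetical stand-in for the selected embedding model:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def find_duplicates(records, embed, cutoff=0.8, exclude_fields=()):
    """Flag record pairs whose similarity is at or above the cut-off score."""
    def render(record):
        # Excluded fields (e.g. timestamps or IDs) are ignored in the comparison.
        return " ".join(str(v) for k, v in record.items() if k not in exclude_fields)

    vectors = [embed(render(r)) for r in records]
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if cosine(vectors[i], vectors[j]) >= cutoff:
                pairs.append((i, j))  # records i and j look like duplicates
    return pairs
```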

By using these features, you’re equipped to enhance your dataset, making it a more efficient and effective tool for your AI applications. Proceed with adjustments only if you're confident in your understanding of their implications.

Datasources-Deduplicate

Public project - Datasources menu

The Datasources menu within the Public project showcases a collection of published and shared datasources within the community.

Layout of the Datasources Menu

The Datasources menu is organized into three distinct pages, each designed to offer a unique perspective on the available datasources:

Datasources-Menu_Public

Engaging with Published Datasources

Interaction within the community is highly encouraged to recognize and appreciate valuable datasources. The following actions enable active participation:

Liking Published Datasources

Upon publication, a datasource becomes a shared resource for the community. To support and acknowledge a datasource, use the Like functionality.

Other Actions for Published Datasources

Using Published Datasources: