Data ingestion via MCP


There are three ways to load data into a clariBI workspace from an LLM client. Pick the path that matches where the data lives: inline (small files already in the chat), URL (publicly hosted files), or OAuth handoff (cloud sources such as Google Ads or Jira).

Choose a path

Path | Tool | Cap | When to use
Inline | upload_data_source | ~18 MB raw (25 MB base64) | The LLM already has the file in context (the user dragged in a CSV, exported an analysis to JSON, or pasted a small Excel sheet).
URL | ingest_url_data_source | 100 MB raw | The file lives at a public http(s) URL (S3 link, integration partner export, public dataset). Streamed server-side.
OAuth | request_oauth_integration_url + check_integration_status | n/a (live source) | Google Ads / Analytics 4 / Search Console / Sheets / Drive / BigQuery, Meta Ads, Atlassian Jira / Confluence. Credentials never travel through chat.

Inline upload

Encode the file as base64, then call upload_data_source:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "upload_data_source",
    "arguments": {
      "name": "Q3 Sales Export",
      "format": "csv",
      "data_base64": "<base64-encoded file contents>",
      "description": "Pulled from Salesforce on 2026-06-01",
      "wait_seconds": 30
    }
  }
}

Supported formats: csv, tsv, json, xlsx, xls, txt, pdf.

The server validates the file, runs the same preprocessing pipeline a web upload would, and returns a data_source_id. Pass wait_seconds > 0 to block until preprocessing finishes; otherwise poll get_data_source_schema(data_source_id) until status == "active".
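The polling path can be sketched as a small loop. This is a minimal sketch, not clariBI client code: call_tool is a hypothetical callable standing in for whatever your MCP client uses to invoke a tool, and it is assumed to return the tool result as a dict with a "status" field. The argument key data_source_id is also an assumption based on the call shape shown above.

```python
import time
from typing import Any, Callable, Dict


def wait_until_active(
    call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],  # hypothetical MCP client hook
    data_source_id: str,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_data_source_schema until status == "active" or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Assumption: the tool result is a dict that carries a "status" field.
        schema = call_tool("get_data_source_schema", {"data_source_id": data_source_id})
        if schema.get("status") == "active":
            return schema
        if time.monotonic() >= deadline:
            raise TimeoutError(f"data source {data_source_id} not active after {timeout_s}s")
        sleep(interval_s)
```

Passing sleep as a parameter keeps the loop easy to test without real delays.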

Size cap

The 25 MB cap applies to the base64-encoded payload. Base64 expands data by about one third, so the raw file can be roughly 18 MB. For larger files, host them at a public URL and use ingest_url_data_source.
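A client can check the cap before sending anything. The sketch below computes the base64 length from the raw length (4 output bytes per 3 input bytes, rounded up) and builds the upload_data_source request only if it fits. It assumes "25 MB" means 25,000,000 bytes, which the source does not specify; encoded_size and build_upload_request are hypothetical helper names.

```python
import base64
from typing import Any, Dict

# Assumption: "25 MB" is read as decimal megabytes; adjust if the server uses MiB.
MAX_ENCODED_BYTES = 25 * 1000 * 1000


def encoded_size(raw_len: int) -> int:
    """Length of padded base64 output for raw_len input bytes."""
    return 4 * ((raw_len + 2) // 3)


def build_upload_request(raw: bytes, name: str, fmt: str, request_id: int = 1) -> Dict[str, Any]:
    """Build the JSON-RPC body for upload_data_source, refusing oversized files."""
    if encoded_size(len(raw)) > MAX_ENCODED_BYTES:
        raise ValueError("file exceeds the inline cap; host it and use ingest_url_data_source")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "upload_data_source",
            "arguments": {
                "name": name,
                "format": fmt,
                "data_base64": base64.b64encode(raw).decode("ascii"),
            },
        },
    }
```

Under that assumption the largest raw file that fits is 18,750,000 bytes, consistent with the "~18 MB" figure.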

Quota gating

Every upload counts against your tier's data source limit, file upload quota, and storage cap. Trial allows 3 sources / 5 uploads / 1 GB; Starter through Enterprise scale up linearly. Upload errors clearly state which quota was hit.

URL ingest

Pass the file's public http(s) URL:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/call",
  "params": {
    "name": "ingest_url_data_source",
    "arguments": {
      "name": "Stripe charges export",
      "url": "https://example.com/exports/charges-2026-06.csv",
      "format": "csv"
    }
  }
}

The format argument is optional: clariBI infers it from the Content-Type header or the URL extension. The server fetches the file once and stores it; subsequent analyses do not re-fetch.

Network safety

The fetcher resolves the hostname server-side and rejects:

  • Private ranges: RFC 1918 (10/8, 172.16/12, 192.168/16), loopback (127/8, ::1), link-local (169.254/16), IPv6 ULA (fc00::/7).
  • Cloud metadata endpoints (169.254.169.254, GCP/Azure equivalents).
  • URLs with embedded credentials (user:pass@host).
  • Non-http(s) schemes.

Redirects are validated at each hop and capped at 3.
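The rules above can be approximated client-side to catch obvious rejections before calling the tool. This is an illustrative sketch of the listed checks, not clariBI's fetcher: rejection_reason is a hypothetical helper, the caller supplies the resolved IP addresses, and the server's own resolution and per-hop redirect validation remain authoritative.

```python
import ipaddress
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Ranges taken from the list above; cloud metadata at 169.254.169.254 falls in link-local.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "127.0.0.0/8", "::1/128",                          # loopback
    "169.254.0.0/16",                                  # link-local
    "fc00::/7",                                        # IPv6 ULA
)]


def rejection_reason(url: str, resolved_ips: Iterable[str]) -> Optional[str]:
    """Return why the URL would likely be rejected, or None if no rule matches."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "non-http(s) scheme"
    if parts.username is not None or parts.password is not None:
        return "embedded credentials"
    if not parts.hostname:
        return "missing host"
    for raw in resolved_ips:
        ip = ipaddress.ip_address(raw)
        if any(ip in net for net in BLOCKED_NETWORKS):
            return f"blocked address {ip}"
    return None
```

The sketch does not cover every case; for example, IPv4-mapped IPv6 addresses are not unwrapped before the range check.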

OAuth handoff

For cloud sources where credentials should never traverse chat, call request_oauth_integration_url to get a browser URL:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "request_oauth_integration_url",
    "arguments": {
      "provider": "google",
      "integration_type": "ga4"
    }
  }
}

The response includes an authorize_url and a handoff_id. The LLM hands the URL to the user, who opens it in a browser and clicks Allow. The LLM then polls until the handoff completes:

{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "tools/call",
  "params": {
    "name": "check_integration_status",
    "arguments": {
      "handoff_id": "<uuid from request_oauth_integration_url>"
    }
  }
}

The status moves pending → connected (or ends as failed, or as expired after 10 minutes). On connected, the response carries a connection_id and, once preprocessing runs, a data_source_id.
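One way to handle each poll result is a small decision function over the statuses named above. This is a sketch under assumptions: next_step is a hypothetical helper, and the response is assumed to be a dict with "status" and, when available, "data_source_id" keys; the actual response shape may differ.

```python
from typing import Any, Dict, Tuple


def next_step(status_response: Dict[str, Any]) -> Tuple[str, str]:
    """Map a check_integration_status result to (action, detail).

    Actions: "wait" (poll again), "ready" (detail is the data_source_id),
    "restart" (request a new authorize_url), "error" (unknown status).
    """
    status = status_response.get("status")  # assumed field name
    if status == "pending":
        return ("wait", "user has not finished the browser flow yet")
    if status == "connected":
        data_source_id = status_response.get("data_source_id")  # assumed field name
        if data_source_id:
            return ("ready", data_source_id)
        return ("wait", "connected; preprocessing has not produced a data_source_id yet")
    if status in ("failed", "expired"):
        return ("restart", f"handoff {status}; request a new authorize_url")
    return ("error", f"unrecognised status: {status!r}")
```

Treating connected without a data_source_id as "wait" reflects the note that the data_source_id appears only once preprocessing has run.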

Supported providers and integration types

Provider | integration_type values
Google | basic, google_ads, gsheets, ga4, gsc, gdrive, gdocs, bigquery, gcs, gcp
Meta | basic, ads
Jira | basic
Confluence | basic

After ingestion

Once get_data_source_schema reports status == "active", call run_analysis with a natural-language question. The analysis engine uses the new source automatically.

{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "tools/call",
  "params": {
    "name": "run_analysis",
    "arguments": {
      "question": "Which channel drove the most revenue in Q3?",
      "wait_seconds": 30
    }
  }
}

Database connections are deliberately not exposed

Pasting a PostgreSQL or MySQL connection string into chat would leak credentials to wherever the LLM transcript is stored. clariBI's MCP server intentionally does not include a connect_database tool. Connect databases from Settings → Data Sources in the web app.