Data ingestion via MCP

There are three ways to load data into a clariBI workspace from an LLM client. Pick the path that matches where the data lives: inline (small files already in the chat), URL (publicly hosted files), or OAuth handoff (cloud sources such as Google Ads or Jira).

Choose a path

Path	Tool	Cap	When to use
Inline	`upload_data_source`	~18 MB raw (25 MB base64)	The LLM already has the file in context (the user dragged in a CSV, exported an analysis to JSON, or pasted a small Excel sheet).
URL	`ingest_url_data_source`	100 MB raw	The file lives at a public http(s) URL (S3 link, integration partner export, public dataset). Streamed server-side.
OAuth	`request_oauth_integration_url` + `check_integration_status`	n/a (live source)	Google Ads / Analytics 4 / Search Console / Sheets / Drive / BigQuery, Meta Ads, Atlassian Jira / Confluence. Credentials never travel through chat.

Inline upload

Encode the file as base64, then call upload_data_source:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "upload_data_source",
    "arguments": {
      "name": "Q3 Sales Export",
      "format": "csv",
      "data_base64": "<base64-encoded file contents>",
      "description": "Pulled from Salesforce on 2026-06-01",
      "wait_seconds": 30
    }
  }
}

Supported formats: csv, tsv, json, xlsx, xls, txt, pdf.

The server validates the file, runs the same preprocessing pipeline a web upload would, and returns a data_source_id. Pass wait_seconds > 0 to block until preprocessing finishes; otherwise poll get_data_source_schema(data_source_id) until status == "active".
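As a minimal sketch of the encoding step, the following Python builds the tools/call request shown above from raw file bytes. The helper name build_upload_request is hypothetical, and how the request is sent depends on your MCP client and transport, which this sketch leaves out.

```python
import base64
import json


def build_upload_request(name, fmt, raw, description="", request_id=1, wait_seconds=30):
    """Build the JSON-RPC body for upload_data_source (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "upload_data_source",
            "arguments": {
                "name": name,
                "format": fmt,
                # Standard base64, decoded to ASCII so it is JSON-serializable.
                "data_base64": base64.b64encode(raw).decode("ascii"),
                "description": description,
                "wait_seconds": wait_seconds,
            },
        },
    }


request = build_upload_request(
    "Q3 Sales Export", "csv", b"region,revenue\nEMEA,1200\n",
    description="Pulled from Salesforce on 2026-06-01",
)
body = json.dumps(request)  # string to hand to your MCP transport
```

In practice the raw bytes would come from reading the file in binary mode, for example with pathlib.Path(...).read_bytes().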

Size cap

The 25 MB cap applies to the base64-encoded payload. Base64 expands data by a factor of 4/3, so the raw file can be at most about 18 MB. For larger files, host them at a public URL and use ingest_url_data_source.
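The relationship between raw and encoded size can be checked before encoding. This is a rough sketch: it assumes "25 MB" means 25,000,000 bytes (the server may count in MiB instead), and the function names are illustrative.

```python
ENCODED_CAP_BYTES = 25 * 1000 * 1000  # assumption: "25 MB" means 25,000,000 bytes


def base64_length(raw_len):
    """Length of padded standard base64 output for raw_len input bytes."""
    # Every 3 input bytes (rounded up) become 4 output characters.
    return 4 * ((raw_len + 2) // 3)


def fits_inline(raw_len, cap=ENCODED_CAP_BYTES):
    """True if a file of raw_len bytes stays within the encoded cap."""
    return base64_length(raw_len) <= cap
```

Under that assumption an 18,000,000-byte file encodes to 24,000,000 characters and fits, while a 19,000,000-byte file does not.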

Quota gating

Every upload counts against your tier's data source limit, file upload quota, and storage cap. The Trial tier allows 3 sources, 5 uploads, and 1 GB of storage; limits scale up linearly from Starter through Enterprise. Upload errors state which quota was hit.

URL ingest

Pass the file's public http(s) URL:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/call",
  "params": {
    "name": "ingest_url_data_source",
    "arguments": {
      "name": "Stripe charges export",
      "url": "https://example.com/exports/charges-2026-06.csv",
      "format": "csv"
    }
  }
}

The format argument is optional: clariBI infers it from the Content-Type header or the URL extension. The server fetches the file once and stores it; subsequent analyses do not re-fetch.
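To decide whether it is safe to omit format, a client can make the same kind of guess locally. The sketch below is not clariBI's inference logic: the MIME table, the precedence (Content-Type first, then extension), and the guess_format name are all assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Common MIME types mapped to the format names listed under "Supported formats".
MIME_TO_FORMAT = {
    "text/csv": "csv",
    "text/tab-separated-values": "tsv",
    "application/json": "json",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "application/vnd.ms-excel": "xls",
    "text/plain": "txt",
    "application/pdf": "pdf",
}
SUPPORTED = set(MIME_TO_FORMAT.values())


def guess_format(url, content_type=None):
    """Return a format name, or None when neither signal is conclusive."""
    if content_type:
        # Drop parameters such as "; charset=utf-8".
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in MIME_TO_FORMAT:
            return MIME_TO_FORMAT[mime]
    # Fall back to the extension of the URL path (query string ignored).
    ext = PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()
    return ext if ext in SUPPORTED else None
```

When the guess is None, or the host serves a generic Content-Type, passing format explicitly avoids relying on inference.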

Network safety

The fetcher resolves the hostname server-side and rejects:

Private ranges: RFC 1918 (10/8, 172.16/12, 192.168/16), loopback (127/8, ::1), link-local (169.254/16), IPv6 ULA (fc00::/7).
Cloud metadata endpoints (169.254.169.254, GCP/Azure equivalents).
URLs with embedded credentials (user:pass@host).
Non-http(s) schemes.

Redirects are validated at each hop and capped at 3.
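The rejection rules above can be sketched with Python's standard ipaddress and urllib.parse modules. This is an illustration of the checks, not clariBI's fetcher: the function names are hypothetical, and the IPv4-mapped IPv6 handling is an extra precaution added here rather than something the rules above state. A real fetcher would also need to connect to the exact address it validated and repeat the checks on every redirect hop.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "127.0.0.0/8", "::1/128",                          # loopback
    "169.254.0.0/16",                                  # link-local, covers 169.254.169.254
    "fc00::/7",                                        # IPv6 ULA
)]


def url_problem(url):
    """Return a reason string if the URL itself is unacceptable, else None."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "non-http(s) scheme"
    if parts.username is not None or parts.password is not None:
        return "embedded credentials"
    if not parts.hostname:
        return "missing host"
    return None


def ip_is_blocked(address):
    """True if a resolved IP address falls in one of the blocked ranges."""
    ip = ipaddress.ip_address(address)
    # Assumption: unwrap IPv4-mapped IPv6 (::ffff:10.0.0.1) so it cannot
    # slip past the IPv4 rules.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return any(ip in net for net in BLOCKED_NETWORKS)
```

For example, url_problem("https://user:pass@example.com/a.csv") returns "embedded credentials", and ip_is_blocked("169.254.169.254") returns True.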

OAuth handoff

For cloud sources where credentials should never traverse chat, call request_oauth_integration_url to get a browser URL:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "request_oauth_integration_url",
    "arguments": {
      "provider": "google",
      "integration_type": "ga4"
    }
  }
}

The response includes an authorize_url and a handoff_id. The LLM hands the URL to the user, who opens it in a browser and clicks Allow. The LLM then polls until the handoff completes:

{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "tools/call",
  "params": {
    "name": "check_integration_status",
    "arguments": {
      "handoff_id": "<uuid from request_oauth_integration_url>"
    }
  }
}

The status starts as pending and moves to connected once the user approves. It can instead end as failed, or as expired if the handoff is not completed within 10 minutes. On connected, the response carries a connection_id and, once preprocessing runs, a data_source_id.
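A polling loop for the handoff might look like the sketch below. It assumes a call_tool(name, arguments) callable provided by your MCP client that returns the tool result as a dict with a "status" key; that callable, the wait_for_handoff name, and the 5-second interval are illustrative, not part of clariBI's API.

```python
import time


def wait_for_handoff(call_tool, handoff_id, timeout=600.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll check_integration_status until the handoff leaves "pending".

    call_tool is a hypothetical callable supplied by the MCP client.
    The default timeout matches the 10-minute expiry window.
    """
    deadline = clock() + timeout
    while True:
        result = call_tool("check_integration_status", {"handoff_id": handoff_id})
        if result["status"] != "pending":
            return result  # connected, failed, or expired
        if clock() >= deadline:
            raise TimeoutError(f"handoff {handoff_id} still pending after {timeout}s")
        sleep(interval)
```

The caller inspects the returned status: on connected it can read connection_id, while failed or expired means requesting a fresh authorize_url.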

Supported providers and integration types

Provider	integration_type values
Google	`basic`, `google_ads`, `gsheets`, `ga4`, `gsc`, `gdrive`, `gdocs`, `bigquery`, `gcs`, `gcp`
Meta	`basic`, `ads`
Jira	`basic`
Confluence	`basic`

After ingestion

Once get_data_source_schema reports status == "active", call run_analysis with a natural-language question. The analysis engine uses the new source automatically.

{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "tools/call",
  "params": {
    "name": "run_analysis",
    "arguments": {
      "question": "Which channel drove the most revenue in Q3?",
      "wait_seconds": 30
    }
  }
}

Database connections are deliberately not exposed

Pasting a PostgreSQL or MySQL connection string into chat would leak credentials to wherever the LLM transcript is stored. clariBI's MCP server intentionally does not include a connect_database tool. Connect databases from Settings → Data Sources in the web app.