Data ingestion via MCP
Three ways to load data into a clariBI workspace from an LLM client. Pick the path that matches the data's location: inline (small files in chat), URL (public hosted files), or OAuth handoff (cloud sources like Google Ads or Jira).
Choose a path
| Path | Tool | Cap | When to use |
|---|---|---|---|
| Inline | upload_data_source | ~18 MB raw (25 MB base64) | The LLM already has the file in context (the user dragged in a CSV, exported an analysis to JSON, or pasted a small Excel sheet). |
| URL | ingest_url_data_source | 100 MB raw | The file lives at a public http(s) URL (S3 link, integration partner export, public dataset). Streamed server-side. |
| OAuth | request_oauth_integration_url + check_integration_status | n/a (live source) | Google Ads / Analytics 4 / Search Console / Sheets / Drive / BigQuery, Meta Ads, Atlassian Jira / Confluence. Credentials never travel through chat. |
Inline upload
Encode the file as base64, then call upload_data_source:
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "upload_data_source",
"arguments": {
"name": "Q3 Sales Export",
"format": "csv",
"data_base64": "<base64-encoded file contents>",
"description": "Pulled from Salesforce on 2026-06-01",
"wait_seconds": 30
}
}
}
Supported formats: csv, tsv, json, xlsx, xls, txt, pdf.
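A minimal Python sketch of the encode-then-call step, assuming standard (not URL-safe) base64 is what the server expects. The helper name build_upload_request is hypothetical, and how the payload reaches the MCP server depends on your client, so the sketch only builds the JSON-RPC body.

```python
import base64
import json


def build_upload_request(raw: bytes, name: str, fmt: str,
                         request_id: int = 1, wait_seconds: int = 30) -> dict:
    """Build a JSON-RPC tools/call payload for upload_data_source."""
    # Assumption: standard base64 alphabet, emitted as ASCII text for JSON.
    data_base64 = base64.b64encode(raw).decode("ascii")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "upload_data_source",
            "arguments": {
                "name": name,
                "format": fmt,
                "data_base64": data_base64,
                "wait_seconds": wait_seconds,
            },
        },
    }


# Usage: encode a tiny in-memory CSV and print the request body.
csv_bytes = b"region,revenue\nEMEA,1200\nAPAC,950\n"
payload = build_upload_request(csv_bytes, "Q3 Sales Export", "csv")
print(json.dumps(payload, indent=2))
```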
The server validates the file, runs the same preprocessing pipeline a web upload would, and returns a data_source_id. Pass wait_seconds > 0 to block until preprocessing finishes; otherwise poll get_data_source_schema(data_source_id) until status == "active".
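The polling loop described above might look like the following sketch. The call_tool parameter is a hypothetical stand-in for whatever function your MCP client uses to invoke a tool and return its parsed result, and the sketch assumes the result is a dict with a top-level status field; neither detail is confirmed here.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def wait_until_active(call_tool: ToolCaller, data_source_id: str,
                      timeout_s: float = 120.0,
                      interval_s: float = 2.0) -> Dict[str, Any]:
    """Poll get_data_source_schema until status == "active" or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        schema = call_tool("get_data_source_schema",
                           {"data_source_id": data_source_id})
        if schema.get("status") == "active":
            return schema
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"data source {data_source_id} not active after {timeout_s}s")
        time.sleep(interval_s)
```

The timeout and interval defaults are arbitrary; tune them to the file sizes you expect.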
Size cap
25 MB is the encoded size; raw bytes are ~18 MB after decode. For larger files, host them at a public URL and use ingest_url_data_source.
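The relationship between the two numbers follows from base64 turning every 3 raw bytes into 4 output characters. A client-side pre-check could be sketched as below, assuming "25 MB" means 25 × 1024 × 1024 bytes (the exact unit is not stated here); the function names are illustrative.

```python
import base64

# Assumption: the cap is 25 MiB of base64 text.
MAX_ENCODED_BYTES = 25 * 1024 * 1024


def base64_length(raw_len: int) -> int:
    """Length of padded standard base64 output for raw_len input bytes."""
    return 4 * ((raw_len + 2) // 3)


def fits_inline(raw_len: int, limit: int = MAX_ENCODED_BYTES) -> bool:
    """True if a file of raw_len bytes stays within the encoded-size cap."""
    return base64_length(raw_len) <= limit


# Sanity check of the formula against the real encoder.
assert base64_length(10) == len(base64.b64encode(b"x" * 10))
```

Under that assumption the largest raw file that fits is 19,660,800 bytes (18.75 MiB), which matches the "~18 MB" figure.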
Quota gating
Every upload counts against your tier's data source limit, file upload quota, and storage cap. Trial allows 3 sources / 5 uploads / 1 GB; Starter through Enterprise scale up linearly. Upload errors clearly state which quota was hit.
URL ingest
Pass the file's public http(s) URL:
{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "ingest_url_data_source",
"arguments": {
"name": "Stripe charges export",
"url": "https://example.com/exports/charges-2026-06.csv",
"format": "csv"
}
}
}
The format argument is optional: clariBI infers it from the Content-Type header or the URL extension. The server fetches the file once and stores it; subsequent analyses do not re-fetch.
Network safety
The fetcher resolves the hostname server-side and rejects:
- Private ranges: RFC 1918 (10/8, 172.16/12, 192.168/16), loopback (127/8, ::1), link-local (169.254/16), IPv6 ULA (fc00::/7).
- Cloud metadata endpoints (169.254.169.254, GCP/Azure equivalents).
- URLs with embedded credentials (user:pass@host).
- Non-http(s) schemes.
Redirects are validated at each hop and capped at 3.
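For illustration, the address and URL rules above can be approximated client-side with Python's standard library. This is a sketch of the listed rules only, not clariBI's actual fetcher: it does no DNS resolution or redirect handling, and the function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

# Networks taken from the list above.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "127.0.0.0/8", "::1/128",                          # loopback
    "169.254.0.0/16",   # link-local, which contains 169.254.169.254
    "fc00::/7",                                        # IPv6 ULA
)]


def is_blocked_ip(ip_text: str) -> bool:
    """True if a resolved address falls in one of the rejected ranges."""
    ip = ipaddress.ip_address(ip_text)
    # Membership is False when address and network IP versions differ.
    return any(ip in net for net in BLOCKED_NETWORKS)


def precheck_url(url: str) -> None:
    """Raise ValueError for URLs the rules above would reject on sight."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http(s) URLs are accepted")
    if parts.username is not None or parts.password is not None:
        raise ValueError("URLs with embedded credentials are rejected")
    if not parts.hostname:
        raise ValueError("URL has no host")
```

A pre-check like this only saves a round trip; the server-side check remains authoritative because it validates the address the hostname actually resolves to, at every redirect hop.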
OAuth handoff
For cloud sources where credentials should never traverse chat, call request_oauth_integration_url to get a browser URL:
{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "request_oauth_integration_url",
"arguments": {
"provider": "google",
"integration_type": "ga4"
}
}
}
The response includes an authorize_url and a handoff_id. The LLM hands the URL to the user, who opens it in a browser and clicks Allow. The LLM then polls until the handoff completes:
{
"jsonrpc": "2.0",
"id": 4,
"method": "tools/call",
"params": {
"name": "check_integration_status",
"arguments": {
"handoff_id": "<uuid from request_oauth_integration_url>"
}
}
}
The status flips through pending → connected (or failed / expired after 10 minutes). On connected, the response carries a connection_id and, once preprocessing runs, a data_source_id.
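A hedged sketch of the handoff polling loop follows. As before, call_tool is a hypothetical stand-in for your MCP client's tool-invocation function, and the sketch assumes the result is a dict with a top-level status field. The 600-second ceiling mirrors the 10-minute expiry mentioned above.

```python
import time
from typing import Any, Callable, Dict

# States after which polling should stop, per the description above.
TERMINAL_STATES = {"connected", "failed", "expired"}


def wait_for_handoff(call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],
                     handoff_id: str,
                     interval_s: float = 3.0,
                     max_wait_s: float = 600.0) -> Dict[str, Any]:
    """Poll check_integration_status until the handoff reaches a terminal state."""
    deadline = time.monotonic() + max_wait_s
    while True:
        result = call_tool("check_integration_status",
                           {"handoff_id": handoff_id})
        if result.get("status") in TERMINAL_STATES:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"handoff {handoff_id} still pending")
        time.sleep(interval_s)
```

The caller still has to branch on the returned status: only connected carries a connection_id, and data_source_id may be absent until preprocessing has run.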
Supported providers and integration types
| Provider | integration_type values |
|---|---|
| Google | basic, google_ads, gsheets, ga4, gsc, gdrive, gdocs, bigquery, gcs, gcp |
| Meta | basic, ads |
| Jira | basic |
| Confluence | basic |
After ingestion
Once get_data_source_schema reports status == "active", call run_analysis with a natural-language question. The analysis engine uses the new source automatically.
{
"jsonrpc": "2.0",
"id": 5,
"method": "tools/call",
"params": {
"name": "run_analysis",
"arguments": {
"question": "Which channel drove the most revenue in Q3?",
"wait_seconds": 30
}
}
}
Database connections are deliberately not exposed
Pasting a PostgreSQL or MySQL connection string into chat would leak credentials to wherever the LLM transcript is stored. clariBI's MCP server intentionally does not include a connect_database tool. Connect databases from Settings → Data Sources in the web app.