Built-in

PDF & Image OCR

Upload PDFs, scanned invoices, Word documents, and images. clariBI extracts tables and text automatically using optical character recognition, then turns them into queryable datasets.

Supports PDF, Word (.docx), PowerPoint (.pptx), email (.eml, .msg), and images (PNG, JPG, TIFF). Each page counts toward your monthly OCR quota.

Turn documents into data

Many businesses receive data locked inside PDF reports, scanned invoices, and images. clariBI's OCR capability extracts tables, numbers, and text from these documents and turns them into queryable datasets you can analyze with plain-English questions.

When you upload a PDF, clariBI detects whether it contains selectable text or scanned images. For text-based PDFs, data is extracted directly. For scanned documents and images, OCR (optical character recognition) processes each page to identify and extract tabular data, line items, and structured content.

Beyond PDFs, clariBI also processes Word documents (.docx), PowerPoint presentations (.pptx), email files (.eml, .msg), and images (PNG, JPG, TIFF). Each page processed counts toward your monthly OCR page quota.

Once extracted, the data behaves exactly like any other data source in clariBI. You can run conversational analytics, apply analysis templates, build dashboards, and generate reports from your document data.
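The text-versus-scan decision described above can be pictured with a small sketch. This is not clariBI's implementation: the `Page` type, the `choose_method` function, and the `MIN_TEXT_CHARS` threshold are all hypothetical names chosen for illustration, and the real heuristic may differ.

```python
from dataclasses import dataclass

# Hypothetical threshold: pages with fewer directly readable characters than
# this are treated as scans. The actual rule clariBI uses is not documented here.
MIN_TEXT_CHARS = 20


@dataclass
class Page:
    number: int
    text_layer: str  # text a PDF parser could read directly; empty for pure scans


def choose_method(page: Page) -> str:
    """Return 'direct' when the page has a usable text layer, otherwise 'ocr'."""
    if len(page.text_layer.strip()) >= MIN_TEXT_CHARS:
        return "direct"
    return "ocr"


def plan_extraction(pages: list[Page]) -> dict[int, str]:
    """Map each page number to the extraction method it would be routed to."""
    return {page.number: choose_method(page) for page in pages}


pages = [
    Page(1, "Invoice 1001  Vendor: Acme  Total: 250.00"),  # selectable text
    Page(2, ""),                                           # scanned image only
]
plan = plan_extraction(pages)
print(plan)
```

A mixed document, such as a text-based report with a scanned appendix, would route its pages to different methods under this kind of per-page check.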

How to extract data from documents

Upload a document and clariBI handles the rest. No configuration, no manual mapping, no third-party OCR tools required.

1. Open Data Sources and click "Add Data Source"

From your clariBI workspace, navigate to Data Sources in the left sidebar. Click the + Add Data Source button in the top-right corner of the page. You will see a categorized list of all supported integrations.

2. Select "Image/Document"

Under the Files category, click the Image/Document tile. The upload area will open, ready to accept your PDF, Word file, PowerPoint, email, or image.

3. Upload your document

Drop your file into the upload area or click to browse. clariBI detects the file type automatically. For PDFs, it determines whether OCR is needed (scanned pages) or direct text extraction is sufficient.

4. Automatic OCR processing and extraction

clariBI processes each page, extracts text and tables, and identifies data structures. Each page counts toward your monthly OCR page quota. You can review the extracted tables and text before creating a dataset.

5. Analyze and visualize the extracted data

Once extracted, the data is available for conversational analytics, analysis templates, and auto-generated dashboards like any other data source. Ask "What is the total across all invoices?" or "Show vendor spend by month" and get instant answers.
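To make a question like "Show vendor spend by month" concrete, here is the equivalent aggregation written by hand. It is only a sketch of what the question means, not how clariBI answers it: the `rows` below are made-up sample data, and the column names `vendor`, `invoice_date`, and `total` are assumptions about what an extracted invoice table might contain.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up rows standing in for a table extracted from invoices.
rows = [
    {"vendor": "Acme", "invoice_date": "2024-01-15", "total": "250.00"},
    {"vendor": "Acme", "invoice_date": "2024-01-28", "total": "100.50"},
    {"vendor": "Globex", "invoice_date": "2024-02-03", "total": "75.25"},
]


def vendor_spend_by_month(rows):
    """Sum invoice totals per (vendor, YYYY-MM) pair."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for row in rows:
        month = row["invoice_date"][:7]  # "2024-01-15" -> "2024-01"
        totals[(row["vendor"], month)] += Decimal(row["total"])
    return dict(totals)


spend = vendor_spend_by_month(rows)
grand_total = sum(spend.values(), Decimal("0"))
print(spend)
print(grand_total)
```

Amounts are parsed as `Decimal` rather than `float` so that currency sums stay exact.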

What you can do with document OCR

Once extracted, document data becomes a full data source in clariBI with access to every analytics feature on the platform.

  • Extract tables from scanned PDFs and images

    OCR identifies tabular structures in scanned documents and converts them into clean, queryable datasets automatically.

  • Process text-based and scanned PDFs

    clariBI auto-detects whether a PDF contains selectable text or scanned images and uses the right extraction method for each.

  • Six document formats supported

    PDF, Word (.docx), PowerPoint (.pptx), email (.eml, .msg), and images (PNG, JPG, TIFF) are all accepted and processed.

  • Plain-English queries on extracted data

    Ask questions like "What is the total invoice amount by vendor?" and get instant answers from data that was locked in PDFs moments ago.

  • 440+ pre-built analysis templates

    Apply any of clariBI's 440+ analysis templates to extracted document data. Finance, operations, sales: templates work with any data source.

  • Combine with other data sources

    Use extracted document data alongside database connections, file uploads, and API integrations in multi-source dashboards.

  • Review before committing

    After OCR processing, clariBI shows you the extracted tables and text. You can verify and adjust before creating a dataset.

  • Process email files and attachments

    Upload .eml or .msg files from Outlook and other email clients. clariBI extracts embedded data and processes attachments.

Supported file types

clariBI processes six document and image formats. Each page processed counts toward your monthly OCR quota.

Format                    Details
PDF                       Text-based and scanned PDFs; auto-detects which extraction method to use
Word (.docx)              Microsoft Word documents with tables, text, and embedded data
PowerPoint (.pptx)        Presentation files containing data tables and charts
Email (.eml, .msg)        Email files from Outlook and other clients; attachments also processed
Images (PNG, JPG, TIFF)   Scanned documents, screenshots, and photographs of tables or reports
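If you prepare uploads in bulk, a quick local check against the formats listed above can catch unsupported files before they reach the upload area. The sketch below is a hypothetical helper, not part of clariBI; the extension map is copied from the table, and variants such as `.jpeg` or `.tif` are deliberately left out because this page does not say whether they are accepted.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the table above. Variants like ".jpeg" or ".tif"
# are not listed on the page, so they are treated as unsupported here.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".eml": "Email",
    ".msg": "Email",
    ".png": "Image",
    ".jpg": "Image",
    ".tiff": "Image",
}


def classify(filename: str) -> Optional[str]:
    """Return the format family for a file name, or None if it is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(filename).suffix.lower())


def split_supported(filenames):
    """Partition file names into (supported, rejected) lists, keeping order."""
    supported, rejected = [], []
    for name in filenames:
        (supported if classify(name) else rejected).append(name)
    return supported, rejected


ok, rejected = split_supported(["Invoice.PDF", "q3-deck.pptx", "notes.txt"])
print(ok)
print(rejected)
```

The check looks only at the file extension, so it says nothing about whether the file contents are valid.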

Use cases

Data locked in documents is data you cannot analyze. clariBI's OCR frees that data and makes it available for AI-powered analytics.

Invoice Processing

Extract line items, totals, tax amounts, and vendor details from PDF invoices. Analyze expense patterns across hundreds of invoices without manual data entry.

Financial Statements

Upload bank statements or financial reports in PDF format and extract the data for trend analysis. Track revenue, expenses, and cash flow from documents that only exist as PDFs.

Vendor Reports

Process PDF reports from suppliers, partners, or agencies that do not offer data exports. Extract the tables and numbers they send you and analyze them alongside your other data.

Legacy Data

Digitize data from scanned paper documents and old reports that only exist as PDFs or images. Bring historical data into clariBI for comparison with current metrics.

Form Processing

Extract data from scanned forms, applications, or surveys submitted as images or PDFs. Turn paper-based workflows into structured, analyzable data.

Email Attachments

Process email files (.eml, .msg) and their attachments to extract embedded data and reports. Turn email-based reporting workflows into structured analytics.
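Before uploading an .eml file, it can help to see which attachments it carries. The sketch below uses Python's standard `email` package to list them; `list_attachments` is a hypothetical helper and the demo message is built in memory purely for illustration. It applies to .eml files only, since Outlook .msg files use a different binary format that the standard library does not parse.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def list_attachments(eml_bytes: bytes):
    """Return (filename, content type) for each attachment in an .eml payload."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    return [
        (part.get_filename() or "(unnamed)", part.get_content_type())
        for part in msg.iter_attachments()
    ]


# Build a small demo message in memory instead of reading a real file.
demo = EmailMessage()
demo["Subject"] = "March invoice"
demo.set_content("Invoice attached.")
demo.add_attachment(
    b"%PDF-1.4 placeholder bytes",
    maintype="application",
    subtype="pdf",
    filename="invoice.pdf",
)

attachments = list_attachments(demo.as_bytes())
print(attachments)
```

For a file on disk, pass `Path("message.eml").read_bytes()` (with `pathlib.Path` imported) in place of `demo.as_bytes()`.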

Security & requirements

Documents often contain sensitive business data. Here is how clariBI protects your uploads and what you need to get started.

Security measures

  • Encrypted storage

    Uploaded documents and extracted data are stored with encryption at rest. Files are isolated per organization and never shared across workspaces.

  • TLS in transit

    All document uploads travel over HTTPS, so your files are TLS-encrypted in transit from browser to server.

  • Organization-level isolation

    Each organization's documents are stored in isolated containers. Only authenticated members of your workspace can access uploaded files and extracted data.

  • Audit logging

    All document uploads, OCR processing, and queries against extracted data are logged in clariBI's audit trail.

  • RBAC access controls

    Control which team members can upload documents and query extracted data using clariBI's role-based access control system (Professional plan and above).

Prerequisites

  • Trial or paid plan required

    OCR processing is not available on the Free plan. You need at least a Trial account (40 OCR pages included) or a paid plan to process documents.

  • OCR page quota

    Each page of a document counts toward your monthly OCR quota. The Trial includes 40 pages; paid plans include 100 to 2,000 pages per month, depending on your tier.

  • Supported file formats

    PDF, Word (.docx), PowerPoint (.pptx), email (.eml, .msg), and images (PNG, JPG, TIFF). Other formats are not currently supported for OCR.

  • Document quality

    OCR accuracy depends on document quality. Clear, high-resolution scans produce the best results. Handwritten text is not currently supported.

  • No additional software

    All OCR processing happens on clariBI's servers. You do not need to install any software, plugins, or browser extensions.

Pricing & availability

OCR processing is available on Trial and all paid plans. Each page of a document counts toward your monthly quota. No extra fees beyond your plan.

Plan                     OCR pages / month
Free                     Not available
Trial (14 days)          40 pages
Starter ($99/mo)         100 pages
Professional ($199/mo)   200 pages
Enterprise ($999/mo)     2,000 pages

Annual billing saves up to 17% • The free 14-day trial includes 40 OCR pages to test document processing • See full pricing details
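Because every page counts toward the quota, a little arithmetic shows whether a batch of documents fits a plan. The sketch below uses the allowances from the pricing table above; the function names and the example page counts are hypothetical, and it assumes usage is counted simply as one unit per page, as the page describes.

```python
# Allowances copied from the pricing table above (OCR pages per month).
# The Free plan is omitted because OCR is not available on it.
OCR_PAGES = {"Trial": 40, "Starter": 100, "Professional": 200, "Enterprise": 2000}


def pages_left_after(plan, pages_used, batch_page_counts):
    """Pages remaining after a batch; a negative value means the batch does not fit."""
    return OCR_PAGES[plan] - pages_used - sum(batch_page_counts)


def smallest_paid_plan(pages_per_month):
    """Return the lowest paid tier whose allowance covers the volume, or None."""
    for plan in ("Starter", "Professional", "Enterprise"):
        if pages_per_month <= OCR_PAGES[plan]:
            return plan
    return None


batch = [12, 30, 45]  # page counts of three example documents (87 pages total)
left = pages_left_after("Starter", pages_used=20, batch_page_counts=batch)
print(left)                                   # -7: the batch exceeds Starter by 7 pages
print(smallest_paid_plan(20 + sum(batch)))    # Professional covers 107 pages
```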

Extract data from any document

Start your free 14-day trial and process your first document today. No credit card required.

6 file formats • 40 OCR pages included in trial • Auto table extraction • All paid plans from $99/month