PDF & Image OCR
Upload PDFs, scanned invoices, Word documents, and images. clariBI extracts tables and text automatically using optical character recognition, then turns them into queryable datasets.
Supports PDF, Word (.docx), PowerPoint (.pptx), email (.eml, .msg), and images (PNG, JPG, TIFF). Each page counts toward your monthly OCR quota.
Turn documents into data
Many businesses receive data locked inside PDF reports, scanned invoices, and images. clariBI's OCR capability extracts tables, numbers, and text from these documents and turns them into queryable datasets you can analyze with plain-English questions.
When you upload a PDF, clariBI detects whether it contains selectable text or scanned images. For text-based PDFs, data is extracted directly. For scanned documents and images, OCR (optical character recognition) processes each page to identify and extract tabular data, line items, and structured content.
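The routing described above can be pictured as a per-page decision. The following is a minimal sketch, not clariBI's implementation: the `Page` type, the `MIN_TEXT_CHARS` threshold, and the function names are all hypothetical, and it assumes the PDF's text layer has already been read into a string for each page.

```python
from dataclasses import dataclass

# Hypothetical threshold: a page with fewer selectable characters than this
# is treated as scanned. This is an illustrative value, not a clariBI setting.
MIN_TEXT_CHARS = 20


@dataclass
class Page:
    number: int
    selectable_text: str  # text layer of the page, "" if there is none


def extraction_method(page: Page) -> str:
    """Return "direct" for text-based pages and "ocr" for scanned ones."""
    if len(page.selectable_text.strip()) >= MIN_TEXT_CHARS:
        return "direct"
    return "ocr"


def plan_extraction(pages: list[Page]) -> dict[int, str]:
    """Map each page number to the extraction method chosen for it."""
    return {page.number: extraction_method(page) for page in pages}
```

Deciding per page rather than per file is a design choice in this sketch: it lets a mixed PDF (some text pages, some scanned pages) use the cheaper direct path wherever a text layer exists.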
Beyond PDFs, clariBI also processes Word documents (.docx), PowerPoint presentations (.pptx), email files (.eml, .msg), and images (PNG, JPG, TIFF). Each page processed counts toward your monthly OCR page quota.
Once extracted, the data behaves exactly like any other data source in clariBI. You can run conversational analytics, apply analysis templates, build dashboards, and generate reports from your document data.
How to extract data from documents
Upload a document and clariBI handles the rest. No configuration, no manual mapping, no third-party OCR tools required.
Open Data Sources and click "Add Data Source"
From your clariBI workspace, navigate to Data Sources in the left sidebar. Click the + Add Data Source button in the top-right corner of the page. You will see a categorized list of all supported integrations.
Select "Image/Document"
Under the Files category, click the Image/Document tile. The upload area will open, ready to accept your PDF, Word file, PowerPoint, email, or image.
Upload your document
Drop your file into the upload area or click to browse. clariBI detects the file type automatically. For PDFs, it determines whether OCR is needed (scanned pages) or direct text extraction is sufficient.
Automatic OCR processing and extraction
clariBI processes each page, extracts text and tables, and identifies data structures. Each page counts toward your monthly OCR page quota. You can review the extracted tables and text before creating a dataset.
Analyze and visualize the extracted data
Once extracted, the data is available for conversational analytics, analysis templates, and auto-generated dashboards, just like any other data source. Ask "What is the total across all invoices?" or "Show vendor spend by month" and get instant answers.
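To make the two example questions concrete, here is a rough sketch of the aggregations they correspond to. The rows, field names (`vendor`, `invoice_date`, `amount`), and function names are hypothetical sample data invented for illustration. clariBI answers these questions from plain-English prompts, so no code is needed in the product itself.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical rows, shaped like line items extracted from invoices.
rows = [
    {"vendor": "Acme", "invoice_date": date(2024, 1, 15), "amount": Decimal("120.00")},
    {"vendor": "Acme", "invoice_date": date(2024, 1, 28), "amount": Decimal("80.50")},
    {"vendor": "Globex", "invoice_date": date(2024, 2, 3), "amount": Decimal("300.00")},
]


def total_amount(rows):
    """'What is the total across all invoices?'"""
    return sum((row["amount"] for row in rows), Decimal("0"))


def vendor_spend_by_month(rows):
    """'Show vendor spend by month', keyed by (vendor, "YYYY-MM")."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for row in rows:
        key = (row["vendor"], row["invoice_date"].strftime("%Y-%m"))
        totals[key] += row["amount"]
    return dict(totals)
```

`Decimal` is used instead of `float` so that currency amounts add up exactly.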
What you can do with document OCR
Once extracted, document data becomes a full data source in clariBI with access to every analytics feature on the platform.
- Extract tables from scanned PDFs and images: OCR identifies tabular structures in scanned documents and converts them into clean, queryable datasets automatically.
- Process text-based and scanned PDFs: clariBI auto-detects whether a PDF contains selectable text or scanned images and uses the right extraction method for each.
- Six document formats supported: PDF, Word (.docx), PowerPoint (.pptx), email (.eml, .msg), and images (PNG, JPG, TIFF) are all accepted and processed.
- Plain-English queries on extracted data: Ask questions like "What is the total invoice amount by vendor?" and get instant answers from data that was locked in PDFs moments ago.
- 440+ pre-built analysis templates: Apply any of clariBI's 440+ analysis templates to extracted document data. Finance, operations, sales: templates work with any data source.
- Combine with other data sources: Use extracted document data alongside database connections, file uploads, and API integrations in multi-source dashboards.
- Review before committing: After OCR processing, clariBI shows you the extracted tables and text. You can verify and adjust before creating a dataset.
- Process email files and attachments: Upload .eml or .msg files from Outlook and other email clients. clariBI extracts embedded data and processes attachments.
Supported file types
clariBI processes six document and image formats. Each page processed counts toward your monthly OCR quota.
| Format | Details |
|---|---|
| PDF | Text-based and scanned PDFs; auto-detects which extraction method to use |
| Word (.docx) | Microsoft Word documents with tables, text, and embedded data |
| PowerPoint (.pptx) | Presentation files containing data tables and charts |
| Email (.eml, .msg) | Email files from Outlook and other clients; attachments also processed |
| Images (PNG, JPG, TIFF) | Scanned documents, screenshots, and photographs of tables or reports |
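
If you want to pre-check files before uploading, the supported formats can be expressed as a simple extension lookup. This is a hedged sketch: the `SUPPORTED` mapping and `document_format` helper are hypothetical names, the extensions are only the ones listed on this page, and variants such as `.jpeg` or `.tif` are left out because the page does not say whether they are accepted.

```python
from pathlib import Path
from typing import Optional

# Extensions listed on this page, mapped to their format family.
SUPPORTED = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".eml": "Email",
    ".msg": "Email",
    ".png": "Image",
    ".jpg": "Image",
    ".tiff": "Image",
}


def document_format(filename: str) -> Optional[str]:
    """Return the format family for a filename, or None if unsupported."""
    return SUPPORTED.get(Path(filename).suffix.lower())
```

The lookup is case-insensitive, so `Invoice_0042.PDF` and `invoice_0042.pdf` are treated the same way.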
Use cases
Data locked in documents is data you cannot analyze. clariBI's OCR frees that data and makes it available for AI-powered analytics.
Invoice Processing
Extract line items, totals, tax amounts, and vendor details from PDF invoices. Analyze expense patterns across hundreds of invoices without manual data entry.
Financial Statements
Upload bank statements or financial reports in PDF format and extract the data for trend analysis. Track revenue, expenses, and cash flow from documents that only exist as PDFs.
Vendor Reports
Process PDF reports from suppliers, partners, or agencies that do not offer data exports. Extract the tables and numbers they send you and analyze them alongside your other data.
Legacy Data
Digitize data from scanned paper documents and old reports that only exist as PDFs or images. Bring historical data into clariBI for comparison with current metrics.
Form Processing
Extract data from scanned forms, applications, or surveys submitted as images or PDFs. Turn paper-based workflows into structured, analyzable data.
Email Attachments
Process email files (.eml, .msg) and their attachments to extract embedded data and reports. Turn email-based reporting workflows into structured analytics.
Security & requirements
Documents often contain sensitive business data. Here is how clariBI protects your uploads and what you need to get started.
Security measures
- Encrypted storage: Uploaded documents and extracted data are stored with encryption at rest. Files are isolated per organization and never shared across workspaces.
- TLS in transit: All document uploads are transmitted over HTTPS with TLS encryption. Your files are protected from browser to server.
- Organization-level isolation: Each organization's documents are stored in isolated containers. Only authenticated members of your workspace can access uploaded files and extracted data.
- Audit logging: All document uploads, OCR processing, and queries against extracted data are logged in clariBI's audit trail.
- Role-based access control (RBAC): Control which team members can upload documents and query extracted data using clariBI's role-based access control system (Professional plan and above).
Prerequisites
- Trial or paid plan required: OCR processing is not available on the Free plan. You need at least a Trial account (40 OCR pages included) or a paid plan to process documents.
- OCR page quota: Each page of a document counts toward your monthly OCR quota. Trial includes 40 pages; paid plans include 100 to 2,000 pages depending on your tier.
- Supported file formats: PDF, Word (.docx), PowerPoint (.pptx), email (.eml, .msg), and images (PNG, JPG, TIFF). Other formats are not currently supported for OCR.
- Document quality: OCR accuracy depends on document quality. Clear, high-resolution scans produce the best results. Handwritten text is not currently supported.
- No additional software: All OCR processing happens on clariBI's servers. You do not need to install any software, plugins, or browser extensions.
Pricing & availability
OCR processing is available on Trial and all paid plans. Each page of a document counts toward your monthly quota. No extra fees beyond your plan.
| Plan | OCR pages / month |
|---|---|
| Free | Not available |
| Trial (14 days) | 40 pages |
| Starter ($99/mo) | 100 pages |
| Professional ($199/mo) | 200 pages |
| Enterprise ($999/mo) | 2,000 pages |
Annual billing saves up to 17% • The free 14-day trial includes 40 OCR pages to test document processing
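
Because every page counts toward the quota, it can help to estimate whether a document fits before you upload it. The sketch below uses the page allowances from the pricing table. The function names are hypothetical, and rejecting a document that does not fully fit is an assumption, since this page does not say how clariBI handles a document that exceeds the remaining quota.

```python
# OCR page allowances from the pricing table. None means OCR is unavailable.
MONTHLY_OCR_PAGES = {
    "Free": None,
    "Trial": 40,
    "Starter": 100,
    "Professional": 200,
    "Enterprise": 2000,
}


def pages_remaining(plan: str, pages_used: int) -> int:
    """Pages left in the current period, never negative."""
    quota = MONTHLY_OCR_PAGES[plan]
    if quota is None:
        return 0
    return max(quota - pages_used, 0)


def can_process(plan: str, pages_used: int, document_pages: int) -> bool:
    """Assumption: a document is processed only if all its pages fit."""
    return document_pages <= pages_remaining(plan, pages_used)
```

For example, a Starter workspace that has used 85 pages has 15 left, so a 12-page PDF fits and a 20-page PDF does not.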
Extract data from any document
Start your free 14-day trial and process your first document today. No credit card required.
6 file formats • 40 OCR pages included in trial • Auto table extraction • All paid plans from $99/month