
HTML Table to Text - Extract and convert HTML tables to plain text formats

This is a simple (optionally interactive) script that can extract any or all tables from a given HTML file or URL. The data can be output to CSV (comma-separated values), TSV (tab-separated values), Markdown, Asciidoc, or raw HTML.

Requirements

This script relies on Nokogiri to parse HTML. You can install it with:

gem install nokogiri

Markdown conversion uses reverse_markdown, which can be installed the same way:

gem install reverse_markdown

Asciidoc conversion uses the reverse_adoc gem:

gem install reverse_adoc

Usage

To extract tables from an arbitrary URL, run the webtable_to_text.rb script with the -u option followed by the URL:

./webtable_to_text.rb -u [URL]

For example:

./webtable_to_text.rb -u "https://en.wikipedia.org/wiki/Gabon"

This will print out all the tables found on the specified page.

To output a specific table only, use the -n option, followed by the number of the table:

./webtable_to_text.rb -u "https://en.wikipedia.org/wiki/Gabon" -n 3
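
As a rough illustration of table selection, the sketch below turns a -n style argument into a list of tables. The helper name `select_tables` is hypothetical, and the sketch assumes table numbers start at 1, which this README does not state explicitly.

```ruby
# Hypothetical helper: turns a -n style argument such as "1,3" into
# the matching tables. Assumes table numbers are 1-based.
def select_tables(tables, spec)
  numbers = spec.split(',').map { |part| Integer(part.strip, 10) }
  numbers.map do |n|
    raise ArgumentError, "no table number #{n}" unless n.between?(1, tables.length)

    tables[n - 1]
  end
end

tables = %w[first second third]
p select_tables(tables, '3')    # prints ["third"]
p select_tables(tables, '1,3')  # prints ["first", "third"]
```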

The script also works with local files, using the -f option, e.g.:

./webtable_to_text.rb -f some_file.html
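
One way to handle both kinds of source is sketched below, using Ruby's standard open-uri library for URLs. The helper name `load_html` and its keyword arguments are assumptions for illustration and may not match how the script reads its input.

```ruby
require 'open-uri'

# Hypothetical helper: returns the HTML source from either a URL (-u)
# or a local file (-f). Exactly one of the two should be given.
def load_html(url: nil, file: nil)
  raise ArgumentError, 'specify either url: or file:, not both' if url && file
  return URI.open(url, &:read) if url
  return File.read(file) if file

  raise ArgumentError, 'no source given'
end

# Usage (commented out because it needs a real file or network access):
# html = load_html(file: 'some_file.html')
# html = load_html(url: 'https://en.wikipedia.org/wiki/Gabon')
```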

Interactive mode

To use interactive mode, add the -i option to the command and specify a URL or file as normal. For example:

./webtable_to_text.rb -u "https://en.wikipedia.org/wiki/Gabon" -i

This will print a message with the total number of tables found in the document. If you enter a number at the prompt, it will print the corresponding table. Otherwise, pressing ENTER or RETURN will print all tables found.
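
The prompt behavior described above could be sketched like this. The function name `prompt_for_table`, the wording of the prompt, and the 1-based numbering are all assumptions; the input and output streams are parameters so that the logic can be tried without a terminal.

```ruby
require 'stringio'

# Hypothetical sketch of the prompt logic: returns the list of table
# numbers to print. An empty answer (ENTER) selects every table.
def prompt_for_table(table_count, input: $stdin, output: $stdout)
  output.puts "#{table_count} tables found. Enter a table number, or press ENTER for all:"
  answer = input.gets.to_s.strip
  return (1..table_count).to_a if answer.empty?

  number = Integer(answer, 10)
  raise ArgumentError, "no table number #{number}" unless number.between?(1, table_count)

  [number]
end

# Simulate a user typing "3" followed by ENTER.
p prompt_for_table(5, input: StringIO.new("3\n"))
```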

For example, entering 3 will print something like the following:

Population in Gabon
Year	Million 
1950	0.5 
2000	1.2 
2016	2
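
To illustrate how the same rows differ between the TSV and CSV formats, here is a minimal sketch using Ruby's standard csv library. The helper names `to_tsv` and `to_csv` are hypothetical and the rows are copied from the example above; the script's own formatting code may differ.

```ruby
require 'csv'

# Rows taken from the example output above.
ROWS = [
  %w[Year Million],
  %w[1950 0.5],
  %w[2000 1.2],
  %w[2016 2]
].freeze

# Tab-separated: join the cells of each row with a tab character.
def to_tsv(rows)
  rows.map { |cells| cells.join("\t") }.join("\n")
end

# Comma-separated: the csv library quotes cells containing commas or quotes.
def to_csv(rows)
  rows.map { |cells| CSV.generate_line(cells) }.join
end

puts to_tsv(ROWS)
puts to_csv(ROWS)
```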

To run the tests, enter the following command:

ruby tests.rb

Options

The following options are available:

  • -A, --all: Print all tables found on the specified page
  • -a, --asciidoc: Output in asciidoc/asciidoctor format
  • -c, --csv: Output in CSV / comma separated values format
  • -f, --file FILE: Specify HTML input file as source for extracting tables
  • -h, --help: Print help text
  • -i, --interactive: Interactive mode
  • -m, --markdown: Output in markdown format
  • -n, --number NUM: Print specific table number only; separate multiple numbers with commas
  • -o, --output FILE: Specify output file (default: output to STDOUT)
  • -r, --raw: Output raw table HTML
  • -t, --tsv: Output in TSV / tab separated values format (default)
  • -u, --url URL: Specify URL as source for extracting tables
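
For readers curious how such a set of options can be declared, the sketch below uses Ruby's standard optparse library. It is an assumption-laden illustration rather than the script's real code: the function name `parse_options` and the keys of the returned hash are made up for the example. The -h/--help option is provided by OptionParser itself.

```ruby
require 'optparse'

# Hypothetical sketch: parses the options listed above into a hash.
# TSV is the default output format, as the option list states.
def parse_options(argv)
  options = { format: :tsv }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: webtable_to_text.rb [options]'
    opts.on('-A', '--all', 'Print all tables') { options[:all] = true }
    opts.on('-a', '--asciidoc', 'Output in AsciiDoc format') { options[:format] = :asciidoc }
    opts.on('-c', '--csv', 'Output in CSV format') { options[:format] = :csv }
    opts.on('-f', '--file FILE', 'HTML input file') { |v| options[:file] = v }
    opts.on('-i', '--interactive', 'Interactive mode') { options[:interactive] = true }
    opts.on('-m', '--markdown', 'Output in Markdown format') { options[:format] = :markdown }
    opts.on('-n', '--number NUM', 'Table number(s), comma separated') do |v|
      options[:numbers] = v.split(',').map { |n| Integer(n, 10) }
    end
    opts.on('-o', '--output FILE', 'Output file') { |v| options[:output] = v }
    opts.on('-r', '--raw', 'Output raw table HTML') { options[:format] = :raw }
    opts.on('-t', '--tsv', 'Output in TSV format') { options[:format] = :tsv }
    opts.on('-u', '--url URL', 'Source URL') { |v| options[:url] = v }
  end
  parser.parse(argv)
  options
end

p parse_options(%w[-u https://en.wikipedia.org/wiki/Gabon -n 1,3 -m])
```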

To do

  • add options and non-interactive mode
  • output to raw HTML
  • output to Markdown
  • output to AsciiDoc

Credits

License

MIT.
