Journalist/developer. Storytelling developer @ USA Today Network. Builder of @HomicideWatch. Sinophile for fun. Past: @frontlinepbs @WBUR, @NPR, @NewsHour.

Colorado Sun politics reporter kicked out of GOP state assembly

chrisamico
7 days ago
Boston, MA

Israel must be held to account for the targeting and killing of journalists

Protest in Tel Aviv against the Netanyahu government last June. Photo (cc) 2023 by RG TLV.

CNN media reporter Oliver Darcy wrote an important analysis last week about journalists who have been killed by Israeli forces in the Gaza war. Citing figures from the Committee to Protect Journalists, Darcy observes that at least 95 journalists have been killed since Hamas’ terrorist attack on Israel last Oct. 7, and that all but five of those journalists are Palestinian — the highest death toll for members of the press since CPJ began tracking such casualties in 1992.

In addition to deaths that might be attributed to the fog of war, there have also been killings that Israel carried out despite what appear to be clear indications that it was targeting media workers. Darcy writes that the United Nations recently finished a report showing that Reuters journalist Issam Abdallah had been killed in southern Lebanon after a tank fired at a group of “clearly identified journalists.” Israeli officials responded to the U.N. that Israel “does not deliberately shoot at civilians, including journalists.”

In addition, The Washington Post last week found that a Jan. 7 missile attack resulting in the deaths of two Al Jazeera journalists and two freelancers in southern Gaza may have lacked any military justification. The Israeli military claimed it had “identified and struck a terrorist who operated an aircraft that posed a threat to IDF troops” — but the Post found that the “aircraft” was a drone apparently being used for reporting purposes.

Darcy includes accounts of Palestinian journalists who have allegedly been abused by Israeli forces as well — a topic that is the subject of a new report from CPJ, which “found multiple kinds of incidents of journalists being targeted while carrying out their work in Israel and the two Palestinian territories, Gaza and the West Bank” as well as the deaths of journalists’ families.

CPJ has posted an open letter signed by 36 leaders of top U.S. and international news organizations calling on Israel to end its attacks on journalists. Among the Americans who signed the letter are Julie Pace, the executive editor of The Associated Press; Mark Thompson, the chair and CEO of CNN; A.G. Sulzberger, the publisher of The New York Times; Sally Buzbee, the executive editor of The Washington Post; Kim Godwin, the president of ABC News; and Rebecca Blumenstein, the president of editorial at NBC News. Significantly, the international news leaders signing the letter include Aluf Benn, the editor-in-chief of the Israeli newspaper Haaretz. The letter includes this:

Journalists are civilians and Israeli authorities must protect journalists as noncombatants according to international law. Those responsible for any violations of that longstanding protection should be held accountable. Attacks on journalists are also attacks on truth. We commit to championing the safety of journalists in Gaza, which is fundamental for the protection of press freedom everywhere.

This weekend, as NPR reports, tens of thousands of Israelis demonstrated against the government of Prime Minister Benjamin Netanyahu, calling for a deal with Hamas to release the more than 100 hostages the terrorist group is still believed to be holding.

The horrendous situation in the Middle East began with Hamas’ attacks, claiming some 1,200 lives and leading to Israel’s invasion of Gaza, which has killed more than 30,000 people, mostly civilians. Starvation looms. President Biden has ever-so-slowly been backing away from the Netanyahu government, allowing a U.N. Security Council resolution calling for a cease-fire and the release of the hostages to take effect.

Israel’s targeting of media workers is a small part of a much larger picture — a horrendous problem that would seem to have no good solution. But let’s start with this: Journalists are the world’s eyes and ears. They need to be able to tell us what is taking place on the ground without fear of being killed.

chrisamico
13 days ago
Boston, MA

Running OCR against PDFs and images directly in your browser

30th March 2024

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?

I’ve been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently—I’ll write more about that soon. But those tools are still inconvenient for most people to use.

Meanwhile, older tools like Tesseract OCR are still extremely useful—if only they were easier to use as well.

Then I remembered that Tesseract runs happily in a browser these days thanks to the excellent Tesseract.js project. And PDFs can be processed using JavaScript too thanks to Mozilla’s extremely mature and well-tested PDF.js library.

So I built a new tool!

tools.simonwillison.net/ocr provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.

Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.

Here’s an animated demo:

First an image file is dragged onto the page, which then shows that image and accompanying OCR text. Then the drop zone is clicked and a PDF file is selected - that PDF is rendered a page at a time down the page with OCR text displayed beneath each page.

It’s not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.

But... having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.

How I built this

For more recent examples of projects I’ve built with the assistance of LLMs, see Building and testing C extensions for SQLite with ChatGPT Code Interpreter and Claude and ChatGPT for ad-hoc sidequests.

I built the first version of this tool in just a few minutes, using Claude 3 Opus.

I already had my own JavaScript code lying around for the two most important tasks: running Tesseract.js against an image and using PDF.js to turn a PDF into a series of images.

The OCR code came from the system I built and explained in How I make annotated presentations (built with the help of multiple ChatGPT sessions). The PDF to images code was from an unfinished experiment which I wrote with the aid of Claude 3 Opus a week ago.

I composed the following prompt for Claude 3, where I pasted in both of my code examples and then added some instructions about what I wanted it to build at the end:

This code shows how to open a PDF and turn it into an image per page:

<!DOCTYPE html>
<html>
<head>
  <title>PDF to Images</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js"></script>
  <style>
    .image-container img {
      margin-bottom: 10px;
    }
    .image-container p {
      margin: 0;
      font-size: 14px;
      color: #888;
    }
  </style>
</head>
<body>
  <input type="file" id="fileInput" accept=".pdf" />
  <div class="image-container"></div>

  <script>
    const desiredWidth = 800;
    const fileInput = document.getElementById('fileInput');
    const imageContainer = document.querySelector('.image-container');

    fileInput.addEventListener('change', handleFileUpload);

    pdfjsLib.GlobalWorkerOptions.workerSrc = 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js';

    async function handleFileUpload(event) {
      const file = event.target.files[0];
      const imageIterator = convertPDFToImages(file);

      for await (const { imageURL, size } of imageIterator) {
        const imgElement = document.createElement('img');
        imgElement.src = imageURL;
        imageContainer.appendChild(imgElement);

        const sizeElement = document.createElement('p');
        sizeElement.textContent = `Size: ${formatSize(size)}`;
        imageContainer.appendChild(sizeElement);
      }
    }

    async function* convertPDFToImages(file) {
      try {
        const pdf = await pdfjsLib.getDocument(URL.createObjectURL(file)).promise;
        const numPages = pdf.numPages;

        for (let i = 1; i <= numPages; i++) {
          const page = await pdf.getPage(i);
          const viewport = page.getViewport({ scale: 1 });
          const canvas = document.createElement('canvas');
          const context = canvas.getContext('2d');
          canvas.width = desiredWidth;
          canvas.height = (desiredWidth / viewport.width) * viewport.height;
          const renderContext = {
            canvasContext: context,
            viewport: page.getViewport({ scale: desiredWidth / viewport.width }),
          };
          await page.render(renderContext).promise;
          const imageURL = canvas.toDataURL('image/jpeg', 0.8);
          const size = calculateSize(imageURL);
          yield { imageURL, size };
        }
      } catch (error) {
        console.error('Error:', error);
      }
    }

    function calculateSize(imageURL) {
      const base64Length = imageURL.length - 'data:image/jpeg;base64,'.length;
      const sizeInBytes = Math.ceil(base64Length * 0.75);
      return sizeInBytes;
    }

    function formatSize(size) {
      const sizeInKB = (size / 1024).toFixed(2);
      return `${sizeInKB} KB`;
    }
  </script>
</body>
</html>

This code shows how to OCR an image:

async function ocrMissingAltText() {
    // Load Tesseract
    var s = document.createElement("script");
    s.src = "https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js";
    document.head.appendChild(s);

    s.onload = async () => {
      const images = document.getElementsByTagName("img");
      const worker = Tesseract.createWorker();
      await worker.load();
      await worker.loadLanguage("eng");
      await worker.initialize("eng");
      ocrButton.innerText = "Running OCR...";

      // Iterate through all the images in the output div
      for (const img of images) {
        const altTextarea = img.parentNode.querySelector(".textarea-alt");
        // Check if the alt textarea is empty
        if (altTextarea.value === "") {
          const imageUrl = img.src;
          var {
            data: { text },
          } = await worker.recognize(imageUrl);
          altTextarea.value = text; // Set the OCR result to the alt textarea
          progressBar.value += 1;
        }
      }

      await worker.terminate();
      ocrButton.innerText = "OCR complete";
    };
  }

Use these examples to put together a single HTML page with embedded HTML and CSS and JavaScript that provides a big square which users can drag and drop a PDF file onto and when they do that the PDF has every page converted to a JPEG and shown below on the page, then OCR is run with tesseract and the results are shown in textarea blocks below each image.

I saved this prompt to a prompt.txt file and ran it using my llm-claude-3 plugin for LLM:

llm -m claude-3-opus < prompt.txt

It gave me a working initial version on the first attempt!

A square dotted border around the text Drag and drop PDF file here

Here’s the full transcript, including my follow-up prompts and their responses. Iterating on software in this way is so much fun.

First follow-up:

Modify this to also have a file input that can be used—dropping a file onto the drop area fills that input

make the drop zone 100% wide but have a 2em padding on the body. it should be 10em high. it should turn pink when an image is dragged over it.

Each textarea should be 100% wide and 10em high

At the very bottom of the page add a h2 that says Full document—then a 30em high textarea with all of the page text in it separated by two newlines

Here’s the interactive result.

A PDF file is dragged over the box and it turned pink. The heading Full document displays below

And then:

get rid of the code that shows image sizes. Set the placeholder on each textarea to be Processing... and clear that placeholder when the job is done.

Which gave me this.

I noticed that it didn’t demo well on a phone, because you can’t drag and drop files in a mobile browser. So I fired up ChatGPT (for no reason other than curiosity to see how well it did) and got GPT-4 to add a file input feature for me. I pasted in the code so far and added:

Modify this so jpg and png and gif images can be dropped or opened too—they skip the PDF step and get appended to the page and OCRd directly. Also move the full document heading and textarea above the page preview and hide it u til there is data to be shown in it
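
The routing that prompt asks for can be sketched as a small pure function. This is a hedged illustration, not the code GPT-4 produced: the names `classifyFile` and `IMAGE_TYPES` are hypothetical, and it assumes the browser reports a MIME type on the dropped file's `type` property.

```javascript
// Hypothetical helper: decide how a dropped or selected file should be handled.
// classifyFile and IMAGE_TYPES are illustrative names, not taken from the tool.
const IMAGE_TYPES = new Set(['image/jpeg', 'image/png', 'image/gif']);

function classifyFile(file) {
  if (file.type === 'application/pdf') {
    return 'pdf'; // render each page to an image first, then OCR each page
  }
  if (IMAGE_TYPES.has(file.type)) {
    return 'image'; // skip the PDF step and OCR the image directly
  }
  return 'unsupported'; // anything else is ignored in this sketch
}
```

A drop handler could call `classifyFile(event.dataTransfer.files[0])` and branch on the result.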

Then I spotted that the Tesseract worker was being created multiple times in a loop, which is inefficient—so I prompted:

Create the worker once and use it for all OCR tasks and terminate it at the end
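
A minimal sketch of that create-once pattern, under stated assumptions: `ocrAll` is a hypothetical name, and the worker factory is passed in as `createWorker` so the example does not depend on any particular Tesseract.js version. It only assumes the worker exposes `recognize()` resolving to `{ data: { text } }` and `terminate()`, as in the OCR snippet from the original prompt.

```javascript
// Sketch: create one OCR worker, reuse it for every image, terminate it at the end.
// createWorker is an injected factory (assumption); in the browser it would wrap
// whatever worker-creation call the installed Tesseract.js version provides.
async function ocrAll(imageUrls, createWorker) {
  const worker = await createWorker(); // created once, outside the loop
  const texts = [];
  try {
    for (const url of imageUrls) {
      const { data: { text } } = await worker.recognize(url);
      texts.push(text);
    }
  } finally {
    await worker.terminate(); // released even if one recognize() call fails
  }
  return texts;
}
```

Joining the returned array with two newlines gives the combined text for the "Full document" textarea.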

I’d tweaked the HTML and CSS a little before feeding it to GPT-4, so now the site had a title and rendered in Helvetica.

Here’s the version GPT-4 produced for me.

A heading reads OCR a PDF or Image - This tool runs entirely in your browser. No files are uploaded to a server. The dotted box now contains text that reads Drag and drop a PDF, JPG, PNG, or GIF file here or click to select a file

Rather delightfully it used the neater pattern where the file input itself is hidden but can be triggered by clicking on the large drop zone, and it updated the copy on the drop zone to reflect that—without me suggesting those requirements.

Manual finishing touches

Fun though it was iterating on this project entirely through prompting, I decided it would be more productive to make the finishing touches myself. You can see those in the commit history. They’re not particularly interesting:

  • I added Plausible analytics (which I like because they use no cookies).
  • I added better progress indicators, including the text that shows how many pages of the PDF have been processed so far.
  • I bumped up the width of the rendered PDF page images from 800 to 1000. This seemed to improve OCR quality—in particular, the Claude 3 model card PDF now has fewer OCR errors than it did before.
  • I upgraded both Tesseract.js and PDF.js to the most recent versions. Unsurprisingly, Claude 3 Opus had used older versions of both libraries.
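
To make the width change concrete, here is a hedged sketch of the scaling arithmetic the PDF-to-images code performs. `scaledDimensions` is a hypothetical helper, and the 612 x 792 page size is an assumed US Letter page at scale 1, not a measurement from the tool.

```javascript
// Sketch: canvas dimensions for rendering a PDF page at a target pixel width.
// pageWidth and pageHeight are the page's size at scale 1, as a PDF.js viewport reports them.
function scaledDimensions(pageWidth, pageHeight, desiredWidth) {
  const scale = desiredWidth / pageWidth;
  return {
    scale,
    width: desiredWidth,
    height: Math.round(pageHeight * scale),
  };
}

// Assumed example page: US Letter, 612 x 792 points at scale 1.
const at800 = scaledDimensions(612, 792, 800);
const at1000 = scaledDimensions(612, 792, 1000);
console.log(at800.height, at1000.height); // 1035 1294
```

Going from 800 to 1000 pixels wide scales both dimensions by 1.25, so each page image carries roughly 56% more pixels for Tesseract to work with.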

I’m really pleased with this project. I consider it finished—it does the job I designed it to do and I don’t see any need to keep on iterating on it. And because it’s all static JavaScript and WebAssembly I expect it to continue working effectively forever.

chrisamico
15 days ago
Boston, MA

Eclipse 2024 - Andy Woodruff, cartographer

Eclipse and elevation data by NASA. Base map data and land cover colors from Natural Earth/Tom Patterson.

chrisamico
20 days ago
Boston, MA

Open Source Spotlight: Old Insurance Maps — Development Seed

chrisamico
26 days ago
Boston, MA

American politics is undergoing a racial realignment

chrisamico
31 days ago
Boston, MA