Extracting Metadata from PDF Files with JabRef

Since version 2.7 Beta 1, the reference manager JabRef has used Mr. dLib to extract metadata from PDF files and automatically create BibTeX entries. This page answers some questions you might ask yourself.

What and who is Mr. dLib?

Mr. dLib is a machine-readable digital library from the SciPlore and Docear team. We aim to provide metadata for millions of academic articles via a free REST-based API. Mr. dLib is currently under construction. However, we have already developed some tools and services that we would like to share, such as metadata extraction from PDF files. JabRef is our first guinea pig :-).

How does metadata extraction work?

In short: your PDF file is transferred to our servers, we analyse it, and we return the metadata. The long version: First, your PDF file is transferred to our servers. Second, we extract as much metadata as possible from the full text of the PDF. For this, we use some self-developed algorithms and existing tools based on machine learning approaches. Third, from the extracted metadata we build something like a fingerprint and check our database (which already contains metadata for a few million academic articles) for more metadata for your PDF file. That's it :-)
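
The fingerprint lookup in the last step can be illustrated with a small sketch. This is not the actual Mr. dLib implementation, which is not described here. It assumes that the fingerprint is a hash of the normalized title and that the database is a simple in-memory dictionary. The function names (normalize, fingerprint, enrich) and the sample record are made up for illustration.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def fingerprint(title):
    """Assumption: the fingerprint is a SHA-1 hash of the normalized title."""
    return hashlib.sha1(normalize(title).encode("utf-8")).hexdigest()


def enrich(extracted, database):
    """Fill gaps in the extracted metadata from a stored record, if any.

    Fields extracted from the PDF take precedence over stored fields.
    """
    stored = database.get(fingerprint(extracted["title"]), {})
    return {**stored, **extracted}


# Hypothetical database with one invented record, keyed by fingerprint.
record = {
    "title": "An Example Paper on Metadata",
    "author": "A. Author",
    "year": "2010",
}
database = {fingerprint(record["title"]): record}

# Only the title was extracted from the PDF, with different punctuation.
result = enrich({"title": "An Example Paper: on Metadata"}, database)
print(result)
```

Because the title is normalized before hashing, small differences in case, punctuation, or accents still lead to the same database entry. A real service would probably need fuzzier matching than an exact hash.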

How good is the metadata extraction?

To be honest: not too good... yet! In about 80% of all cases you should get a correct title. In about 50% of cases you should get some more information, such as the author(s). And if your article is in the field of computer science, there is a good chance of getting a complete set of metadata. However, we consider the metadata extraction to be in alpha status. From around April 2011 we will spend a lot of time improving the algorithms and performance, and we are confident that we will achieve excellent results.

You can help us, though: whenever there is a PDF file for which you did not get the correct title, please send this file to us via email. You can also try our test PDF. If you don't get a complete set of metadata for this PDF, please tell us.

Why is the metadata extraction so slow?

Depending on your internet connection, uploading a PDF might take a while. We are working on this problem. We will probably modify JabRef so that only the first page is transferred instead of the entire PDF, which should be sufficient for metadata extraction in most cases.

Why don't you put the functionality into JabRef directly?

First, our tools are not platform independent as JabRef is (they run only on Linux). Second, offering tools for download is simply not the concept of Mr. dLib. Our goal is to offer services via an API that anybody can access from anywhere without a lot of programming knowledge. Third, we will improve our service much more frequently than JabRef releases updates. This way you always have the most current version of our service.

Do you store any private data or my PDF files on your server?

No! Once your PDF file has been analyzed on our servers, it is deleted. Also, no other data is transmitted to or stored on our servers.

Can I use Mr. dLib for my software?

Sure. You can have a look at the source code of JabRef and see how it uses our service. Alternatively, you can wait until April 2011, when we will release documentation for Mr. dLib, or you can contact us and tell us what exactly you want to use Mr. dLib for.