Jamzy Wang

life is a struggle,be willing to do,be happy to bear~~~

[转]如何实现科技论文里面的算法

2014-11-16 14:30

原文链接

英文原文:How to implement an algorithm from a scientific paper

This article is a short guide to implementing an algorithm from a scientific paper. I have implemented many complex algorithms from books and scientific publications, and this article sums up what I have learned while searching, reading, coding and debugging. This is obviously limited to publications in domains related to the field of Computer Science. Nevertheless, you should be able to apply the guidelines and good practices presented below to any kind of paper or implementation.

1 – Before you jump in

There are a few points you should review before you jump into reading a technical paper and implementing it. Make sure you cover them carefully each time you are about to start working on such a project.

这是一篇关于如何实现科研论文中算法的简要指南。作者曾实现过很多书本上和科研论文中的复杂算法,在这篇文章中作者总结他在研究,阅读,编码和调试时积累的大量经验。很显然,这篇文章主要集中在和计算机科学相关的研究领域中。 然而,你也可以在其他任何领域的论文中使用下面提及的准则。

1. – 开始之前

在你开始阅读一篇论文和实现它之前,有几个你需要注意的地方。并请确保每次你要开始类似的项目之前,都仔细的注意过这几个方面。

1.1 – Find an open source implementation to avoid coding it

Unless you want to implement the paper for the purpose of learning more about the field, you have no need to implement it. Indeed, what you want is not coding the paper, but just the code that implements the paper. So before you start anything, you should spend a couple of days trying to find an open source implementation on the internet. Just think about it: would you rather lose two days looking for the code, or waste two months implementing an algorithm that was already available?

1.2 – Find simpler ways to achieve your goal

Ask yourself what you are trying to do, and if simpler solutions would work for what you need. Could you use another technique – even if the result is only 80% of what you want – that does not require to implement the paper, and that you could get running within the next two days or so with available open source libraries? For more regarding this, see my article The 20 / 80 Productivity Rule.

1.1 – 看看是不是已经有开源软件实现过

除非你是为了纯粹的学习目的而去实现一篇文论中的算法, 否则你不必一定要实现它。实际上,你所需要的不是自己去实现它的过程,而是已经实现了这个算法的代码。所以在你开始编码之前,你应该花一些时间去找找网上是不是已经有实现了这个算法的开源软件。想想你是愿意花两天时间去找到已经完成的代码,还是浪费两个月的时间去实现一个别人已经实现的了算法呢?   

1.2 – 用最简单的方法达到你的目的

你要先搞清楚你想要达到的目标是什么,以及是不是有简单的方法已经可以达到你的目标。 或许你可以尝试使用另外的技术,即使它只能达到你80%的目标,但它不需要你去实现一篇论文。然后你可以再花几天的时间去尝试是不是可以用开源软件运行起来。关于这个更多的详情,请参考我的另外一篇文章 The 20 / 80 Productivity Rule.

1.3 – Beware of software patents

If you are in the U.S., beware of software patents. Some papers are patented and you could get into trouble for using them in commercial applications.

1.4 – Learn more about the field of the paper

If you are reading a paper about the use of Support Vector Machines (SVM) in the context of Computational Neuroscience, then you should read a short introduction to Machine Learning and the different types of classifiers that could be alternatives to SVM, and you should as well read general articles about Computational Neuroscience to know what is being done in research right now.

1.5 – Stay motivated

If you have never implemented a paper and/or if you are new to the domain of the paper, then the reading can be very difficult. Whatever happens, do not let the amount and the complexity of the mathematical equations discourage you. Moreover, speed is not an issue: even if you feel that you understand the paper slower than you wish you would, just keep on working, and you will see that you will slowly and steadily understand the concepts presented in the paper, and pass all difficulties one after the other.

1.3 – 注意软件的专利

如果你再美国, 你需要注意软件的专利问题。有一些论文是注册了专利的,你可能因为在商业软件中使用了它的算法而惹上麻烦。

1.4 – 关注更多该领域的论文

如果你正在研究一篇关于在计算神经学领域中使用支持向量机(SVM)的论文的话,那你应该再读一些关于机器学习以及其他可替代支持向量机(SVM)的分类算法的介绍。同时,你还可以读一些关于计算神经学领域的文章,看看学术界正在研究什么.

1.5 – 保持积极性

如果你还没有实现过一篇论文或你正在研究一个新领域内的论文,这样的阅读研究是很困难的。不管发生什么,不要让那些复杂的数学公式吓到你。而且,不要去担心进度的问题,即使你感觉理解这篇论文要比你预计的要慢很多, 坚持做下去你会发现你慢慢的就理解了这篇论文中所要表述的概念。 

2 – Three kinds of papers

It is never a good idea to pick a random paper and start implementing in right away. There are a lot of papers out there, which means there is a lot of garbage. All publications can fit into three categories:

2.1 – The groundbreaking paper

Some really interesting, well-written, and original research. Most of these papers are coming out of top-tier universities, or out of research teams in smaller universities that have been tackling the problem for about six to ten years. The later is easy to spot: they reference their own publications in the papers, showing that they have been on the problem for some time now, and that they base their new work on a proven record of publications. Also, the groundbreaking papers are generally published in the best journals in the field.

2 – 三种类型的论文

随便选一篇论文,马上开始实现绝对不是一个好注意. 世界上有很多很多的论文,这也意味着有很多很多的垃圾论文。所以发表的论文可以分成三类。

2.1 – 创新型论文

这是一些非常有趣的,写的很好的,原创性的论文。大部分这类论文来自世界顶尖大学或者是已经研究某个领域很久的小学校的研究团队。后者很好区分,他们往往都在自己论文中引用自己之前的研究成果,已展示他们对这个问题的研究已经很长时间了, 而且他们最新的研究工作都是基于已经证实了的论文之上。 同时,这类型的论文基本都在该领域内最好的期刊杂志上发表。

2.2 – The copycat paper

Some research group that is just following the work of the groundbreaking teams, proposing improvements to it, and publishing their results of the improvements. Many of these papers lack proper statistical analysis and wrongly conclude that the improvements are really beating the original algorithm. Most of the time, they really are not bringing anything except for unnecessary additional complexity. But not all copycats are bad. Some are good, but it’s rare.

2.3 – The garbage paper

Some researchers really don’t know what they are doing and/or are evil. They just try to maintain their status and privileges in the academic institution at which they teach. So they need funding, and for that they need to publish, something, anything. The honest ones will tell you in the conclusion that they failed and that the results are accurate only N% of the time (with N being a bad value). But some evil ones will lie, and say that their research was a great success. After some time reading publications, it becomes easy to spot the garbage paper and ditch them.

2.2-模仿型论文

一些研究群体仅仅只是跟随哪些有突破性创新的团体,他们的目的就是提高已经发表的论文(算法),并且发表他们提高后的结果。很多这样的论文缺少合适的统计分析和错误的结论,这样的提高其实是破坏了原有的算法。很多时候,他们这样做并没有带来任何东西,除了带来不必要与而外的复杂性。但也并不是所有的模仿是不好的,有些还是不错的,但却很稀少。

2.3-垃圾型论文

一些研究者并不知道他们做的事情是非常不好的。他们仅仅是为了维持他们在自己学术领域中的状态与特权。所以他们需要资金,并且他们需要出版任何他们需要的东西。比较诚实的研究者会告诉你,在结论中他们失败了,在结果中只有N%的时间(N是有比较差的值)。但一些心怀不轨的研究员会说谎话,他们会说他们的研究取得了巨大的成功。在阅读出版物一段时间以后,很容易就能发现那些垃圾型论文,并抛弃他们。

3 – How to read a scientific paper

A lot has already been written on the topic, so I am not going to write much about it. A good starting point is: How to Read a Paper by Srinivasan Keshav. Below are a few points that I found useful while I was reading scientific publications.

3.1 – Find the right paper

What you want to implement is an original paper, one that started a whole domain. It is sometimes okay to pick a copycat paper, if you feel that it brings real improvements and consistency to a good but immature groundbreaking paper.

So let’s say you have a paper as your starting point. You need to do some research in its surroundings. For that, the strategy is to look for related publications, and for the publications being listed in the “References” section at the end of the paper. Go on Google Scholar and search for the titles and the authors. Does any of the papers you found do a better job than the paper you had originally? If yes, then just ditch the paper you were looking at in the first place, and keep the new one you found. Another cool feature of Google Scholar is that you can find papers that cite a given paper. This is really great, because all you have to do is to follow the chain of citations from one paper to the next, and you will find the most recent papers in the field. Finding the good paper from a starting point is all about looking for papers being cited by the current paper, and for papers citing the current paper. By moving back and forth in time you should find the paper that is both of high quality and fits your needs.

Important: note that at this stage of simple exploration and reckoning, you should not be reading and fully understand the papers. This search for the right paper should be done just by skimming over the papers and using your instinct to detect the garbage (this comes with experience).

3-如何阅读科技论文

这个话题已经讨论过很多次,因此不打算继续写很多关于这个问题的探讨。一个好的出发点可以参见这篇文章:How to Read a Paper ,下面的一些观点是我在阅读科技论文时候发现的有用的观点。

3.1-找到正确的论文

原始论文就是你想实现的是什么,描述的是这个领域的开端。有时候也可以选择相似的论文,前提是你觉得这篇论文是能带来算法上本质的提升并且有良好的一致性却不成熟的创新的文章。

那么让我告诉你以该论文作为出发点。你需要在该论文相关领域做些研究。对此,方法就是寻找相关联的论文,即是那些在该论文引用部分里面所提到的论文。再去Google Scholar依据论文题目和作者搜索。你找到的论文是否比你现在看的论文所做的效果更好呢?如果是,那么就去研究找到的新论文。Google Scholar还有个优点就是能标注论文的引用。这个非常棒,因为你能从一篇论文的引用链接找到另外的论文,这样可以找到某个研究领域最新的成果。总体来说就是从当前论文的引用中寻找新的论文。通过这样来回搜索,就会找到适合你的高质量的论文。

重点:在这个搜索论文过程中,你不需要从头到尾仔细阅读理解论文。只需要大体浏览论文并用你的直觉去判断论文的价值。

3.2 – Do not read on the screen

Print the publication on hard paper and read the paper version. Also, do not reduce the size in order to print more on each page. Yes, you will save three sheets of paper, but you will lose time as you will get tired faster reading these tiny characters. Good font size for reading is between 11 and 13 points.

3.3 – Good timing and location

Do not read a paper in the middle of the night, do it at a moment of the day when your brain is still fresh. Also, find a quiet area, and use good lighting. When I read, I have a desk lamp pointing directly at the document.

3.4 – Marker and notes

Highlight the important information with a marker, and take notes in the margin of whatever idea that pops in your head as you read.

3.2 – 不要在屏幕上阅读

打印成纸质阅读,打印时不要降低字体大小,如果你降低了字体大小可能节省了几页纸却使你在阅读时因为眼睛更易疲劳而浪费了更多的时间,字体大小一般建议在11pt-13pt

3.3 –选择恰当的阅读时间和地点

不要在半夜阅读,最好在大脑清醒时才阅读.也应该找一个安静、光线适中的地方。我通常都在台灯下阅读。

3.4 – 注意标记和备注

标记重要信息,并在重要信息旁写下阅读时脑袋中浮现的各种想法。

3.5 – Know the definitions of all the terms

When you are used to read mostly news articles and fiction, your brain is trained to fill-in meaning for words that you do not know, by using context as a deduction device. Reading scientific publications is a different exercise, and one of the biggest mistake is to assume false meaning for a word. For instance in this sentence “The results of this segmentation approach still suffer from blurring artifacts”. Here two words, “segmentation”, and “artifacts”, have a general meaning in English, but also have a particular meaning in the domain of Computer Vision. If you do not know that these words have a particular meaning in this paper, then while reading without paying attention, your brain will fill-in the general meaning, and you might be missing some very important information. Therefore you must (i) avoid assumptions about words, and whenever in doubt look up the word in the context of the domain the publication was written, and (ii) write a glossary on a piece of paper of all the concepts and vocabulary specific to the publication that you did not know before. If you encounter for the first time concepts such as “faducial points” and “piece-wise affine transform”, then you should look-up their precise definitions and write them down in your glossary. Concepts are language-enabled brain shortcuts, and allow you to understand the intent of the authors faster.

3.5 知道所有术语的定义

通常你阅读大多数文章和小说时,对于生词你的大脑会象计算机一样根据上下文自动得出意思。但是科技出版物完全不同,其中最多的错误就是单词意思的错误假设。比如说句子“The results of this segmentation approach still suffer from blurring artifacts”中“segmentation”和“artifacts”通常有着“分割”和“手工品”的意思,但是在计算机视觉领域有着特殊的意思。如果你阅读这篇论文时没有注意这个,你的大脑用了他们的通常意义去理解,你会漏掉很重要的信息。因此你必须(1)避免假设单词的意思,任何时候有疑问需查阅专业词典(2)抄写一张你不熟悉的专业生词概念表做参考。如果你遇到概念比如“faducial points”和“piece-wise affine transform”,首先你应该查出他们的准确的专业含义,然后加在你的生词表里。概念是语言式的大脑快捷键可以帮助你更快理解作者的意图。

3.6 – Look for statistical analysis in the conclusion

If the authors present only one curve from their algorithm and one curve from another algorithm, and say “look, it’s 20% more accurate”, then you know you’re reading garbage. What you want to read is: “Over a testing set of N instances, our algorithm shows significant improvement with a p-value of 5% using a two-sample t-test.” The use of statistical analysis shows a minimum of driving from the author, and is a good proof that the results can be trusted for generalization (unless the authors lied to make their results look more sexy, which can always happen).

3.7 – Make sure the conclusions are demonstrating that the paper is doing what you need

Let’s say you want an algorithm that can find any face in a picture. The authors of the paper say in the conclusion that their model was trained using 10 poses from 80 different people (10 x 80 = 800 pictures), and that the accuracy of face detection with the training set was 98%, but was only 70% with the testing set (picture not used during training). What does this mean? This means that apparently, the algorithm has issues to generalize properly. It performs well when used on the training set (which is useless), and perform worse when used in real-world cases. What you should conclude at this point is that maybe, this paper is not good enough for what you need.

3.6 – 关注结论中的统计分析

如果论文作者仅仅用曲线展示他的算法和其他算法的差异,并且说“你看,比原有算法准确了20%”,这种论文就一垃圾论文. 你应该阅读的是那种在论文中说: “在N个实例的测试集中,我们的算法在两个样本测试中检验值提升了5%(这句真不好翻译)” 统计结论更多的证据应该来自数据而不是作者的个人想法 (除非作者撒谎夸大他的结论).

3.7 – 请确定论文所展示的结论是你所需要的

假设你需要一个能够从图片中人脸识别的算法.作者在论文结论中说他的模型用80个人不同照片(10x 80 = 800张图片)进行训练, 对训练集进行测试时人脸识别准确率达到98%,但是对测试集(测试的图片没有被用于训练集)进行人脸识别时准确率只有70%。这样的描述是什么意思呢?它以意味着该算法仍然存在问题(这句真不好翻译)——该算法当用于对训练集进行识别时是非常不错的(但种情况没有任何实际意义),但是用于对测试集进行识别时就会非常糟糕。你也可以通过这点,来判断一篇论文是否是你所需的

3.8 – Pay attention to the input data used by the authors

If you want to perform face detection with a webcam, and the authors have used pictures taken with a high-definition camera, then there are chances that the algorithm will not perform as well in your case as it did for the authors. Make sure that the algorithm was tested on data similar to yours or you will end up with a great implementation that is completely unusable in your real-world setup.

3.9 – Authors are humans

The authors are humans, and therefore they make mistakes. Do not assume that the authors are absolutely right, and in case an equation is really hard to understand or follow, you should ask yourself whether or not the authors made a mistake there. This could just be a typo in the paper, or an error in the maths. Either case, the best way to find out is to roll out the equations yourself, and try to verify their results.

3.8 注意作者用的输入数据

如果你想用网络摄像头做脸部识别,而作者的输入数据用的是出自高清相机的相片,那么算法结果可能会完全不同。必须保证你的输入数据近似与算法测试数据,否则会造成算法完全无用在你的case。

3.9 作者也是人

人都会犯错,不要认为作者是神。特别是算式真的很难理解,你应该问问你自己是否作者写错了。这可能仅仅是论文的抄写错误或一个数学计算错误。不管怎么说,最好的方法是自己做做计算去验证结果。


3.10 – Understand the variables and operators

The main task during the implementation of a publication is the translation of math equations in the paper into code and data. This means that before jumping into the code, you must understand 100% of the equations and processes on these equations. For instance, “C = A . B” could have different meaning. A and B could be simple numbers, and the “.” operator could simply be a product. In that case, C would be the product of two numbers A and B. But maybe that A and B are matrices, and that “.” represents the matrix product operator. In that case, C would be the product matrix of the matrices A and B. Yet another possibility is that A and B are matrices and that “.” is the term-by-term product operator. In that case, each element C(i,j) is the product of A(i,j) and B(i,j). Notations for variables and operators can change from one mathematical convention to another, and from one research group to another. Make sure you know what each variable is (scalar, vector, matrix or something else), and what every operator is doing on these variables.

3.11 – Understand the data flow

A paper is a succession of equations. Before you start coding, you must know how you will plug the output of equation N into the input of equation N+1.

3.10-理解变量与运算符

实现算法的主要的工作量在于将纸面上的数学算式转化为代码以及数据。这意味着在转化完毕之前,你必须百分百理解那些数学式子到底是在做什么。例如“C=A.B”可能有很多意义。A和B

可以是简单的数字,而“.”是一个数字运算符号。这种情况下,C的结果可以是A和B的和。但是A和B也有可能是两个矩阵,这种情况下,“.”又可以作为一个矩阵运算符存在。这种情况下,C是A和B两个矩阵运算的结果矩阵。唉……其实还有个可能,A,B都是矩阵,然后“.”是一个矩阵加法运算符,这种情况下,每一个C(i,j)都是A(i,j)和B(i,j)的和。变量和符号到底代表什么意思(矩阵、向量、数字、坐标,还是其他?),这值得你好好研究下。(译者注:其中关于矩阵的运算符号,我不清楚这几个英文单词都对应哪些符号……)

3.11-弄清数据处理流程

论文就是一堆方程的大杂烩。在你开始写代码之前,你需要了解如何把N根据要求变成N+1

4 – Prototyping

Once you have read and understood the paper, it’s time to create a prototype. This is a very important step and avoiding it can result in wasted time and resources. Implementing a complex algorithm in languages such as C, C++ or Java can be very time consuming. And even if you have some confidence in the paper and think the algorithm will work, there is still a chance that it won’t work at all. So you want to be able to code it as quickly as possible in the dirtiest way, just to check that it’s actually working.

4.1 – Prototyping solutions

The best solution for that is to use a higher level versatile language or environment such as Matlab, R, Octave or SciPy/NumPy. It is not that easy to represent a mathematical equation in C++ and then print the results to manually check them. On the contrary, it is extremely straightforward to write equations in Matlab, and then print them. What would take you two to three weeks in C++ will take take you two days in Matlab.

4-模型

一旦读懂了论文,接下来你应当设计一个模型。这是一个非常重要的步骤,通过这一步你可以避免大量的资源以及时间的浪费。要把一个算法用c,c++或者java写出来是非常费时间的。即使你认为自己对论文有足够的了解而且算法肯定会按照你所想的去工作,这里仍然有个问题:如果他完全不动弹呢?所以如果你想尽快将算法转化为代码,还是先建立一个模型检验一下他能否工作吧。

4.1-模型建立方法

最好的方法当然是使用高级通用语言来建立,比如matlab,R,Octave或者SciPy/NumPy。像c++这样的工程语言一般不太容易用于描述一个数学方程并输出结果以进行人工检查。相反的,在matlab上这就很容易实现。你用c++两三周才可以干完的事情或许matlab上面只需要两天。

4.2 – Prototyping helps the debugging process

An advantage of having a prototype is that when you will have your C++ version, you will be able to debug by comparing the results between the Matlab prototype and the C++ implementation. This will be developed further in the “Debugging” section below.

4.3 – Wash-off implementation issues beforehand

You will certainly make software design mistakes in your prototype, and this is a good thing as you will be able to identify where are the difficulties with both the processes or data. When you will code the C++ version, you will know how to better architect the software, and you will produce way cleaner and more stable code than you would have without the prototyping step (this is the “throw-away system” idea presented by Frederick Brooks in The Mythical Man-Month).

4.4 – Verify the results presented in the paper

Read the “Experiment” section of the paper carefully, and try to reproduce the experimental conditions as closely as possible, by using test data as similar as possible to the ones used by the authors. This increases your chances of reproducing the results obtained by the authors. Not using similar conditions can lead you to a behavior of your implementation that you might consider as an error, whereas you are just not feeding it with the correct data. As soon as you can reproduce the results based on similar data, then you can start testing it on different kinds of data.

4.2-用模型进行调试

模型的另外一个应用是进行调试,特别是当你开发出你的c++版本以后,你可以通过你的matlab版本的输出与c++版本进行对比。对该条的进一步描述请参看“调试”(译者注:“7-调试”)章节。

4.3-模型可以精实你的程序

你肯定会在模型建立当中遇到一些问题。这是好事情,因为你可以从中发现数据处理的难点在哪里。当你编写c++代码的时候你就知道如何更好的构造程序,因而使他更加的可靠简洁。而如果你没有经历过模型构造这一步,你的代码大概会走向相反方向(这一点可以从Frederick Brooks的《人月神话》中找到)。

4.4-检验文章中的试验结果

请仔细阅读“试验”(译者注:我没有找到相关章节)这一章节,然后尝试使用论文中给出的数据重复试验。这将会使你可能看到作者文章中所描述的结果。如果不能较好的重复文章中的试验,你将在试验中看到不一样的结果,让你疑心是否是文章有问题。而其实你只要较好的重复试验并输入文章中的试验数据你就可以看到文章所描述的结果。然后你就可以开始测试其他不同的数据了。

5 – Choose the right language and libraries

At this stage, you must have a clear understanding of the algorithm and concepts presented in the publication, and you must have a running prototype which convinces that the algorithm is actually working on the input data you wish to use in production. It is now time to go into the next step, which consists in implementing the publication with the language and framework that you wish to use in production.

5.1 – Pre-existing systems

Many times, the production language and libraries are being dictated by pre-existing systems. For instance, you have a set of algorithm for illumination normalization in a picture, in a library coded in Java, and you want to add a new algorithm from a publication. In that case, obviously, you are not going to code this new algorithm in C++, but in Java.

5 - 正确地选择实现语言和库

在这一阶段,你必须已对论文提出的算法和概念有了一个清晰的认识,而且必须能有一个可运行原型程序,以确认算法能够在你预想的输入数据下,正常的工作。好的,进入下一步,确定你要使用哪种语言和框架来实现论文的算法。

5.1 - 现有体系

大多数情况,现有体现已经指定了实现语言和库。实际的例子:你有一套Java实现的计算图像照度归一化的算法库,而你想要将论文的新算法添加进库。这种情况,很显然,你要使用Java来实现新算法,而不是C++。

5.2 – Predicted future uses of the implementation

In the case there is no pre-existing system imposing you a language, then the choice of the language should be done based upon the predicted uses of the algorithm. For example, if you believe that within four to six months, a possible port of your application will be done to the iPhone, then you should choose C/C++ over Java as it would be the only way to easily integrate the code into an Objective-C application without having to start everything from scratch.

5.3 – Available libraries that solve fully or partly the algorithm

The available libraries in different languages can also orient the choice of the production language. Let’s imagine that the algorithm you wish to implement makes use of well-known algebra techniques such as principal component analysis (PCA) and single value decomposition (SVD). Then you could either code PCA and SVD from scratch, and if there is a bug could end up debugging for a week, or you could re-use a library that already implements these techniques and write the code of your implementation using the convention and Matrix class of this library. Ideally, you should be able to decompose your implementation into sub-tasks, and try to find libraries that already implement as many of these sub-tasks as possible. If you find the perfect set of libraries that are only available for a given language, then you should pick that language. Also, note that the choice of libraries should be a trade-off between re-using existing code and minimizing dependencies. Yes, it is good to have code for every sub-task needed for your implementation, but if that requires to create dependencies over 20 different libraries, then it might be not very practical and can even endanger the future stability of your implementation.

5.2 - 前瞻算法的应用场景

在没有现有体系限制你选择语言的情况下,则要基于对该算法在将来的使用场景的前瞻来选择实现语言。比如,如果你坚信在四到六个月后,你的应用程序将会实现面向iPhone的接口,那你就应该更倾向于选择C/C++,而不是Java,因为如此可以很简单的将算法代码集成的Objective-C的应用中,而不需要重头来过。

5.3 - 参考能够解决或部分解决算法问题的现有库

不同语言实现的现有库也会影响我们对实现语言的选择。让我们来想象一下,你想要实现的算法使用了诸如主成分分析(PCA)以及单值分解(SVD)等经典的代数技巧。那你是重头编码实现PCA和SVD算法,遇到bug,可能还要停滞一周来调试自己的代码?还是使用现有的,已经实现了这些功能的库,使用他们提供的常规接口和矩阵类来实现自己的编码?理论上讲,你应该将你的实现工作分解为若干个子任务,然后尝试尽可能多的寻找能够解决这些子任务问题的现有库。如果你发现有一套完美解决你的问题的库,且仅有某一语言版本提供,那么,就使用这种语言。同时也要注意,在选择库的时候,需要在代码的重用及最小依赖之间做出权衡。没错,如果有能够解决你所有子任务问题的代码是再好不过了,但是如果这需要依赖超过20各不同的库,那实在是不太实用,甚至危及你的实现代码将来的稳定性。

6 – Implementation

Here are some tips from my experience in implementing publications

6.1 – Choose the right precision

The type you will use for your computation should be chosen carefully. It is generally way better to use double instead of float. The memory usage can be larger, but the precision in the calculation will greatly improve, and is generally worth it. Also, you should be aware of the differences between 32-bit and 64-bit systems. Whenever you can, create your own type to encapsulate the underlying type (float or double, 32-bit or 64-bit), and use this type in your code. This can be done with a define is C/C++ or a class in Java.

6.2 – Document everything

Although it is true that over-documenting can slow down a project dramatically, in the case of the implementation of a complex technical paper, you want to comment everything. Even if you are the only person working on the project, you should document your files, classes and methods. Pick a convention like Doxygen or reStructuredText, and stick to it. Later in the development, there will be a moment where you will forget how some class works, or how you implemented some method, and you will thank yourself for documenting the code!

6 – 实现

实现算法的一些个人经验

6.1 – 选择正确的精度

通常最好用double代替float,虽然使用double内存会占用更多,但却改善了精度,这是值得的.另外,你需要注意32bit和64bit操作系统之间的不同处.如果可以,自己创建一个封装(封装了float或者double,32-bit或64-bit)类型,然后在代码中使用封装了的数据类型,在c/c++中可以使用define实现,java中可以使用class来实现.

6.2 – 为代码做好注释

虽然文档维护会显著降低项目进度,但在实现一个复杂的技术论文时,你最好做好文档.即使项目只有你一个人,你也应该对你的文件、类、方法进行注释说明.选择一种转换方法如Doxygen或reStructuredText,并坚持使用它.在以后的开发中,你可能会忘记某个class是如何使用的或者某个方法的实现方式,你就会觉得当初的文档太有用了.

6.3 – Add references to the paper in your code

For every equation from the paper that you implement, you need to add a comment citing the paper (authors and year) and either the paragraph number or the equation number. That way, when later re-reading the code, you will be able to connect directly the code to precise locations in the paper. These comments should look like:

// See Cootes et al., 2001, Equation 2.3
// See Matthews and Baker, 2004, Section 4.1.2

6.4 – Avoid mathematical notations in your variable names

Let’s say that some quantity in the algorithm is a matrix denoted A. Later, the algorithm requires the gradient of the matrix over the two dimensions, denoted dA = (dA/dx, dA/dy). Then the name of the the variables should not be “dA_dx” and “dA_dy”, but “gradient_x” and “gradient_y”. Similarly, if an equation system requires a convergence test, then the variables should not be “prev_dA_dx” and “dA_dx”, but “error_previous” and “error_current”. Always name things for what physical quantity they represent, not whatever letter notation the authors of the paper used (e.g. “gradient_x” and not “dA_dx”), and always express the more specific to the less specific from left to right (e.g. “gradient_x” and not “x_gradient”).

6.3 - 在你的代码中添加引用

对于来自你所要实现的论文的每一条方程,你需要添加一段注释引用这篇论文(作者和时间)以及段落号码或方程号码。这样,以后重读代码时,你可以直接把代码和论文中的准确位置联系起来。这些注释就像:
// 参见 Cootes et al., 2001, 方程 2.3
// 参见 Matthews and Baker, 2004, 章节 4.1.2

6.4 - 在变量名中避免数学符号

我们假设一些数量在算法中是一个矩阵表示为A。然后,算法要求矩阵在两个维度上的梯度,表示位 dA = (dA/dx, dA/dy)。那么变量名不应当为"dA_dx"和"dA_dy",而是"gradient_x"和"gradient_y"。相似地,如果一个方程系统要求收敛判定,那么变量不应为"prev_da_dx"和"dA_dx",而是"error_previous"和"error_current"。把变量永远命名为它们所表示的物理数量,而不是论文作者使用的任何字母符号(例如"gradient_x"而不是"dA_dx"),并且永远要把更具体的表示写在左边(例如"gradient_x"而不是"x_gradient")。

6.5 – Do not optimize during the first pass

Leave all the optimization for later. As you can never be absolutely certain which part of your code will require optimization. Every time you see a possible optimization, add a comment and explain in a couple of lines how the optimization should be implemented, such as:

// OPTIMIZE HERE: computing the matrix one column at a time
// and multiplying them directly could save memory

That way, you can later find all the locations in your code at which optimizations are possible, and you get fresh tips on how to optimize. Once your implementation will be done, you will be able to find where to optimize by running a profiler such as Valgrind or whatever is available in the programming language you use.

6.6 – Planning on creating an API?

If you plan on using your current code as a basis for an API that will grow with time, then you should be aware of techniques to create interfaces that are actually usable. For this, I would recommend the “coding against the library” technique, summarized by Joshua Bloch in his presentation How to Design a Good API and Why it Matters.

6.5-不要一开始就优化

请把优化这种事情往后面放放。否则你绝对不可能确定你的代码到底是哪里出了问题。而每当你进行优化,你最好添加一些注释来进行解释,例如:

//优化:一次计算矩阵一列以减少内存开销

通过这种方式,不论是什么语言,你都可以很快的从你的代码中发现到底是哪处优化导致了问题。

6.6-打算创建一个api?

如果你打算把你的代码构建成一个api,你需要注意的是如何如何创建接口。对于这一点,我建议你使用Joshua Bloch How to Design a Good API and Why it Matters书中的“面向函数库编程”(译者注:原文“coding against the library”真的不知道怎么翻译……)技术。

7 – Debugging

Implementing a new algorithm is like cooking a dish you never ate before. Even if it tastes kind of good, you will never know if this is what it was supposed to taste. Now we are lucky, since unlike for cooking, software development has some helpful trick to increase the confidence we have in an implementation.

7.1 – Compare results with other implementations

A good way to wash out the bugs is to compare the results of your code with the results of an existing implementation of the same algorithm. As I assume that you did correctly all the tasks in the “But before you jump” section presented above, you did not find any available implementation of the algorithm (or else you would have used it instead of implementing the paper!). As a consequence, the only other implementation that you have at this stage is the prototype that you programmed earlier.

7 - 调试

实现一个新算法如同烧一道你之前从未吃过的菜。即使它味道不错,你也不知道这是不是它原本的味道。和烧菜不同的是,软件开发有一些有用的诀窍来增添我们在实施中的信心,所以我们还算运气不错。

7.1 - 和其他实现比较结果

一个淘出bug的好方法就是把你代码的结果和同一个算法的已有的实现的结果进行比较。假设你正确完成了上面“着手之前”章节提及的所有任务,你会发现该算法还没有任何可行的实现(否则你就该直接用它,不必去实现那篇论文)。所以,你眼下的另一种实现只能是你以前编写的原型。

The idea is therefore to compare the results of the prototype and the production implementation at every step of the algorithm. If the results are different, then one of the two implementations is doing something wrong, and you must find which and why. Precision can change (the prototype can give you x = 1.8966 and the production code x = 1.8965), and the comparison should of course take this into account.

7.2 – Talk with people who have read the paper

Once all the steps for both implementations (prototype and production) are giving the exact same results, you can gain some confidence that your code is bug free. However, there is still a risk that you made a mistake in your understanding of the paper. In that case, both implementations will give the same results for each step, and you will think that your implementations are good, whereas this just proves that both implementations are equally wrong. Unfortunately, there is no way that I know of to detect this kind of problems. Your best option is to find someone who has read the paper, and ask that person questions regarding the parts of the algorithm you are not sure about. You could even try to ask the authors, but your chances to get an answer are very low.

这个想法就是在算法的每一步比较原型和产品实现的结果。如果结果不同,那么两者其一出了错,你得找到哪个出了错以及出错的原因。精确度可能变化(原型给出x=1.8966但产品却给出x=1.8965),比较时也要将此考虑进来。

7.2 - 和读过论文的人谈谈

一旦两个实现(原型和产品)的所有步骤给出相同的结果,你就可以在一定程度上相信你的代码没有bug。但是,还存在你对论文理解错误的风险。在这种情形下,两个实现都会在每一步给出相同的结果,你会认为你的实现还不错,然而这只能证明两个实现同样都是错的。不幸的是,我也不知道该如何察觉到这种错误。最好的选择是找一个读过这篇论文的人,并问他关于算法中你不确定的部分的问题。你甚至可以设法询问作者,但你得到回答的几率非常低。


7.3 – Visualize your variables

While developing, it is always good to keep an eye on the content of the variables used by the algorithm. I am not talking about merely printing all the values in the matrices and data you have, but finding the visualization trick adapted to any variable in your implementation. For instance, if a matrix is suppose to represent the gradient of an image, then during the coding and debugging, you should have a window popping up and showing that gradient image, not just the number values in the image matrix. That way, you will associate actual an image with the data you are handling, and you will be capable of detecting when there is a problem with one of the variables, which in turn will indicate a possible bug. Inventive visualization tricks include images, scatter plots, graphs, or anything that is not just a stupid list of 1,000 numbers and upon which you can associate a mental image.

7.3 - 可视化你的变量

开发时,你应该留意算法中用到的变量的内容。我不是说仅仅打印出你所有矩阵和数据的值就可以了,而是找到适合你实现中的任何变量的可视化诀窍。例如,如果一个矩阵用来表示一张图片的梯度,那么在编码和调试过程中,你应当有一个弹出窗口,展示出那张梯度图,而不只是图片矩阵中的数值。这样,你就会把真实的图片和你要处理的数据联系起来,并且在变量出现问题时,你可以察觉到,这可能将是一个bug。创造性的可视化诀窍包括图像、散点图、曲线图,但绝不是一张包含1000个数字的愚蠢列表,然后你必须再依靠这张表在脑海中呈现出图像。

7.4 – Testing dataset

Generating data to experiment with your implementation can be very time consuming. Whenever you can, try to find databases (face database, text extract databases, etc.) or tools for generating such data. If there are none, then do not lose time generating 1000 samples manually. Code a quick data generator in 20 lines and get done with it.

Conclusion

In this article, I have presented good practices for the implementation of a scientific publication. Remember that these are only based on my personal experience, and that they should not be blindly followed word for word. Always pay attention when you read and code, and use your judgement to determine which of the guidelines presented above fit your project. Maybe some of the practices will hurt your project more than it will help it, and that’s up to you to find out.

Now go implement some cool algorithm!

7.4 - 测试数据集

生成数据来试验你的实现可能非常耗费时间。无论何时,只要你能,就设法去找到数据库(人脸数据库、原文摘录数据库等等)或者生成这种数据的工具。如果没有,那么不要把时间牺牲在手动生成1000个样例数据上。编写一个20行以内的快速数据生成器,再用它干活。

总结

这篇文章中,我已经展示了如何实现科学出版物的良好实践。记住,这些仅仅基于我的经验,不能一字一字地盲目效仿。当你阅读和编码时,永远要小心,并用你的判断力来决定上面哪条准则适合你的项目。也许有些做法对你项目的伤害会比好处更多,这就靠你去发现了。

现在,去实现一些很酷的算法吧!

Comments