This book presents statistical models that have recently been developed within several research communities to access information contained in text collections. The problems considered are linked to applications aiming at facilitating information access:
- information extraction and retrieval;
- text classification and clustering;
- opinion mining;
- comprehension aids (automatic summarization, machine translation, visualization).
To give the reader as complete a description as possible, the focus is placed on the probabilistic models used in these applications, highlighting the relationship between models and applications and illustrating the behavior of each model on real collections.
Textual Information Access is organized around four themes: information retrieval and ranking models, classification and clustering (logistic regression, kernel methods, Markov fields, etc.), multilingualism and machine translation, and emerging applications such as information exploration.
Contents
Part 1: Information Retrieval
1. Probabilistic Models for Information Retrieval, Stéphane Clinchant and Eric Gaussier.
2. Learnable Ranking Models for Automatic Text Summarization and Information Retrieval, Massih-Réza Amini, David Buffoni, Patrick Gallinari, Tuong Vinh Truong and Nicolas Usunier.
Part 2: Classification and Clustering
3. Logistic Regression and Text Classification, Sujeevan Aseervatham, Eric Gaussier, Anestis Antoniadis, Michel Burlet and Yves Denneulin.
4. Kernel Methods for Textual Information Access, Jean-Michel Renders.
5. Topic-Based Generative Models for Text Information Access, Jean-Cédric Chappelier.
6. Conditional Random Fields for Information Extraction, Isabelle Tellier and Marc Tommasi.
Part 3: Multilingualism
7. Statistical Methods for Machine Translation, Alexandre Allauzen and François Yvon.
Part 4: Emerging Applications
8. Information Mining: Methods and Interfaces for Accessing Complex Information, Josiane Mothe, Kurt Englmeier and Fionn Murtagh.
9. Opinion Detection as a Topic Classification Problem, Juan-Manuel Torres-Moreno, Marc El-Bèze, Patrice Bellot and Frédéric Béchet.
Edited by: Eric Gaussier, François Yvon
Imprint: ISTE Ltd and John Wiley & Sons Inc
Country of Publication: United Kingdom
Volume: 588
Dimensions: Height: 241mm, Width: 163mm, Spine: 29mm
Weight: 794g
ISBN: 9781848213227
ISBN 10: 1848213220
Pages: 448
Publication Date: 01 May 2012
Audience: General/trade, ELT Advanced
Format: Hardback
Publisher's Status: Active
Introduction, Eric Gaussier and François Yvon
PART 1: INFORMATION RETRIEVAL
Chapter 1. Probabilistic Models for Information Retrieval, Stéphane Clinchant and Eric Gaussier
1.1. Introduction
1.3. Probability ranking principle (PRP)
1.4. Language models
1.5. Informational approaches
1.6. Experimental comparison
1.7. Tools for information retrieval
1.8. Conclusion
1.9. Bibliography
Chapter 2. Learnable Ranking Models for Automatic Text Summarization and Information Retrieval, Massih-Réza Amini, David Buffoni, Patrick Gallinari, Tuong Vinh Truong, and Nicolas Usunier
2.1. Introduction
2.2. Application to automatic text summarization
2.3. Application to information retrieval
2.4. Conclusion
2.5. Bibliography
PART 2: CLASSIFICATION AND CLUSTERING
Chapter 3. Logistic Regression and Text Classification, Sujeevan Aseervatham, Eric Gaussier, Anestis Antoniadis, Michel Burlet, and Yves Denneulin
3.1. Introduction
3.2. Generalized linear model
3.3. Parameter estimation
3.4. Logistic regression
3.5. Model selection
3.6. Logistic regression applied to text classification
3.7. Conclusion
3.8. Bibliography
Chapter 4. Kernel Methods for Textual Information Access, Jean-Michel Renders
4.1. Kernel methods: context and intuitions
4.2. General principles of kernel methods
4.3. General problems with kernel choices (kernel engineering)
4.4. Kernel versions of standard algorithms: examples of solvers
4.5. Kernels for text entities
4.6. Summary
4.7. Bibliography
Chapter 5. Topic-Based Generative Models for Text Information Access, Jean-Cédric Chappelier
5.1. Introduction
5.2. Topic-based models
5.3. Topic models
5.4. Term models
5.5. Similarity measures between documents
5.6. Conclusion
5.7. Appendix: topic model software
5.8. Bibliography
Chapter 6. Conditional Random Fields for Information Extraction, Isabelle Tellier and Marc Tommasi
6.1. Introduction
6.2. Information extraction
6.3. Machine learning for information extraction
6.4. Introduction to conditional random fields
6.5. Conditional random fields
6.6. Conditional random fields and their applications
6.7. Conclusion
6.8. Bibliography
PART 3: MULTILINGUALISM
Chapter 7. Statistical Methods for Machine Translation, Alexandre Allauzen and François Yvon
7.1. Introduction
7.2. Probabilistic machine translation: an overview
7.3. Phrase-based models
7.4. Modeling reorderings
7.5. Translation: a search problem
7.6. Evaluating machine translation
7.7. State-of-the-art and recent developments
7.8. Useful resources
7.9. Conclusion
7.10. Acknowledgments
7.11. Bibliography
PART 4: EMERGING APPLICATIONS
Chapter 8. Information Mining: Methods and Interfaces for Accessing Complex Information, Josiane Mothe, Kurt Englmeier, and Fionn Murtagh
8.1. Introduction
8.2. The multidimensional visualization of information
8.3. Domain mapping via social networks
8.4. Analyzing the variability of searches and data merging
8.5. The seven types of evaluation measures used in IR
8.6. Conclusion
8.7. Acknowledgments
8.8. Bibliography
Chapter 9. Opinion Detection as a Topic Classification Problem, Juan-Manuel Torres-Moreno, Marc El-Bèze, Patrice Bellot, and Frédéric Béchet
9.1. Introduction
9.2. The TREC and TAC evaluation campaigns
9.3. Cosine weights - a second glance
9.4. Which components for opinion vectors?
9.5. Experiments
9.6. Extracting opinions from speech: automatic analysis of phone polls
9.7. Conclusion
9.8. Bibliography
Appendix A. Probabilistic Models: An Introduction, François Yvon
A.1. Introduction
A.2. Supervised categorization
A.3. Unsupervised learning: the multinomial mixture model
A.4. Markov models: statistical models for sequences
A.5. Hidden Markov models
A.6. Conclusion
A.7. A primer of probability theory
A.8. Bibliography
List of Authors
Index
Eric Gaussier is deputy director of the Grenoble Informatics Laboratory, one of the largest computer science laboratories in France. François Yvon is a professor of Computer Science at the University of Paris Sud in Orsay and a member of the Spoken Language Processing group of LIMSI/CNRS, Paris, France.