Public:CeON Seminar

CeON seminars take place on Wednesdays at 14:00 in room 405 at the ICM headquarters, on the fourth floor of the building at ulica Prosta 69 (entrance from ul. Prosta; the elevator is opposite the reception desk).

A public Google Calendar of the seminars is available here.

The seminar has its own mailing list.

If you...

  • would like to present your work at our seminar, or
  • would like to be added to the seminar mailing list, or
  • have any other organizational matter related to the seminar

... please write to seminarium@ceon.pl.

Upcoming meetings

The current ranking of texts for the Journal Club is available at this address. The ranking is generated automatically, but publishing it requires manually "pulling the trigger".

Wednesday, 24 September 2014, 14:00 (room 538, 5th floor)
Presentation
Marcin Kosiński & Przemysław Biecek
The R package `archivist`: archiving data and plots
More information: https://github.com/pbiecek/archivist, http://smarterpoland.pl/index.php/2014/09/lazy-load-with-archivist/


Past meetings

Thursday, 22 May 2014, 14:00 (NOTE: non-standard date!)
Journal Club
A review of selected articles on data presentation, based on "Points of view" from Nature Methods
Texts discussed:
Led by Przemysław Biecek (ICM UW)


Wednesday, 7 May 2014
Journal Club
David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani (2014) The Parable of Google Flu: Traps in Big Data Analysis, Science
Abstract: In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?
Alternative link to the text
Additional reading:
Moderator: Mateusz Kobos (ICM UW)


Wednesday, 23 April, 14:00
Presentation
Using the Shapley value and other power indices of cooperative games in network analysis
Szymon Matejczyk (IPI PAN)
Abstract: Centrality analysis is one of the most important problems in network analysis. In recent years, traditional methods of measuring centrality have been extended with centralities computed as the Shapley value of specially defined cooperative games on networks. The resulting centralities account in a particular way for the dependencies between individual nodes and for their joint action. In my talk I will briefly introduce the concepts from cooperative game theory used in this approach, and then show how they can be applied, the computational complexity of these methods, and their practical applications.


Tuesday, 8 April 2014, 14:00-16:00 (NOTE: non-standard date!)
Joint session with the ICM Scientific Colloquium
Using bibliometric techniques in applied research
Krzysztof Klincewicz (WZ UW)
Abstract: The meeting will cover bibliometrics as a basis for applied research, in which processing information about publications and patents can be a starting point for generating ideas for new technical solutions or for improving an organization's strategy. The approaches presented will include LBD (literature-based discovery), TRIZ and tech mining, along with practical examples of bibliometrics applied in business. A guide published by the Ministry of Science and Higher Education on the applications of bibliometrics in managing research and technologies can serve as an introduction to the meeting (http://www.nauka.gov.pl/g2/oryginal/2013_05/6c4cdc1d79e308a8377e3a4bc06e3d21.pdf).
Slides


Wednesday, 2 April 2014
Presentation
Introduction to Apache Spark
Mateusz Fedoryszak and Michał Oniszczuk (ICM)
Abstract: The presentation will be an introduction to Apache Spark, a tool regarded as the successor to Hadoop. We will present the computation model it implements and discuss practical aspects, including how to use Spark on the CeON cluster. We will partly draw on the paper: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica (2012) "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" NSDI 2012
Slides: available in .key, .pptx, .ppt and .pdf formats


Wednesday, 12 March 2014
Presentation
The Hirsch index and related topics
Marek Gągolewski (IBS PAN)
Abstract:
* The problem of evaluating producers of information resources (the model), or the h-index as a tool not only for bibliometricians: YouTube, Twitter, R packages and the factory conveyor belt
* The h-index from the viewpoint of aggregation theory and fuzzy (non-monotonic) integrals; generalizations; an overview of axiomatic properties, or why we cannot demand too much
* Probabilistic and statistical properties of the h-index, or when h = 12 does not differ significantly from h = 16
* Orderings and ranking: a theorem on impossible "fairness", manipulations, sensitivity analysis, or how (and when) it is easy to adjust the resulting rankings to "one's own needs".


Break


Wednesday, 29 May 2013
Journal Club
Two texts:
Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
Leah R. Jager, Jeffrey T. Leek (2013) Empirical estimates suggest most published medical research is true
Moderator: Witold Rudnicki (ICM UW)


Wednesday, 15 May 2013
Journal Club
A screening of Yoshua Bengio's lecture "Representation Learning". Overview articles on the topic are available in the New York Times and The New Yorker.
Moderator: Mateusz Kobos (ICM UW)


Wednesday, 27 March 2013
Lecture
Kambiz Badie (IT Research Faculty, Cyber-Space Research Institute, Iran)
Ontology-Driven Content Creation & Its Applications
Abstract: The significance of content creation mostly goes back to the point that due to multiplicity & high variety of decisions on cyber-environment and also the high variety of users at different levels, contents should preferably be created in a way customized to on-line conditions. In this speech, a framework is discussed which shows how a content can be created using ontologies of different sources of knowledge based on an inter-play between those of focal segments in the desired content.


Wednesday, 6 March 2013
Journal Club
Ying Zhao, George Karypis (2004) Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering, Machine Learning, Volume 55, Issue 3, pp 311-331
Abstract: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there are a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
Moderator: Michał Łopuszyński (ICM UW)


Wednesday, 20 February 2013
Journal Club
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis (2011) Dremel: Interactive Analysis of Web-Scale Datasets
Abstract: Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
Moderator: Mateusz Kobos (ICM UW)


Wednesday, 13 February 2013
Journal Club
Jessica B. Voytek, Bradley Voytek (2012) Automated cognome construction and semi-automated hypothesis generation
Abstract: Modern neuroscientific research stands on the shoulders of countless giants. PubMed alone contains more than 21 million peer-reviewed articles with 40–50,000 more published every month. Understanding the human brain, cognition, and disease will require integrating facts from dozens of scientific fields spread amongst millions of studies locked away in static documents, making any such integration daunting, at best. The future of scientific progress will be aided by bridging the gap between the millions of published research articles and modern databases such as the Allen brain atlas (ABA). To that end, we have analyzed the text of over 3.5 million scientific abstracts to find associations between neuroscientific concepts. From the literature alone, we show that we can blindly and algorithmically extract a “cognome”: relationships between brain structure, function, and disease. We demonstrate the potential of data-mining and cross-platform data-integration with the ABA by introducing two methods for semi-automated hypothesis generation. By analyzing statistical “holes” and discrepancies in the literature we can find understudied or overlooked research paths. That is, we have added a layer of semi-automation to a part of the scientific process itself. This is an important step toward fundamentally incorporating data-mining algorithms into the scientific method in a manner that is generalizable to any scientific or medical field.
Moderator: Dominika Tkaczyk (ICM UW)


Wednesday, 23 January 2013
Journal Club
Katy Börner, Luca Dall'Asta, Weimao Ke, Alessandro Vespignani (2005) Studying the Emerging Global Brain: Analyzing and Visualizing the Impact of Co-Authorship Teams. Complexity 10(4):57-67
Abstract: This paper introduces a suite of approaches and measures to study the impact of co-authorship teams based on the number of publications and their citations on a local and global scale. In particular, we present a novel weighted graph representation that encodes coupled author-paper networks as a weighted co-authorship graph. This weighted graph representation is applied to a dataset that captures the emergence of a new field of science and comprises 614 papers published by 1,036 unique authors between 1974 and 2004. In order to characterize the properties and evolution of this field we first use four different measures of centrality to identify the impact of authors. A global statistical analysis is performed to characterize the distribution of paper production and paper citations and its correlation with the co-authorship team size. The size of co-authorship clusters over time is examined. Finally, a novel local, author-centered measure based on entropy is applied to determine the global evolution of the field and the identification of the contribution of a single author's impact across all of its co-authorship relations. A visualization of the growth of the weighted co-author network and the results obtained from the statistical analysis indicate a drift towards a more cooperative, global collaboration process as the main drive in the production of scientific knowledge.
Moderator: Dominika Czerniawska


Wednesday, 12 December 2012
Journal Club
P. E. Bourne et al. (2012) Improving Future Research Communication and e-Scholarship
Abstract: Research and scholarship lead to the generation of new knowledge. The dissemination of this knowledge has a fundamental impact on the ways in which society develops and progresses, and at the same time it feeds back to improve subsequent research and scholarship. Here, as in so many other areas of human activity, the internet is changing the way things work: it opens up opportunities for new processes that can accelerate the growth of knowledge, including the creation of new means of communicating that knowledge among researchers and within the wider community. Two decades of emergent and increasingly pervasive information technology have demonstrated the potential for far more effective scholarly communication. However, the use of this technology remains limited; research processes and the dissemination of research results have yet to fully assimilate the capabilities of the web and other digital media. Producers and consumers remain wedded to formats developed in the era of print publication, and the reward systems for researchers remain tied to those delivery mechanisms. Force11 (the Future of Research Communication and e-Scholarship) is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. Individually and collectively, we aim to bring about a change in scholarly communication through the effective use of information technology. Force11 has grown from a small group of like-minded individuals into an open movement with clearly identified stakeholders associated with emerging technologies, policies, funding mechanisms and business models.
While not disputing the expressive power of the written word to communicate complex ideas, our foundational assumption is that scholarly communication by means of semantically-enhanced media-rich digital publishing is likely to have a greater impact than communication in traditional print media or electronic facsimiles of printed works. However, to date, online versions of ‘scholarly outputs’ have tended to replicate print forms, rather than exploit the additional functionalities afforded by the digital terrain. We believe that digital publishing of enhanced papers will enable more effective scholarly communication, which will also broaden to include, for example, better links to data, the publication of software tools, mathematical models, protocols and workflows, and research communication by means of social media channels. This document highlights the findings of the Force11 workshop on the Future of Research Communication and e-Scholarship held at Schloss Dagstuhl, Germany, in August 2011: it summarizes a number of key problems facing scholarly publishing today, and presents a vision that addresses these problems, proposing concrete steps that key stakeholders can take to improve the state of scholarly publishing. More about Force11 can be found at http://www.force11.org. This White Paper is a collaborative effort that reflects the input of all Force11 attendees at the Dagstuhl Workshop, and is very much a living document. We see it as a starting point that will grow and be updated and augmented by individual and collective efforts by the participants and others. We invite you to join and contribute to this enterprise.
Moderator: Łukasz Bolikowski



Wednesday, 5 December 2012
Journal Club
Henk F. Moed (2009) New developments in the use of citation analysis in research evaluation
Abstract: This paper presents an overview of research assessment methodologies developed in the field of evaluative bibliometrics, a subfield of quantitative science and technology studies, aimed to construct indicators of research performance from a quantitative statistical analysis of scientific-scholarly documents. Citation analysis is one of its key methodologies. The paper illustrates the potentialities and limitations of the use of bibliometric indicators in research assessment. It discusses the relationship between metrics and peer review; databases used as sources of bibliometric analysis; the pros and cons of indicators often applied, including journal impact factors, Hirsch indices, and normalized indicators of citation impact; and approaches to the bibliometric measurement of institutional research performance.
Moderator: Michał Bojanowski


Wednesday, 28 November 2012
Cancelled.


Wednesday, 21 November 2012
Journal Club
Jimmy Lin and Alek Kolcz (2012) Large-Scale Machine Learning at Twitter
Abstract: The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter’s integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles “traditional” data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
Moderator: Piotr Dendek


Wednesday, 14 November 2012
Journal Club
Hamish Cunningham; Norbert Fuhr; Benno Stein (2012) Challenges in Document Mining
Abstract: This report documents the programme and outcomes of the Dagstuhl Seminar 11171 "Challenges in Document Mining". Our starting point was the observation that document mining techniques are often applied in an isolated manner, with the consequence that their potential is still to be fully realised. The goal of the seminar was to analyze this untapped potential. To this end researchers from the main areas of document mining were invited to present their views, to synthesise an understanding of where and how the latest disciplinary achievements can be combined, and to develop a more integrative view on the state of the art and the prospects for future progress.
Moderator: Mateusz Kobos (ICM UW)


Wednesday, 7 November 2012
Journal Club
V. Calcagno, E. Demoinet, K. Gollner, L. Guidi, D. Ruths, C. de Mazancourt (2012) Flows of Research Manuscripts Among Scientific Journals Reveal Hidden Submission Patterns. Science
Abstract: The study of science-making is a growing discipline that builds largely on online publication and citation databases, while prepublication processes remain hidden. Here, we report results from a large-scale survey of the submission process, covering 923 scientific journals from the biological sciences in years 2006–2008. Manuscript flows among journals revealed a modular submission network, with high-impact journals preferentially attracting submissions. However, about 75% of published articles were submitted first to the journal that would publish them, and high-impact journals published proportionally more articles that had been resubmitted from another journal. Submission history affected postpublication impact: Resubmissions from other journals received significantly more citations than first-intent submissions, and resubmissions between different journal communities received significantly fewer citations.
Moderator: Paweł Szostek (ICM UW)


Wednesday, 24 October 2012
Journal Club
Staša Milojević (2012) How are academic age, productivity and collaboration related to citing behavior of researchers? arXiv:1210.3727
Abstract: References are an essential component of research articles and therefore of scientific communication. In this study we investigate referencing (citing) behavior in five diverse fields (astronomy, mathematics, robotics, ecology and economics) based on 213,756 core journal articles. At the macro level we find: (a) a steady increase in the number of references per article over the period studied (50 years), which in some fields is due to a higher rate of usage, while in others reflects longer articles and (b) an increase in all fields in the fraction of older, foundational references since the 1980s, with no obvious change in citing patterns associated with the introduction of the Internet. At the meso level we explore current (2006-2010) referencing behavior of different categories of authors (21,562 total) within each field, based on their academic age, productivity and collaborative practices. Contrary to some previous findings and expectations we find that senior researchers use references at the same rate as their junior colleagues, with similar rates of re-citation (use of same references in multiple papers). High Modified Price Index (MPI, which measures the speed of the research front more accurately than the traditional Price Index) of senior authors indicates that their research has the similar cutting-edge aspect as that of their younger colleagues. In all fields both the productive researchers and especially those who collaborate more use a significantly lower fraction of foundational references and have much higher MPI and lower re-citation rates, i.e., they are the ones pushing the research front regardless of researcher age. This paper introduces improved bibliometric methods to measure the speed of the research front, disambiguate lead authors in co-authored papers and decouple measures of productivity and collaboration.
Moderator: Wojtek Fenrich (ICM UW)


Wednesday, 17 October 2012
Journal Club
Jimmy Lin (2012) MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!. arXiv:1209.2191
Abstract: Hadoop is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails", in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is "good enough", and that instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a "real-world" production analytics environment. From this combined perspective I reflect on the current state and future of "big data" research.
Moderator: Michał Łopuszyński (ICM UW)



Wednesday, 10 October 2012
Journal Club
Raj Kumar Pan, Kimmo Kaski, Santo Fortunato (2012) World citation and collaboration networks: uncovering the role of geography in science. arXiv:1209.0781
Abstract: Modern information and communication technologies, especially the Internet, have diminished the role of spatial distances and territorial boundaries. This has enabled scientists for closer collaboration and internationalization. Nevertheless, geography remains an important factor affecting the dynamics of science. Hence, assessing the influence of spatial proximity between scientists is crucial to promote efficient collaboration strategies and, ultimately, to improve the quality of science. Here we present a systematic analysis of citation and collaboration streams between cities and countries. By assigning papers to the geographic locations of their authors' affiliations, we construct weighted networks of citations and collaborations. The citation flows as well as the collaboration strengths between cities decrease with the distance between them and follow gravity laws with exponents close to 1. Moreover, for a given number of authors, the diversity of affiliations increases the number of citations, especially when many countries are represented. In addition, the total research impact of a country grows linearly with the amount of national funding for research & development. However, the average impact reveals a peculiar threshold effect: the scientific output of a country may reach an impact larger than the world average only if the country invests more than 120,000 US $ per researcher annually. Our results reveal the overall structure of scientific research by showing the correlation between collaboration, citation, geography and funding, and could provide valuable inputs in shaping the future science policies.
Moderator: Dominika Czerniawska (ICM UW)


Wednesday, 3 October 2012
Journal Club
The second four-pack from the Special Issue on Mining Scientific Publications in D-Lib Magazine:
  1. Visual Search for Supporting Content Exploration in Large Document Collections
  2. Extraction and Visualization of Technical Trend Information from Research Papers and Patents
  3. Specialized Research Datasets in the CiteSeerX Digital Library
  4. Automatic and Interactive Browsing Hierarchy Construction for Scientific Publication Collections
Moderators: Paweł Szostek (1, 3), Mateusz Kobos (2), Piotr Dendek (4)


Wednesday, 26 September 2012
Journal Club
Nicolas Chopin, Andrew Gelman, Kerrie L. Mengersen, Christian P. Robert (2012) "In praise of the referee" arXiv:1205.4304
Abstract: There has been a lively debate in many fields, including statistics and related applied fields such as psychology and biomedical research, on possible reforms of the scholarly publishing system. Currently, referees contribute so much to improve scientific papers, both directly through constructive criticism and indirectly through the threat of rejection. We discuss ways in which new approaches to journal publication could continue to make use of the valuable efforts of peer reviewers.
Moderator: Michał Bojanowski (ICM UW)


Wednesday, 19 September 2012
Journal Club
Small, Henry (2011) Interpreting maps of science using citation context sentiments: a preliminary investigation, Scientometrics 87:373-388
Abstract: It is proposed that citation contexts, the text surrounding references in scientific papers, be analyzed in terms of an expanded notion of sentiment, defined to include attitudes and dispositions toward the cited work. Maps of science at both the specialty and global levels are used as the basis of this analysis. Citation context samples are taken at these levels and contrasted for the appearance of cue word sets, analyzed with the aid of methods from corpus linguistics. Sentiments are shown to vary within a specialty and can be understood in terms of cognitive and social factors. Within-specialty and between-specialty co-citations are contrasted and in some cases suggest a correlation of sentiment with structural location. For example, the sentiment of “uncertainty” is important in interdisciplinary co-citation links, while “utility” is more prevalent within the specialty. Suggestions are made for linking sentiments to technical terms, and for developing sentiment “baselines” for all of science.
Moderator: Wojciech Fenrich (ICM UW)


Wednesday, 12 September 2012
Journal Club
Four short texts from the special issue of D-Lib Magazine:
Moderators: Dominika Tkaczyk, Mateusz Kobos, Piotr Dendek, Łukasz Bolikowski


July - August 2012
Summer break.


Wednesday, 27 June 2012
Journal Club
J. Davis and M. Goadrich, “The relationship between Precision-Recall and ROC curves,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 233-240
Abstract: Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.
Moderator: Piotr Dendek (ICM UW)


Wednesday, 20 June 2012
Journal Club
Cohen, J. (2009) "Graph Twiddling in a MapReduce World", Computing in Science and Engineering 11(4):29-41
Abstract: As the size of graphs for analysis continues to grow, methods of graph processing that scale well have become increasingly important. One way to handle large datasets is to disperse them across an array of networked computers, each of which implements simple sorting and accumulating, or MapReduce, operations. This cloud computing approach offers many attractive features. If decomposing useful graph operations in terms of MapReduce cycles is possible, it provides incentive for seriously considering cloud computing. Moreover, it offers a way to handle a large graph on a single machine that can't hold the entire graph as well as enables streaming graph processing. This article examines this possibility.
Moderators: Łukasz Bolikowski (ICM UW) and Piotr Wendykier (ICM UW)


Wednesday, 5 June 2012
Journal Club
Jean-Charles Lamirel (2012) "A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research" Scientometrics OnlineFirst 2012-05-08
Abstract: The objective of this paper is to propose a new unsupervised incremental approach in order to follow the evolution of research themes for a given scientific discipline in terms of emergence or decline. Such behaviors are detectable by various methods of filtering. However, our choice is made on the exploitation of neural clustering methods in a multi-view context. This new approach makes it possible to take into account the incremental and chronological aspects of information by opening the way to the detection of convergences and divergences of research themes at a large scale.
Moderator: Mateusz Kobos (ICM UW)


Wednesday, 30 May 2012
Prezentacja
Zmiany w systemie komunikacji w środowisku naukowym pod wpływem nowych technologii
Dominika Czerniawska (ICM UW)
Abstrakt: Technologie komunikacyjno-informacyjne zmieniły w sposób fundamentalny sposób komunikacji naukowej. Głównym celem mojej pracy ma być zrozumienie tych zmian. Naukowcy występują tutaj w dwóch rolach: jako producenci i konsumenci treści. Aktywności te zamieniają się w łańcuch cyrkulacji wiedzy w środowisku naukowy, którego zmiany chciałabym poddać analizie. W sferze konsumpcji treści zmiany są nieoczywiste: dotychczasowe badania wskazują, że upowszechnienie się dostępu do baz czasopism elektronicznych było równoczesne ze zmniejszeniem się różnorodności źródeł bibliograficznych w artykułach (Evans 2008), zwiększeniem wykorzystania źródeł nieformalnych (Thelwall i Kousha 2008). Dzięki rozwojowi technologii kom.-inf. możemy obserwować dwa zjawiska: z jednej strony zwiększa się nieustannie korpus tekstów, do których mamy łatwy dostęp, po drugie pojawiają się coraz precyzyjniejsze narzędzia odszukiwania, monitorowania i katalogowania tych treści. Interesujące wydaje się jakie są konsekwencje tego stanu rzeczy. Z drugiej strony naukowcy są też producentami treści i wykorzystują dostępne im narzędzie, aby się tymi treściami dzielić. Wyróżnić tu można komunikację formalną (publikowanie w czasopismach recenzowanych, książki, monografie). Internet pozwolił jednak na wprowadzenie i powszechne wykorzystanie komunikacji bezpośredniej, z pominięciem pośredników (w tym wypadku wydawnictw). Jaka jest rola bezpośredniej i rozproszonej komunikacji jaką oferują repozytoria, self-publishing, open access, strony uczelni itp? Na jakie aspekty komunikacji pomiędzy naukowcami wpływają? Czy zmieniają sposób (m.in. dzięki szybkości, mniejszym barierom) w jaki wiedza dociera do innych naukowców?
Wednesday, 23 May 2012
Journal Club
Bernius, Steffen; Hanauske, Matthias; König, Wolfgang; Dugall, Berndt (2009): Open Access Models and their Implications for the Players on the Scientific Publishing Market, Economic Analysis & Policy, vol. 39, no. 1
Abstract: Open Access (OA) as a new form of distributing scientific literature is broadly accepted by scholars, but in most disciplines the realization of the paradigm is progressing rather slowly. The reason for this lies, on the one hand, in a lack of incentives for individual authors. On the other hand, there are many different approaches to OA and their effects on market participants are complex. In this article we regard the implications of different OA models for scholars, publishers, libraries and funding organizations and try to explain the motivations behind the actions currently taking place on the scientific publishing market.
Moderated by Jakub Szprot (ICM UW)
Wednesday, 16 May 2012
Journal Club
Sune Lehmann, Andrew D. Jackson and Benny E. Lautrup (2008) "A quantitative analysis of indicators of scientific performance" Scientometrics 76(2):369-390
Abstract: Condensing the work of any academic scientist into a one-dimensional indicator of scientific performance is a difficult problem. Here, we employ Bayesian statistics to analyze several different indicators of scientific performance. Specifically, we determine each indicator’s ability to discriminate between scientific authors. Using scaling arguments, we demonstrate that the best of these indicators require approximately 50 papers to draw conclusions regarding long term scientific performance with usefully small statistical uncertainties. Further, the approach described here permits statistical comparison of scientists working in distinct areas of science.
Moderated by Karol Leszczyński (ICM UW)
Wednesday, 9 May 2012
Journal Club
Radicchi, F., Castellano, C. (2012) "A reverse engineering approach to the suppression of citation biases reveals universal properties of citation distributions" arXiv:1203.6742
Abstract: The large amount of information contained in bibliographic databases has recently boosted the use of citations, and other indicators based on citation numbers, as tools for the quantitative assessment of scientific research. Citations counts are often interpreted as proxies for the scientific influence of papers, journals, scholars, and institutions. However, a rigorous and scientifically grounded methodology for a correct use of citation counts is still missing. In particular, cross-disciplinary comparisons in terms of raw citation counts systematically favors scientific disciplines with higher citation and publication rates. Here we perform an exhaustive study of the citation patterns of millions of papers, and derive a simple transformation of citation counts able to suppress the disproportionate citation counts among scientific domains. We find that the transformation is well described by a power-law function, and that the parameter values of the transformation are typical features of each scientific discipline. Universal properties of citation patterns descend therefore from the fact that citation distributions for papers in a specific field are all part of the same family of univariate distributions.
Moderated by Michał Bojanowski (ICM UW)
Wednesday, 25 April 2012. Exceptionally at 10:30
Journal Club
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S. (2007) "Duplicate Record Detection: A Survey", IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16
Abstract: Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
Moderated by Piotr Dendek (ICM UW)
Wednesday, 11 April 2012
Journal Club
Video of Peter Norvig's lecture The Unreasonable Effectiveness of Data, together with an additional supplementary article
Abstract: In decades past, models of human language were wrought from the sweat and pencils of linguists. In the modern day, it is more common to think of language modeling as an exercise in probabilistic inference from data: we observe how words and combinations of words are used, and from that build computer models of what the phrases mean. This approach is hopeless with a small amount of data, but somewhere in the range of millions or billions of examples, we pass a threshold, and the hopeless suddenly becomes effective, and computer models sometimes meet or exceed human performance. This talk gives examples of the data available in large repositories of text, images, and videos, and shows some tasks that can be accomplished with the resulting models.
Wednesday, 4 April 2012
Journal Club
Theresa A. Velden, Asif-ul Haque, Carl J. Lagoze (2010) "A New Approach to Analyzing Patterns of Collaboration in Co-authorship Networks - Mesoscopic Analysis and Interpretation", arXiv:0911.4761
Abstract: This paper focuses on methods to study patterns of collaboration in co-authorship networks at the mesoscopic level. We combine qualitative methods (participant interviews) with quantitative methods (network analysis) and demonstrate the application and value of our approach in a case study comparing three research fields in chemistry. A mesoscopic level of analysis means that in addition to the basic analytic unit of the individual researcher as node in a co-author network, we base our analysis on the observed modular structure of co-author networks. We interpret the clustering of authors into groups as bibliometric footprints of the basic collective units of knowledge production in a research specialty. We find two types of coauthor-linking patterns between author clusters that we interpret as representing two different forms of cooperative behavior, transfer-type connections due to career migrations or one-off services rendered, and stronger, dedicated inter-group collaboration. Hence the generic coauthor network of a research specialty can be understood as the overlay of two distinct types of cooperative networks between groups of authors publishing in a research specialty. We show how our analytic approach exposes field specific differences in the social organization of research.
Moderated by Wojciech Fenrich (ICM UW)
Wednesday, 28 March 2012
Journal Club
Tudor Groza, Gunnar AAstrand Grimnes, Siegfried Handschuh and Stefan Decker (2011) "From raw publications to Linked Data", Knowledge and Information Systems OnlineFirst, DOI: 10.1007/s10115-011-0473-6
Abstract: The continuous development of the Linked Data Web depends on the advancement of the underlying extraction mechanisms. This is of particular interest for the scientific publishing domain, where currently most of the data sets are being created manually. In this article, we present a Machine Learning pipeline that enables the automatic extraction of heading metadata (i.e., title, authors, etc) from scientific publications. The experimental evaluation shows that our solution handles very well any type of publication format and improves the average extraction performance of the state of the art with around 4%, in addition to showing an increased versatility. Finally, we propose a flexible Linked Data-driven mechanism to be used both for refining and linking the automatically extracted metadata.
Moderated by Michał Łopuszyński (ICM UW)
Wednesday, 21 March 2012
Journal Club
Michael P. H. Stumpf, Mason A. Porter (2012) "Critical Truths About Power Laws", Science Vol. 335 no. 6069 pp. 665-666 DOI: 10.1126/science.1216142
Abstract: The ability to summarize observations using explanatory and predictive theories is the greatest strength of modern science. A theoretical framework is perceived as particularly successful if it can explain very disparate facts. The observation that some apparently complex phenomena can exhibit startling similarities to dynamics generated with simple mathematical models (1) has led to empirical searches for fundamental laws by inspecting data for qualitative agreement with the behavior of such models. A striking feature that has attracted considerable attention is the apparent ubiquity of power-law relationships in empirical data. However, although power laws have been reported in areas ranging from finance and molecular biology to geophysics and the Internet, the data are typically insufficient and the mechanistic insights are almost always too limited for the identification of power-law behavior to be scientifically useful (see the figure). Indeed, even most statistically “successful” calculations of power laws offer little more than anecdotal value.
and
TED Talk by Geoffrey West: The surprising math of cities and corporations
Wednesday, 14 March 2012
Presentation
Mateusz Kobos (ICM UW)
Multiresolution classification using a combination of kernel estimators
Wednesday, 7 March 2012
Journal Club
Rawlings, Craig M. and McFarland, Daniel A. (2011) "Influence flows in the academy: Using affiliation networks to assess peer effects among researchers", Social Science Research 40(3):1001-1017
Abstract: Little is known about how influence flows in the academy, because of inherent difficulties in collecting data on large samples of friendship and advice-seeking networks over time. We propose taking advantage of the relative abundance of “affiliation network” data to assess aggregate patterns of how individual and dyadic characteristics channel influence among researchers. We formulate and test our approach using new data on 2034 faculty members at Stanford University over a 15-year period, analyzing different affiliations as potential influence channels for changes in grant productivity. Results indicate that research productivity is more malleable to ongoing interpersonal influence processes than suggested in prior research: a strong, salient tie to a colleague in an authority position is most likely to transmit influence, and most forms of influence are likely to spill over to behaviors outside those jointly produced by collaborators. However, the genders and institutional locations of ego-alter pairs significantly affect how influence flows.
Moderated by Dominika Czerniawska (ICM UW)
Wednesday, 29 February 2012
Presentation
Michał Łukasik (ICM UW)
Multi-label, hierarchical document classification, illustrated by the automatic assignment of Mathematics Subject Classification codes
Wednesday, 22 February 2012
Presentation
Krzysztof Siewicz (ICM UW)
Towards an Improved Regulatory Framework of Free Software
Wednesday, 15 February 2012
Presentation
Dominika Czerniawska (ICM UW)
Scientific career paths
Wednesday, 8 February 2012
Journal Club
L. Kronegger, F. Mali, A. Ferligoj, and P. Doreian (2012) “Collaboration structures in Slovenian scientific communities,” Scientometrics, vol. 90, pp. 631-647
Abstract: We combine two seemingly distinct perspectives regarding the modeling of network dynamics. One perspective is found in the work of physicists and mathematicians who formally introduced the small world model and the mechanism of preferential attachment. The other perspective is sociological and focuses on the process of cumulative advantage and considers the agency of individual actors in a network. We test hypotheses, based on work drawn from these perspectives, regarding the structure and dynamics of scientific collaboration networks. The data we use are for four scientific disciplines in the Slovene system of science. The results deal with the overall topology of these networks and specific processes that generate them. The two perspectives can be joined to mutual benefit. Within this combined approach, the presence of small-world structures was confirmed. However preferential attachment is far more complex than advocates of a single autonomous mechanism claim.
Moderated by Michał Bojanowski (ICM UW)
Wednesday, 1 February 2012
Presentation
Dominika Tkaczyk (ICM UW)
Narrative templates
Wednesday, 25 January 2012
Journal Club
F. Sebastiani, "Machine learning in automated text categorization" (2002) ACM Computing Surveys 34(1):1-47 (sections 1-5)
Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Moderated by Łukasz Bolikowski (ICM UW)
Wednesday, 18 January 2012
Inaugural organizational meeting

Planning

Journal club

In the "Journal Club" we discuss new and interesting items in the literature.

The text selection procedure uses a Google Docs spreadsheet available here, where anyone can add proposals of texts to discuss and where we "vote" by marking our preferences.

The current results are available here.

Spreadsheet format

Rows correspond to texts. The first two columns contain the bibliographic details of the text (column "artykuł") and the URL of the text (column "link"). The remaining columns belong to individual seminar participants and are used for entering ratings. Changes to the spreadsheet are logged.

The spreadsheet is publicly available read-only in CSV format at the following address: https://docs.google.com/spreadsheet/pub?key=0AkA6_NzvdEbbdHZYSmdtX0FubktxNjZ2VUQ5UWM1bVE&single=true&gid=0&output=csv
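The layout described above can be read with Python's standard `csv` module. The following is only a minimal sketch: the sample rows, the participant nicks `alice` and `bob`, and the function name `read_ratings` are hypothetical, and the real export may differ in details such as empty or malformed cells.

```python
import csv
import io

# Hypothetical sample in the layout described above: the "artykuł" and "link"
# columns come first, followed by one column per participant.
SAMPLE_CSV = """artykuł,link,alice,bob
Paper A,http://example.org/a,5,1
Paper B,http://example.org/b,,3
"""


def read_ratings(csv_text):
    """Return {article: {participant: rating}}, skipping blank or non-numeric cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ratings = {}
    for row in reader:
        scores = {}
        for column, cell in row.items():
            if column in ("artykuł", "link"):
                continue  # bibliographic columns, not ratings
            try:
                scores[column] = float(cell)
            except (TypeError, ValueError):
                continue  # assumption: an unparsable cell means "no rating given"
        ratings[row["artykuł"]] = scores
    return ratings


ratings = read_ratings(SAMPLE_CSV)
```

Here `ratings["Paper B"]` contains only the rating from `bob`, because the cell for `alice` is empty.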

Procedures

Adding a new proposed text:

  1. Fill in the next row at the bottom.
    • In the "artykuł" column, enter the bibliographic details.
    • In the "link" column, enter the URL of the text.

Voting on/rating texts:

  • Every seminar participant has their own column for entering ratings. If you do not have one yet, add it by putting your nick in the header.
  • We rate texts by entering points in our own column for the individual texts.
  • Ratings can be arbitrary numbers. When the ranking is built, they will be normalized to the interval [0, 1].
  • A higher number for a given text means a stronger preference for discussing it.


The ranking of texts will be determined as follows:

  1. Individual ratings will be normalized to the interval [0, 1].
  2. For each text, the sum of ratings is computed.
  3. At a chosen frequency, the highest-rated texts will be selected for discussion at specific meetings. I suggest setting the programme two weeks ahead.
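The steps above can be sketched in a few lines of Python. The page does not say how ratings are normalized, so this sketch assumes per-participant min-max scaling; the handling of a participant whose ratings are all equal, the alphabetical tie-break, and all names and sample data are likewise assumptions made for illustration.

```python
def normalize(scores):
    """Min-max scale one participant's ratings to [0, 1] (an assumed scheme)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: identical ratings express no preference, so use 0.5 for all.
        return {text: 0.5 for text in scores}
    return {text: (value - lo) / (hi - lo) for text, value in scores.items()}


def rank(ratings_by_participant):
    """Sum the normalized ratings per text and sort from highest to lowest total."""
    totals = {}
    for scores in ratings_by_participant.values():
        if not scores:
            continue  # participant has a column but has not rated anything
        for text, value in normalize(scores).items():
            totals[text] = totals.get(text, 0.0) + value
    # Ties are broken alphabetically by text (an assumption).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ratings on two different personal scales.
ratings_by_participant = {
    "alice": {"Paper A": 10, "Paper B": 0, "Paper C": 5},
    "bob": {"Paper A": 2, "Paper B": 3, "Paper C": 1},
}
ranking = rank(ratings_by_participant)
```

Normalizing per participant keeps someone who rates on a 0 to 100 scale from outweighing someone who rates on a 1 to 3 scale. Texts a participant has not rated simply contribute nothing to the sum in this sketch.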

Additional information:

  • You can modify your ratings in the spreadsheet at any time. An elaborate analytics engine will analyze and publish the results once a week.
  • People who voted for a given text will have a better chance of becoming the moderator of its discussion.
  • Texts that occupy the last position in the ranking for M consecutive weeks will be removed from the list. I suggest M=4.
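The removal rule in the last bullet could be checked as follows. This is a hedged sketch: it assumes the weekly rankings are kept as lists ordered from best to worst, oldest week first, and the function name `text_to_remove` and the sample history are hypothetical.

```python
def text_to_remove(weekly_rankings, m=4):
    """Return the text that was last in each of the most recent m rankings, else None."""
    if m < 1 or len(weekly_rankings) < m:
        return None  # not enough history to apply the rule
    recent = weekly_rankings[-m:]
    if any(not ranking for ranking in recent):
        return None  # an empty ranking has no last position
    candidate = recent[-1][-1]
    if all(ranking[-1] == candidate for ranking in recent):
        return candidate
    return None


# Hypothetical history: each inner list is one week's ranking, best text first.
history = [
    ["Paper A", "Paper B", "Paper C"],
    ["Paper B", "Paper A", "Paper C"],
    ["Paper A", "Paper B", "Paper C"],
    ["Paper A", "Paper B", "Paper C"],
]
removed = text_to_remove(history, m=4)
```

With the sample history, "Paper C" is last in all four weeks and is therefore the one flagged for removal.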

Presentations

Potential topics for non-reading meetings (in no particular order)

Other, non-specific proposals

Internals

Links to templates: Szablon:CeONJournalClub, Szablon:CeONSeminarium.