
NABDConf (Not Another Big Data Conference) is back for 2017


NABDConf June 1st, Criteo 32 rue Blanche Paris.

NABDConf (Not Another Big Data Conference) is back for 2017!  The conference by developers for developers is again bringing together the engineers who have spent their careers resolving difficult problems at scale with other engineers who are either facing similar problems or who are just plain curious as to what’s going on behind the scenes at web-scale companies like Criteo, Spotify, Uber and Google.

This year we will be discussing topics like moving PB-scale Hadoop operations from bare metal to the Google Cloud, building billion node, billion edge graphs, visualizing tremendous volumes of data in a browser, pretty much everything you would ever want to know about building large scale data pipelines, how one keeps track of PBs of data and raw database performance.

In addition to spending a riveting day with world class engineers in a small format conference, you’ll get to hobnob with them at the after party on Criteo’s world famous (I exaggerate, but only mildly) rooftop deck in the heart of South Pigalle (that’s in Paris for those who don’t know!).

The date is June 1st.  The price is 50€.  All proceeds go to charity.  You can’t go wrong.

Confirmed talks

Speaker: Josh Baer, Spotify

Title: Moving to the Cloud: A story from the trenches

In 2016, the pointy haired managers that lead Infrastructure at Spotify decided they didn’t want to be in the datacenter business. The future, they said, was big and fluffy: the future was the cloud! And it was “Googley” (https://news.spotify.com/us/2016/02/23/announcing-spotify-infrastructures-googley-future/)

At that time, Spotify had 3000+ machines dedicated to running the services to power data processing: from Kafka, to Hadoop, to Spark, Storm and more.  In this talk, Josh Baer — a product owner at Spotify managing the data processing migration— will walk through how the migration of these services has gone so far: outlining which services have translated to the cloud easily, which have been difficult to move and which have been replaced entirely by GCP offerings.

Speaker: Bruno Roggeri, Criteo

Title: Building a billion node / billion edge graph

At Criteo, besides our own cookie ids, we have access to billions of other identifiers:

– Mobile device ids

– Customer ID from thousands of merchants

– Email address hashes

Each of those identifiers can be associated with one another whenever we notice a joint activity.

Yup, this is a graph!

Let’s see how we started computing the groups of connected ids 2 years ago – and the problems we’re starting to face just now as we leverage more and more associations of ids.

Speaker: Francois Visconte, Criteo

Title: DataDisco – One schema to rule them all and kill your data legacy

Big Data begets Big Legacy.

Five years of Hadoop and Kafka in production and you get geological layers of data (90 PB, growing by 100 TB/day), jobs (20,000 running every day), and code (300k+ LOC).

At some point, the ad-hoc, “anything goes” approach doesn’t work anymore: we had to find a path to make the whole stack data-format agnostic, to remove hardcoded datacenter locations and paths, and to move from the obligatory stringly-typed JSON behemoth to a binary columnar format. All of this without converting any data, with no downtime, and with as little impact as possible on the code.

From a schema-less approach we moved to using schemas as the source of truth for both data and infrastructure configuration.

We will see how this approach allows us to:

  • Describe data formats and locations as code
  • Make Hadoop development format-agnostic
  • Create observable data flows
  • Describe infrastructure as code

..and how this actually plays out on a large production system, in terms of infrastructure, code, and data.

Speaker: Rafal Wojdyla, Spotify

Title: Data pipeline at Spotify – from the inception to the production

We all use the same tools and frameworks to process data, but the environment and best practices differ from one company to another. In this talk Rafal, an engineer at Spotify, will present the full journey an idea has to travel, from inception to a fully fledged data pipeline at Spotify. We will cover the tools and frameworks we use to ease the process of bootstrapping, testing, validating and productionizing a new data pipeline. You will hear about some of the open source tools like scio, ratatool, gcs-tools and styx, as well as some internal ones. This talk will give you a sense of how it feels to be a data engineer at Spotify, including all the struggle; you will see that we still have a long way to go.

Speaker: Nicolas Garcia Belmonte, Uber

Title: Visualizing Data with deck.gl

Data is at the core of Uber’s business and is fundamental for making informed decisions. The mission of the Visualization team at Uber is to deliver intelligence through the crafting of visual exploratory data analysis tools. To meet these needs, the team developed an open source visualization stack. In this talk Nicolas will give a brief overview of the Visualization team, their history, mission, and the most challenging problems they tackle. Then he’ll do a deep dive into their core open source components and libraries that power most data products at Uber. He’ll present their abstract and scientific data visualization stack, focusing on deck.gl, a WebGL framework for high-performance visualizations.

Speaker: François Jehl, Criteo, Pawel Szostek, Criteo, Neil Thombre, Criteo (ex-AWS RedShift)

Title: HLL performance characteristics in large scale aggregations over structured data

We’ve all heard about HyperLogLog by now (if you haven’t, don’t worry, there’s an intro!) and how it can approximate billions of distinct values in data structures on the order of a few kilobytes, and while there’s been quite a lot published on its accuracy there is not a ton available on performance.  More to the point, comparing hundreds of millions of 2KB data structures, even when already located in main memory, is expensive.  As a result of this, we asked ourselves, “when should we aggregate HLL synopses and when should we aggregate raw event level data?”

In this presentation we review the performance of HLL in Vertica versus raw event level data (also in Vertica).  Additionally, to get an idea of the “raw” performance of HLL we use Druid, a popular in-memory datastore with native HLL support, as a baseline.

Speaker: Guillaume Bort, Criteo, Justin Coffey, Criteo

Title: Time-series workflow scheduling with Scala in Langoustine

There are many workflow schedulers available today, both in the FOSS and proprietary worlds.  Langoustine is a new offering in this space.  It is close in spirit to Airflow, though written in Scala, with a Scala DSL and a specific focus on scheduling time-series data set pipelines.

Langoustine is in production running the vast majority of Criteo data pipelines on the largest Hadoop cluster in Europe and in this talk we will discuss our hits and misses in building it as well as its future as an Open Source project.

Tickets will officially go on sale next week!

AND MORE TO COME!

 

The post NABDConf (Not Another Big Data Conference) is back for 2017 appeared first on Criteo Labs.


ClrMD Part 3 – Dealing with static and instance fields to list timers


This third post of the ClrMD series focuses on how to retrieve the values of static and instance fields, taking timers as an example. The next post will dig into the details of figuring out which method gets called when a timer triggers. The associated code lists all timers in a dump and covers both articles.

Part 1: Bootstrapping ClrMD

Part 2: Finding duplicated strings with ClrMD

Marshaling data from a dump

Beyond the heap navigation shown in the previous post, the big thing to understand about ClrMD is that the retrieved information is often an address: an address from another address space, because the dump is seen as another process, just as if you were debugging it live. Your code will need to access the memory of that other process at this address; not directly through a pointer/reference indirection or the raw Win32 ReadProcessMemory API, but via ClrMD APIs like GetObjectType or GetValue.

To illustrate how to navigate into the dump address space with ClrMD, we will show how to list the timers that have been started. This can be useful to investigate various issues, such as leaks or timers being stuck.

Know your framework

A naive implementation, like the string example of the previous post, would try to list all object instances in the CLR heap and look at Timer instances only. However, as it has been mentioned already, this is very inefficient in terms of performance; especially for 10+ GB dumps…

It is time to figure out what happens in the .NET runtime when your code creates a new timer. If the source code of the version of the CLR you are using is not available, start your favorite IL decompiler and look at the System.Threading.Timer implementation details. The parameters given to the constructors (such as the due time, period, and callback method, in addition to its optional parameter if any) are not stored in the class itself but in the TimerQueueTimer helper class.

The Timer constructor code, after a few sanity checks, calls the TimerSetup method to wrap a TimerQueueTimer in a TimerHolder that is stored in the Timer m_timer field.

This is where things start to become interesting: this TimerQueueTimer class adds each new instance into a linked list kept by a singleton object stored in the static s_queue field of the TimerQueue class. The following figure shows the relation between instances after three timers are created:

So… a fast way to list the timers would be to get the unique static instance of TimerQueue, look at its m_timers field and iterate on each TimerQueueTimer by following their m_next field until it contains null. The rest of the post details the following operations with ClrMD:

  • quickly getting a ClrType
  • reading a static field
  • reading an instance field

to fill up a collection of our own TimerInfo used to easily create a summary:
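As a rough idea of what that collection could hold, here is a minimal sketch of such a TimerInfo class (the field names are ours, not dictated by ClrMD):

public class TimerInfo
{
    public ulong TimerQueueTimerAddress { get; set; }
    public uint DueTime { get; set; }
    public uint Period { get; set; }
    public ulong StateAddress { get; set; }
    public string StateTypeName { get; set; }
    public string MethodName { get; set; }
}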


This is wrapped inside a helper method described in the next few sections:
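As a hedged sketch, that helper could have the following skeleton (the real implementation is built step by step in the rest of the post):

public static IEnumerable<TimerInfo> EnumerateTimers(ClrRuntime runtime)
{
    // Heap (or GetHeap() depending on the ClrMD version) gives access to the managed heap.
    ClrHeap heap = runtime.Heap;

    // A dump taken in the middle of a garbage collection cannot be walked reliably.
    if (!heap.CanWalkHeap)
        yield break;

    // 1. find the TimerQueue type and its static s_queue field
    // 2. for each AppDomain, read s_queue, then its m_timers field
    // 3. follow the m_next linked list and yield one TimerInfo per TimerQueueTimer
    yield break; // placeholder: each step is detailed below
}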


As explained in the previous post, you need to ensure that the process was not in the middle of a garbage collection when the dump was taken by checking the value of the ClrHeap.CanWalkHeap property.

Standing on the shoulders of giants

I found the different steps needed to access the static fields of classes in the ClrMD implementation on GitHub. In addition to the samples, I highly recommend that you take a look at the classes under the Desktop folder:

These types are using optimized ways to access information from the CLR.

Let’s go back to our first goal: getting the value of the static s_queue field of the TimerQueue class. One of the very efficient optimizations found in the ClrMD implementation is to get a ClrType directly from a module, by calling its GetTypeByName method, instead of iterating the heap until an instance of the type is found. In our case, we need to access TimerQueue, which is a type from mscorlib. Here is the code of the helper function from Desktop\threadpool.cs to get a ClrModule for mscorlib:
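A hedged reconstruction of that helper (the exact property used to match the module may differ between ClrMD versions):

private static ClrModule GetMscorlibModule(ClrRuntime runtime)
{
    // mscorlib is one of the modules loaded in the target: find it by name.
    foreach (ClrModule module in runtime.Modules)
    {
        if (module.Name != null && module.Name.ToLowerInvariant().Contains("mscorlib"))
            return module;
    }

    return null;
}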


The following line sets timerQueueType to the ClrType corresponding to TimerQueue:
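A sketch of that call, mscorlib being the ClrModule returned by the helper above:

ClrType timerQueueType = mscorlib.GetTypeByName("System.Threading.TimerQueue");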

Next, get the ClrStaticField corresponding to the static field s_queue:
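And, as a sketch, the corresponding field lookup:

ClrStaticField staticField = timerQueueType.GetStaticFieldByName("s_queue");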

The staticField variable is not the static instance but rather a way to access it… or them.

But where are my statics!

Let’s take some time to explain a “detail” of the .NET Framework to help you understand how to get the static TimerQueue instance. Unlike previous Windows frameworks, .NET allows a process to contain several running environments called application domains (a.k.a. AppDomains). For better isolation, each AppDomain has its own set of static variables: this is why you need to iterate on each AppDomain with ClrMD to access the static instances:
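A minimal sketch of that iteration, staticField being the ClrStaticField obtained above:

foreach (ClrAppDomain domain in runtime.AppDomains)
{
    // In an AppDomain where no timer was ever created, the static is not initialized.
    object queueAddress = staticField.GetValue(domain);
    if ((queueAddress == null) || ((ulong)queueAddress == 0))
        continue;

    ulong timerQueueAddress = (ulong)queueAddress;
    // ... read its m_timers field and walk the list of TimerQueueTimer (see below) ...
}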

The address returned by ClrStaticField.GetValue is nullable because, in an AppDomain where no TimerQueue has ever been used, its fields won’t be initialized.

We don’t really need to map this address from the dump address space into something usable in the tool. We only need the value of its m_timers field to start iterating on the list of timers.

How to get the values of instance fields?

Now that we have an address in the dump and the ClrType describing the type of the corresponding object (TimerQueue here), it is easy to retrieve the value of one of its instance fields. Since this action is needed again and again to move from one TimerQueueTimer object to the next, it is valuable to create a helper method:
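A hedged reconstruction of such a helper:

public static object GetFieldValue(ClrHeap heap, ulong address, string fieldName)
{
    // The address of the object in the dump gives us its ClrType...
    ClrType type = heap.GetObjectType(address);
    if (type == null)
        return null;

    // ...from which we get the ClrInstanceField describing the field...
    ClrInstanceField field = type.GetFieldByName(fieldName);
    if (field == null)
        return null;

    // ...and finally the value stored at that address in the dump.
    return field.GetValue(address);
}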

The address of the object in the dump is used to get its ClrType. The ClrInstanceField (instead of a ClrStaticField as in the s_queue case) describing the field exposes the expected GetValue method. Note that the return value of GetValue is declared as System.Object, but you should understand it as the numeric value stored in the dump (or the other process address space) at the given address. For simple value types such as booleans, numbers and even ulong addresses, a cast is enough to transparently marshal the value into the tool with ClrMD.

Let’s go back to writing the code to access the head of the TimerQueueTimer list from the TimerQueue static instance:
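A minimal sketch of that loop, using the GetFieldValue helper above and the timerQueueAddress found in the current AppDomain:

object currentPointer = GetFieldValue(heap, timerQueueAddress, "m_timers");

while ((currentPointer != null) && (((ulong)currentPointer) != 0))
{
    ulong timerAddress = (ulong)currentPointer;

    // ... extract the details of this TimerQueueTimer (covered in the next post) ...

    // move to the next TimerQueueTimer in the linked list
    currentPointer = GetFieldValue(heap, timerAddress, "m_next");
}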

currentPointer holds the address of each TimerQueueTimer in the list kept by the static TimerQueue.

Note the ((ulong)currentPointer) != 0 test in the while loop to detect the end of the list when the m_next field is null.

Next step…

After enumerating each timer, the next post will show how to extract details such as the due time, the period, and even which method is called when it ticks.

Post written by:

Christophe Nasarre

Staff Software Engineer, R&D.

Kevin Gosse

Senior Software Engineer

Twitter: KooKiz

The post ClrMD Part 3 – Dealing with static and instance fields to list timers appeared first on Criteo Labs.

It’s All About “Big Data”


Big Data – an overloaded term that spans anything from a few terabytes to petabytes. At Criteo, the real-time nature of our bidding system firmly plants us in the truly Big Data camp. Our bidding system deals with 1M QPS of requests, generating 150+ TB of new data while guaranteeing 99.95% availability. So, you can bet that we appreciate all things truly Big Data. Towards this end, we recently launched a meetup group in Palo Alto, Criteo Labs Tech Talks, that is focused on exchanging knowledge, experience and wisdom that was gained from engineering and managing Big Data systems.

At our first meetup, we hosted Karthik Ramasamy, who gave a talk on Data Stream Processing at Scale. Karthik initiated and oversaw the development of Heron, the next-generation streaming system built at Twitter as a Storm alternative. He is the author of the book “Network Routing – Algorithms, Protocols and Architectures” and has authored several publications and patents in large-scale data processing.

Karthik Ramasamy

Karthik gave us an insider’s view on the journey from Storm to Heron, walking us through the real-time constraints of Twitter and on the limitations that were faced while using the popular alternatives of Kestrel and Kafka and on the need to use distributed logs. He also answered practical questions from the audience on how the Heron system was provided as a service to other groups and on the ease of upgrading topologies. It was a comprehensive talk that spoke to all aspects of Heron’s evolution and we urge you to check it out here.

 

 

We also had a lightning talk by our Staff Dev Lead, Neil Thombre, who gave a quick introduction to the Analytics stack at Criteo, which manages 7 petabytes of data while fielding time-sensitive queries from more than 1,500 users. Neil also spoke about the open source culture at Criteo and highlighted our recent contribution to Vertica and our upcoming Not Another Big Data Conference.

Criteo Labs has an exciting roster of Big Data speakers lined up for the upcoming instances of our Tech Talk Meetup group, spanning topics from large-scale machine learning to database optimization. We look forward to interacting with you there!

Jump into the talks presented below:

Subscribe to our Palo Alto Tech Talks Meetup

Post written by:

Zheng Guo

Staff Software Engineer, R&D – Palo Alto

The post It’s All About “Big Data” appeared first on Criteo Labs.

Data Serialization Comparison


At Criteo, performance is everything, so when we had to pick a serialization format for our data we benchmarked the leading candidates.

The serialization formats considered:

  • Protocol buffers
  • Thrift
  • Avro
  • Json
  • XML

We did the benchmarking in C# (.NET 4.5), using a specialized library: http://benchmarkdotnet.org/.

The data model

Before digging into the implementation details, let us look at the data model. Our data is similar to an Excel workbook: it has many pages, and each page is a table. In our case each table has some keys indexing the rows. One difference with Excel, though, is that each cell may contain more than just a single item. It could be a list of floating point values, a dictionary of floating point values, a dictionary of dictionaries, and so on.

We represent data in a table-like structure. Each column may have a different data type and every cell in a column has the same data type, as shown in the table below. The cell values are implemented as subclasses of a base class called IData. We have one implementation of IData for each type of data structure we want to put in cells.

Key      Column1   Column2    Column3                   Column4                     ...
String   double    Double[]   Dictionary<int, double>   Dictionary<int, double[]>
String   double    Double[]   Dictionary<int, double>   Dictionary<int, double[]>
String   double    Double[]   Dictionary<int, double>   Dictionary<int, double[]>
String   double    Double[]   Dictionary<int, double>   Dictionary<int, double[]>

Table 1 Example of the table like structure.

In order to have a fixed sample of data to serialize, we wrote a data generator that randomly generates the different possible values for each type of column.

The XML Story

The original implementation was in XML so it became the reference benchmark for the other formats. Implementation was easy using the standard .net runtime serialization, simply decorate the classes with the correct attributes and voila.

Figure 1 DataContract annotations for serialization

The interesting part is the “[DataContract]” and “[DataMember]” attributes which indicate to the serializer what members to serialize.
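As a hedged illustration (class and member names are ours, not the original ones), the annotated classes look roughly like this:

using System.Runtime.Serialization;

public interface IData { }

[DataContract]
public class DoubleArrayData : IData
{
    [DataMember]
    public double[] Values { get; set; }
}

[DataContract]
[KnownType(typeof(DoubleArrayData))]   // tells the serializer about the concrete IData types
public class Row
{
    [DataMember]
    public string Key { get; set; }

    [DataMember]
    public IData[] Cells { get; set; }
}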

The JSON Story

Json is supposed to be faster and more lightweight than XML. The serialization is handled by the Newtonsoft Json.NET library, easily available in C#. There is just one small glitch here: in order to correctly serialize and deserialize such dynamic data, we had to set the type name handling to automatic. This results in Json text with an extra type field.

For example.
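A hedged sketch with Json.NET, assuming the Row/IData classes sketched above (the resulting JSON shape is illustrative):

using Newtonsoft.Json;

var settings = new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.Auto };

string json = JsonConvert.SerializeObject(row, settings);
Row roundTripped = JsonConvert.DeserializeObject<Row>(json, settings);

// Each polymorphic cell now carries an extra "$type" property, e.g.:
// { "$type": "MyApp.DoubleArrayData, MyApp", "Values": [ 1.0, 2.0 ] }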

The Protocol Buffer story

Protocol Buffers has a lot of hype, which kind of makes sense: binary formats are most of the time faster than text formats, and the data model for the messages can be generated in many languages from the protobuf IDL file. The drill here is to create the IDL, generate the C# objects, write converters and serialize!

But wait, which library should we use? We found at least three different NuGet packages, two of which claimed to implement the same version, Protobuf V3.

After much investigation, we realized that Google.Protobuf is provided by Google and had the best performance. Protobuf3 is compiled by an individual from the same source code but it is slower.

There is more than one way to solve the problem with protobuf and we decided to try three different implementations and compare the performance.

First implementation

This implementation is referenced as protobuf-1 in our benchmarks. The design had to solve the problem of storing a polymorphic list. This had to be done using inheritance, and this blog post explores different methods of implementing it. We compared them and chose the type identification field approach as it had better performance.

Let’s see the example.
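As a hedged sketch of what that first IDL might have looked like (message and field names are ours): every cell becomes a DataColumnMessage with a type tag, and only the field matching the tag is filled in.

syntax = "proto3";

message IntDoubleListPair {
  int32 key = 1;
  repeated double values = 2;
}

message DataColumnMessage {
  int32 type = 1;                              // tells which payload field is set
  double double_value = 2;
  repeated double double_array_value = 3;
  repeated IntDoubleListPair dictionary_value = 4;
}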

Here, each cell of the table would contain one object of DataColumnMessage, which would have one field filled with values and the rest of them set to null. Protobuf does not store null values for optional fields, so the file size should not change a lot. But this still meant 4 null values, and if the number of fields increases, that would mean an even higher number of null values. Does that affect the performance? Keep reading for the comparison of results.

Second Implementation

This implementation is referenced as protobuf-2 in our benchmarks. We knew that each column has the same data type, so we tried a column-based design. Instead of creating one value for each cell, we decided to create one object per column. This object stores one field for the type of the objects stored, and a repeated value with one entry per cell, therefore drastically decreasing the number of null values and of “type” fields.

Let’s look at what the IDL could look like:
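A hedged sketch of the column-based message (again, names are ours): the type is stored once per column and each payload field holds one entry per row.

syntax = "proto3";

message DoubleList {
  repeated double values = 1;
}

message DataColumnMessage {
  int32 type = 1;                              // data type shared by the whole column
  repeated double double_values = 2;           // one entry per row for double columns
  repeated DoubleList double_array_values = 3; // one entry per row for double[] columns
}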

We believed that this should improve the performance by quite a lot in comparison to the previous version.

Third Implementation

This implementation is referenced as protobuf-3 in our benchmarks. We wanted to leverage the new “map” keyword introduced in protobuf version 3 and benchmark its performance. This is the new specialized way of defining dictionaries, so we were hoping for performance improvements. Our hypothesis was that we don’t need to allocate and copy data while converting to our business objects. Are we right? We’ll see in the comparisons.

The dictionary object description changes from a repeated list of key/value pairs (Protobuf V2) to a map (Protobuf V3).
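A hedged sketch of the two styles side by side:

// Protobuf V2 style: a dictionary is a repeated list of key/value pair messages.
message DictionaryEntry {
  optional int32 key = 1;
  optional double value = 2;
}

message CellV2 {
  repeated DictionaryEntry entries = 1;
}

// Protobuf V3 style: the dedicated map keyword.
message CellV3 {
  map<int32, double> entries = 1;
}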

This generates code that directly uses the C# dictionary implementation.

The Thrift Story

You know the drill: create the IDL first, then generate the message objects, write the needed conversions and serialize. Thrift has a much richer IDL, with a lot of things that do not exist in protobuf. In the test example, one big advantage we had was the availability of the “list” keyword. This meant that we could specify nested collections, such as a map of lists, directly in the IDL.

The rest of the IDL is not that different from protobuf.

Let’s look at the example.

In our simple case, the Thrift IDL allows us to specify our map of lists in a single line, where the Protobuf IDL needs an intermediate message type:
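A hedged side-by-side sketch (names are ours):

// Protobuf IDL: nested collections require an intermediate message type.
message DoubleList {
  repeated double values = 1;
}

message Cell {
  map<int32, DoubleList> entries = 1;
}

// Thrift IDL: the same structure fits on a single line.
struct Cell {
  1: map<i32, list<double>> entries
}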

We found Thrift mandates a stricter format where you declare classes before using them. It also does not support nested classes. However it natively supports lists and maps, which simplified the IDL file.

Thrift syntax is far more expressive and clear. But is it the same for performance? Keep reading to find out.

The Avro story

Apache Avro is a serialization format whose support in C# is officially provided by Microsoft. As with the other serialization systems, one can create a schema (in JSON) and generate C# classes from the schema.

The example of Avro JSON Schema (excerpt):
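A hedged excerpt of what such a schema looks like (record and field names are ours); note how even a simple dictionary, serialized as a list of pairs, balloons into nested records:

{
  "type": "record",
  "name": "DataColumn",
  "namespace": "benchmark",
  "fields": [
    { "name": "Key", "type": "string" },
    { "name": "DoubleValue", "type": "double" },
    { "name": "DoubleArrayValue", "type": { "type": "array", "items": "double" } },
    { "name": "DictionaryValue", "type": { "type": "array", "items": {
        "type": "record", "name": "IntDoublePair", "fields": [
          { "name": "Key", "type": "int" },
          { "name": "Value", "type": "double" }
        ] } } }
  ]
}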

We found the JSON schema very verbose and redundant in comparison to the other serialization formats. It was a bit difficult to actually write and generate the classes. So we took a shortcut and generated the schema by using the DataContract annotations. All we want is performance and no difficult Schema language could stop us from evaluating all the options.

Note that there exists an IDL for Avro, with a Java tool that generates the JSON schema from the IDL file. We didn’t evaluate it because we preferred to let the C# Avro library generate the schema from the annotated code.

But, there was another blocker. It was the inconsistent support for dictionary-like structures. For instance, we found that the C# version correctly serializes a dictionary whose value type is a list. But it refuses to serialize a dictionary of dictionaries. In this case, it doesn’t throw any exception. Instead it will just silently fail to serialize anything.

We also came across a bug opened at Microsoft’s github two years ago (still open), where a dictionary with more than 1024 keys throws an exception when serialized. (https://github.com/Azure/azure-sdk-for-net/issues/1487)

Given these constraints we had to serialize dictionaries as lists of key/value pairs and create all intermediary classes with annotations. This made the schema more complex. Is it now going to beat the other formats? Let’s find out.

Results

We split our benchmarks into two configurations:

  • Small objects, where the lists and dictionaries contain few items (fewer than 10). This configuration is typical of a web service exchanging small payloads.
  • Big objects, where we accept lists with many hundreds of items. This configuration is representative of our Map/Reduce jobs.

We measured the following metrics:

  • Serialization time
  • Deserialization time
  • Serialized file size

In our case the deserialization is more important than serialization because a single reducer deserializes data coming in from all mappers. This creates a bottleneck in the reducer.

Small Objects

File Sizes

Table 2 Small objects serialized file sizes in Bytes

  1. All binary formats have similar sizes except Thrift which is larger.
  2. 2nd implementation of protobuf is the smallest among other protobuf implementations due to the optimization achieved with column based design.
  3. 3rd implementation of protobuf is a bit bigger which implies that the map keyword in protobuf increases the file size by a little.
  4. Json is of course smaller than XML.
  5. XML is the most verbose so the file size is comparatively the biggest.

Performance

Table 3 Small objects serialization time in micro-seconds

Thrift and protobuf are on par. Avro is a clear loser.

  1. The different implementations of protobuf
    1. The 3rd implementation of protobuf, which uses the map keyword, is around 60% slower than the other protobuf implementations. It is not recommended if you are looking for raw performance, but it inherently guarantees key uniqueness, which is a tradeoff.
    2. 2nd implementation is not optimal. The column based design doesn’t show its full effect for small objects.
    3. Serialization is generally quicker than deserialization which makes sense when we consider the object allocation necessary.
  2. The numbers confirm that text formats (xml, json) are slower than binary formats.
  3. We would never recommend using Avro for handling small objects in C#. Maybe in other languages the performance would be different. But if you are considering developing microservices in C#, this would not be a wise choice.

Large Objects

File Size

Table 5 Large objects serialized file size in MegaBytes

  1. Avro is the most compact but protobuf is just 4% bigger.
  2. Thrift is no longer an outlier for the file size in the binary formats.
  3. All implementations of protobuf have similar sizes.
  4. Json is still smaller than XML.
  5. XML is still the most verbose so the file size is comparatively the biggest.

Table 6 Large objects serialization time in milli-seconds

  1. This time, Thrift is a clear winner in terms of performance with a serialization 2.5 times faster than the second best performing format and a deserialization more than 1.3 times faster.
  2. Avro, which was a clear disappointment for small objects, is quite fast here. This version is not column based; a column-based design could be expected to make it a little faster still.
  3. The different implementations of protobuf
    1. Column based 2nd implementation of protobuf is the winner. The improvement is not huge, but the impact of this design kicks in when the number of columns starts to be very high.
    2. Serialization is generally quicker than deserialization which makes sense when we consider the object allocation necessary.
  4. Serializing XML is faster than Json; deserializing Json, on the other hand, is way faster.

Final conclusion

Protobuf and Thrift have similar performance, in terms of file sizes and serialization/deserialization time. The slightly better performance of Thrift did not outweigh the easier and less risky integration of Protobuf, which was already in use in our systems; hence the final choice.

Protobuf also has better documentation, whereas Thrift lacks it. Luckily there was the “missing guide” that helped us implement Thrift quickly for benchmarking: https://diwakergupta.github.io/thrift-missing-guide/#_types

Avro should not be used if your objects are small. But it looks interesting for its speed if you have very big objects and don’t have complex data structures as they are difficult to express. Avro tools also look more targeted at the Java world than cross-language development. The C# implementation’s bugs and limitations are quite frustrating.

The data model we wanted to serialize was a bit peculiar and complex, and the investigation was done using C# technologies. It could be quite interesting to do the same investigation in other programming languages. The performance could be different for different data models and technologies.

We also tried different implementations of protobuf to demonstrate that the performance can be improved by changing the design of the data model. The column-based design had a very high impact on performance for our real problem.

 

Post written by:


Afaque Khan

Software Engineer, R&D – Engine – Data Science Lab

Frederic Jardon

Software Developer, R&D – Engine – Data Science Lab

 

The post Data Serialization Comparison appeared first on Criteo Labs.

Romain Lerallut on scaling, scaling, scaling


Tell us about your background and how you ended up at Criteo.

After defending my Ph.D. in image processing in 2006, I joined a French start-up called A2iA working on handwriting recognition. When our robot overlords take over the world, they’ll be able to read your post-it notes on the fridge thanks to our work. For five years, I worked on the engineering aspects of building high-performing machine learning algorithms. This made me a good fit to join the budding Engine team at Criteo, back in 2011.

I joined as Lead of the Engine Infrastructure team, building the pipes necessary to launch the first modern ML algorithms to power our Engine. Then I evolved progressively to a management role, still working on our Engine. I don’t build software anymore; I build the teams that will build the software. I still see it as engineering. It’s Engineer Engineering!

What unique challenge or problem have you worked on here at Criteo?

Scaling, scaling, scaling. Scaling everywhere. Scaling everything. From the number of requests to process in real time, to the amount of data to train our models on, to organizing the work of now 500 R&D engineers. It’s all about growing as fast as we can while maintaining, and even increasing our quality standards.

What makes working in engineering at Criteo unique or different than your other experiences?

It’s very dynamic. Things are constantly in flux, and nothing is set in stone. We are in a permanent ‘try-and-learn’ mode, whether it’s for technologies, project management techniques, or HR processes. We strive to adapt to changing environments, we reshape our organizations to handle our growth, and we always try to make things work better.

What advice would you give to someone to be successful in your role?

As a manager, you need to show that you add value and help your teams deliver. You also need to make sure your teams are empowered and have the right level of information to make the right decisions (what technology to use, what trade-offs to make to deliver faster). You also need to hire people who will fit well in the various teams and make sure they are constantly challenged and motivated. And then you just need to learn to step out of their way and watch the magic happen.

What’s your favorite project you’ve worked on?

The projects I worked on directly are long completed and have been replaced by newer and better stuff. I still have a few lines of code in production that get called several million times per second. I don’t think about it very often, but when I do, I admit to feeling a tiny trickle of pride.

How has Criteo changed since you started 5 years ago?

It’s been a wild ride! I joined a post-startup company of fewer than 400 people, and it was pretty much the Wild West. Release process? What process? Just dump your binaries on a prod server and see if it works. Testing? Why? It’s for poor developers and insecure people who fear bugs. The “Hero Culture” of 2011 was very exciting but also exhausting. You never really knew what was going on. I discovered the joy of seeing my code go into production in a matter of days. I fondly remember posting about it on Facebook: “First line of code at Criteo deployed today. If the Internet breaks, it may be me.”

Fast-forward five years, and we have kept most of the good things and even added a few. I don’t miss much that we discarded. We’ve added some processes — of course, you can’t dump your binaries directly on the prod server anymore (thankfully!) — and absolutely all code is peer-reviewed. We even have decent code coverage. But we are still shipping like crazy. The problems have gotten tougher, and we need to think even harder to solve them.

Bonus point: I still have not broken the Internet.

For more updates on Romain, follow him on twitter: @RLerallut 

The post Romain Lerallut on scaling, scaling, scaling appeared first on Criteo Labs.

ClrMD Part 4 – What callbacks are called by my timers?


This fourth post of the ClrMD series digs into the details of figuring out which method gets called when a timer triggers. The associated code lists all timers in a dump.

Part 1: Bootstrapping ClrMD to load a dump.

Part 2: Finding duplicated strings with ClrMD heap traversing.

Part 3: List timers by following static fields links.

Looking at my timer

In the previous post, we explained how to access a static field of TimerQueue to start iterating the list of TimerQueueTimer wrapping the created timers. Now that the currentPointer variable contains the address of each TimerQueueTimer, it is time to extract the details of the timer we have created.

The following code extracts the value of the TimerQueueTimer fields corresponding to each Timer thanks to the GetFieldValue helper presented in the previous post:
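A hedged sketch using the GetFieldValue helper and the TimerInfo class from the previous post (the field names come from the .NET Framework implementation of TimerQueueTimer and may differ across runtime versions):

var info = new TimerInfo
{
    TimerQueueTimerAddress = timerAddress,
    DueTime = (uint)GetFieldValue(heap, timerAddress, "m_dueTime"),
    Period = (uint)GetFieldValue(heap, timerAddress, "m_period"),
};

object state = GetFieldValue(heap, timerAddress, "m_state");
info.StateAddress = (state == null) ? 0 : (ulong)state;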

Note that the value of m_dueTime is always the same as the value of m_period. This is not a bug: it seems that .NET only keeps the due time during construction and reuses the corresponding field for other purposes afterwards.

The m_state field case is a little bit more complicated to decipher because the type of the object passed to the timer needs to be figured out in addition to its address, if the latter is not null:
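A minimal sketch of that step:

if (info.StateAddress != 0)
{
    // GetObjectType is the cheap way to get a ClrType from an arbitrary address.
    ClrType stateType = heap.GetObjectType(info.StateAddress);

    // ClrMD may return null if the internal CLR state is corrupted.
    info.StateTypeName = (stateType != null) ? stateType.Name : "?";
}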

As usual with ClrMD, you need to get the ClrType corresponding to the object referenced by an address before being able to access its fields or to get its name. However, instead of looking into a module as it has been done for TimerQueue, it is easier and more efficient to call the GetObjectType from ClrHeap. Remember that the mandatory test against a null value for the ClrType might seem overkill but the ClrMD implementation states that it is possible that the internal CLR state could be corrupted.

What is the timer callback?

The last piece of information to retrieve is the callback the timer will call when it triggers. The _timerCallback field references a TimerCallback instance that stores these details.

But how to get the name of the method just with the address of a TimerCallback object? Again, open up your favorite decompiler and look at the type hierarchy:

Here are the two fields of the Delegate type that are interesting:

The _methodPtr field stores the pointer to the method. Conveniently, the ClrRuntime GetMethodByAddress method takes this address and returns the corresponding ClrMethod, from which we get the name!

If this method is static, the _target field is null. Otherwise, it stores the value of this, the hidden parameter received by all instance methods. In case of type inheritance, it is interesting to know which override will be called. All these steps are wrapped in the following helper function:
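A hedged reconstruction of such a helper (the real implementation in the sample code may differ, and the cast of _methodPtr assumes the field is surfaced as a signed integer):

private static string BuildTimerCallbackMethodName(ClrRuntime runtime, ClrHeap heap, ulong timerCallbackAddress)
{
    object methodPtr = GetFieldValue(heap, timerCallbackAddress, "_methodPtr");
    if (methodPtr == null)
        return "";

    ClrMethod method = runtime.GetMethodByAddress((ulong)(long)methodPtr);
    if (method == null)
        return "";

    // _target is null for a static callback; otherwise it is the hidden "this",
    // whose type tells us which override will actually be called.
    object target = GetFieldValue(heap, timerCallbackAddress, "_target");
    if ((target == null) || ((ulong)target == 0))
        return method.Type.Name + "." + method.Name;

    ClrType targetType = heap.GetObjectType((ulong)target);
    string typeName = (targetType != null) ? targetType.Name : method.Type.Name;
    return typeName + "." + method.Name;
}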

Building a usable summary

Even though the EnumerateTimers helper provides a way to list all timers, you often don’t want to show them all, especially when thousands exist and most of them are duplicates. The sample code associated with this post lists the different timers, counts the duplicates and sorts the results by duplicate count, as shown in the following screenshot:

Next step…

After timers, the next post will show how to integrate your ClrMD-based code into an extension for WinDBG to help deciphering Task state.

Post written by:

Christophe Nasarre

Staff Software Engineer, R&D.

Kevin Gosse

Senior Software Engineer

Twitter: KooKiz

 

The post ClrMD Part 4 – What callbacks are called by my timers? appeared first on Criteo Labs.

NABDConf 2017 lives up to expectations!


 

This, the second year of NABDConf, was, dare I say it, even better than the inaugural 2016 edition!  This year not only did the talks exceed expectations, but so did the weather, with both lunch and evening festivities held on Criteo Paris’ rooftop deck.

You can of course have a look at the schedule of talks to get the abstracts, but I’d like to touch on a few of the highlights.

First we had a great talk from Josh Baer of Spotify to open the conference reviewing the how and the why of getting Spotify up and running on GCP.

Josh Baer, Spotify

Nicolas Belmonte from Uber then wowed the crowd with some ridiculously beautiful in-browser visualizations built off of deck.gl.

Nicolas Belmonté, Uber

While I don’t want to toot Criteo’s horn too much, we did close out the morning session with François Jehl, Pawel Szostek and Neil Thombre’s work on HLL which shows huge promise for distinct counts on OLAP workloads in Vertica.

 

Lunchtime on our rooftop

In the afternoon we had something of a data-production track with inspiring stuff on the data developer’s work cycle at Spotify from Rafal Wojdyla and then a little bit of data workflow development history (and future!) from Guillaume Bort and myself in which we introduced our new open source scheduler Cuttle.  We closed the track with the final talk on the subject from Marc Bux of Humboldt University in Berlin with his approach on scheduling scientific workflows in YARN.

Rounding out the talks, we had BigGraphite (Graphite on Cassandra) from Corentin Chary, a look at how we build our billion node, billion edge user graph from Bruno Roggeri, and a discussion of the best-named project ever, DataDisco (Criteo’s HDFS data schema/discovery framework), from Francois Visconte and Mathieu Chataigner.

Presentations (Videos)

Last year we had lots of requests to put presentations online, and I am very happy to say that not only have we done so, but we took the extra step of filming all of the talks as well.  You can relive this year’s experience via the videos below:

NABDConf Intro

Moving to the Cloud: A Story from the Trenches – Josh Baer, Spotify

Visualizing Data with deck.gl – Nicolas Garcia Belmonte

HLL performance characteristics in large-scale aggregations over structured data – François Jehl, Pawel Szostek, Neil Thombre, Criteo

Building a billion node / billion edge graph – Bruno Roggeri, Criteo

BigGraphite – Graphite meets Cassandra to Scale Monitoring at Criteo – Corentin Chary, Criteo

Data pipeline at Spotify – from the inception to the production – Rafal Wojdyla, Spotify

One schema to rule them all and kill your data legacy – Francois Visconte, Mathieu Chataigner, Criteo

Time-series workflow scheduling with Scala in Langoustine – Guillaume Bort, Justin Coffey, Criteo

Hi-WAY: Execution of Scientific Workflows on Hadoop YARN – Marc Bux, Humboldt University of Berlin

Photos from event

Curious to see how things went down at this year’s event? Follow this link 

A special thank you to our speakers and sponsors (Vertica and Criteo) and of course to our wonderful attendees.  Without all of you this conference wouldn’t exist!

See you in 2018 for yet Not Another Big Data Conference.

Justin Coffey

The post NABDConf 2017 lives up to expectations! appeared first on Criteo Labs.

Suju Rajan on how innovation and machine learning are the core of Criteo


Tell us about your background and how you ended up at Criteo.
I started out as an intern building behavioral targeting models in Yahoo Labs and got to experience a variety of exciting applications of machine learning from search relevance to personalized recommendations. In parallel, on the management side, I grew from being an intern to directing a research team.

I had friends who worked at Criteo, and I was very aware of the importance given to machine learning here. When I was offered the opportunity to create and grow a research group, I had to sign on to the challenge.

What about our product most excites you or makes you proud to work here?
Even when interviewing at Criteo, I was very impressed by the forward and lateral view that the engineering and product teams had in wanting to constantly identify new opportunities to invest in. Criteo is the only success story in the space of publicly traded ad tech companies. The interviewers could have sold me on just the merit of our market position. Instead, the focus was on what we could do better, what opportunities we are missing out on, and where we could go with the data assets we have. This urge to stay ahead of the game, to constantly innovate and the laser-sharp focus on performance is what makes me proud to work here.

What makes working at Criteo different from your other experiences?
The strong mathematical and statistical training that our engineers and product managers have is a differentiator. It is easy to have an in-depth conversation on what the research team is working on and to get meaningful feedback on how that fits within the engineering and product roadmap. I have even had very insightful conversations with folks on the sales team on how the research team can help translate their insights from client conversations to factor into our machine-learned models. How cool is that? The whole company has a strong sense for what it means to be data-driven.

What’s your favorite project you’ve worked on?
We kicked off an internal machine learning “sabbatical” for our engineering counterparts — the Machine Learning Bootcamp. Members of the research team offer a set of classes that focus on the theoretical aspects of machine learning algorithms from simple categorization to reinforcement learning and, of course, deep learning. Our Bootcampers then sit alongside the research team to deep-dive into a machine learning research project. It has been exciting to see the enthusiasm with which the Bootcamp has been received, and we have even reached the point where our Bootcampers are publishing papers based on their projects.

What are you excited about in the next year?
Well, three topics!

First, something that is dear to my heart is having more women leaders in tech. Our tech hiring and HR teams are making a concerted effort to reach out to women in STEM fields. We are kicking off a number of efforts both in Paris and Palo Alto to engage with women and other minorities. We at Criteo want to do our part by lending our support, sharing our stories to hopefully both inspire and learn. Stay tuned on this front!

Second, deep learning is the hottest topic in the machine learning community these days. The research team has kicked off a bunch of efforts to leverage this class of models at Criteo in applications from product disambiguation to improving our bidders. I am eagerly anticipating the result of these research efforts in the short term.

Finally, we recently announced the recipients of the Criteo Faculty Research Award which funds the research of some stellar ML researchers in academia. The proposals cover a wide swath of new and innovative approaches to solving some of the harder and prevalent challenges in real time bidding systems and computational advertising at large. Imagine having some of the best minds in the world work on problems we care about — now that is super exciting!


The Criteo engineering team is growing across Paris, Palo Alto, and Ann Arbor. We’d love for you to join us.

The post Suju Rajan on how innovation and machine learning are the core of Criteo appeared first on Criteo Labs.


ClrMD Part 5 – How to use ClrMD to extend SOS in WinDBG


This fifth post of the ClrMD series shows how to leverage this API inside a WinDBG extension. The associated code allows you to translate a task state into a human readable value.

Part 1: Bootstrap ClrMD to load a dump.

Part 2: Find duplicated strings with ClrMD heap traversing.

Part 3: List timers by following static fields links.

Part 4: Identify timers callback and other properties.

Introduction

Since the beginning of this series, you have seen how to use ClrMD to write your own tool to extract meaningful information from a dump file (or a live process). However, most of the time, you are also using WinDBG and SOS to navigate inside the .NET data structures.

It would be convenient if you could leverage the new .NET exploration features based on ClrMD the same way you are using SOS. This post will explain how to achieve this goal by implementing an extension that exports commands callable from within WinDBG.

Deciphering a Task status

During one of our debugging investigations, we needed to get the value of the Status property for a few Task instances. If you take a look at the implementation of the property getter in a decompiler (or from source code), you will see that it is computed based on the value of the internal m_stateFlags field.

In WinDBG, the !DumpHeap -stat command lists all types with their instance count. If the .prefer_dml 1 command has been set, you even get hyperlinks on some values such as the address or MT (for MethodTable). If you click the MT value for System.Threading.Tasks.Task, you get all instances of type Task:

Click any address and look at the value of the  m_stateFlags field:

It is easy to automate the retrieval of the m_stateFlags instance field value with ClrMD as explained earlier:
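A minimal sketch of that retrieval (the sample code may differ in its details):

private static int GetTaskStateFromAddress(ClrHeap heap, ulong address)
{
    // Make sure the address really points to a Task before reading its field.
    ClrType type = heap.GetObjectType(address);
    if ((type == null) || !type.Name.StartsWith("System.Threading.Tasks.Task"))
        return 0;

    ClrInstanceField field = type.GetFieldByName("m_stateFlags");
    return (field != null) ? (int)field.GetValue(address) : 0;
}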

The ClrType corresponding to the address is first checked to ensure that it represents a Task instance. Next, its GetFieldByName helper method returns a ClrInstanceField that provides the status via its GetValue function.

The next step is to transform this number into a TaskStatus enumeration value by simply using a decompiler and copying the logic from the Task getter code:
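A hedged sketch of that translation; the flag values below were copied from the .NET Framework reference source (Task.cs) at the time of writing and should be checked against the runtime used by your dump:

private const int TASK_STATE_STARTED = 0x10000;
private const int TASK_STATE_DELEGATE_INVOKED = 0x20000;
private const int TASK_STATE_FAULTED = 0x200000;
private const int TASK_STATE_CANCELED = 0x400000;
private const int TASK_STATE_WAITING_ON_CHILDREN = 0x800000;
private const int TASK_STATE_RAN_TO_COMPLETION = 0x1000000;
private const int TASK_STATE_WAITINGFORACTIVATION = 0x2000000;

private static TaskStatus GetTaskState(int flags)
{
    if ((flags & TASK_STATE_FAULTED) != 0) return TaskStatus.Faulted;
    if ((flags & TASK_STATE_CANCELED) != 0) return TaskStatus.Canceled;
    if ((flags & TASK_STATE_RAN_TO_COMPLETION) != 0) return TaskStatus.RanToCompletion;
    if ((flags & TASK_STATE_WAITING_ON_CHILDREN) != 0) return TaskStatus.WaitingForChildrenToComplete;
    if ((flags & TASK_STATE_DELEGATE_INVOKED) != 0) return TaskStatus.Running;
    if ((flags & TASK_STATE_STARTED) != 0) return TaskStatus.WaitingToRun;
    if ((flags & TASK_STATE_WAITINGFORACTIVATION) != 0) return TaskStatus.WaitingForActivation;
    return TaskStatus.Created;
}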

It would be a time saver if this translation could be done by a command right inside WinDBG instead of relying on another tool based on ClrMD in which addresses are pasted.

 

WinDBG extension 101

In addition to being a native Windows debugger, WinDBG supports extensions: .dll files that you load with the .load command. They export commands that are callable from within WinDBG with the “!” prefix. These commands are plain native exports that can be seen with tools such as Dependency Walker (http://www.dependencywalker.com/), as shown in the next screenshot:

As you can see, all SOS commands are functions exported by the sos.dll native binary. Before digging into the extension functions implementation, notice that a few other functions could also be exported. Among them, the DebugExtensionInitialize function provides version information (i.e. which version of the debugging API is expected) and must be exported to be called by WinDBG when the dll is loaded.

Read this post for more details about how to develop a native WinDBG extension.

All extension command functions take two parameters:

  • An IDebugClient instance to interact with WinDBG
  • An ANSI string for the arguments (such as “-stat” for !dumpheap)

The bridge between your extension commands and WinDBG is provided by the IDebugClient COM interface. But don’t be scared: no need to manually deal with native COM interface with ClrMD! The DataTarget.CreateFromDebuggerInterface method takes an IDebugClient interface and returns an instance of DataTarget. As you might remember from the initial post of this series, DataTarget is the gateway to the dump (or live-debugged attached process): we are now back to the known ClrMD world.

 

Reuse ClrMD Samples

Fortunately, most of the glue to bind the native world to ClrMD is already available! You simply reuse the partial DebuggerExtensions class given in the samples.
You extend the class with your extension methods that take the following signature:
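A sketch of the expected shape (the command name tkstate is ours):

public static void tkstate(IntPtr client, [MarshalAs(UnmanagedType.LPStr)] string args)
{
    // client: the IDebugClient interface pointer provided by WinDBG
    // args:   everything the user typed after the command name
}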

 

The first parameter is a pointer to the IDebugClient interface provided by WinDBG. The first thing to do in your extension command method is to call the InitApi static method with the interface pointer and let the magic happen.

After that call, the output of the Console will be redirected to WinDBG and your code is free to use the following properties to access the dump via ClrMD:

The second parameter args received by your method is a string that contains the parameters added by the user after the name of your command. For example, if the user types “MyCommand param1 param2”, the args parameter will be “param1 param2”.

 

Exposing native functions

The last part of magic glue is how to export a native function from a .NET assembly. This is made possible by the UnmanagedExports nuget package by Robert Giesecke.

Once added to your project, decorate the functions to export with the DllExport attribute and the native name of the function that will be visible in WinDBG as a command.
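A hedged sketch (the command and helper names are ours):

[DllExport("tkstate")]
public static void tkstate(IntPtr client, [MarshalAs(UnmanagedType.LPStr)] string args)
{
    InitApi(client);   // binds Console output and ClrMD to the WinDBG session
    OnTkState(args);   // internal helper shared by all exported variants of the command
}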

 

There is a little trick here: the names of exported functions are case sensitive for WinDBG. If you take a look again at sos.dll in Dependency Walker and sort exports by Function column, you will notice a few duplicates such as CLRStack ClrStack clrstack as shown in the following screenshot:

For usability’s sake, it is a good practice to provide several syntaxes for the same command, including short versions such as !dso for !DumpStackObjects in SOS. Unfortunately the DllExport attribute does not allow multiple applications on the same method with different exported names. You need to define a different method per exported name, and all of them will call the same internal helper method.

Thanks to the GetTaskStateFromAddress and GetTaskState helper methods described earlier, the implementation of the OnTkState method is straightforward once the address or the value has been extracted from the args parameter.

 

Don’t forget your user: implement help

A good extension always provides a help command that (1) lists the available commands with their shortcuts and (2) gives additional details on each command. Simply add a new file that defines the exports for help/Help and parses the string argument if needed.

Tips to use the extension

Don’t forget that you might need two versions of your assembly: one for the x86 version of WinDBG if your applications are 32 bit and one for the x64 version of WinDBG in the 64 bit case. If you want to be able to easily load your extension with the .load <myextension> command, copy it with Microsoft.Diagnostics.Runtime.dll (i.e. ClrMD assembly) to the winext subfolder of x64/x86 WinDBG folders:

Before being able to use any of its commands, you must load SOS with the well-known .loadby sos clr mantra. But this is not enough: you also have to run at least one SOS command. You are now ready to call any of your extension commands!

 

Next step…

The next episodes will bring you into the mysteries behind the dynamic keyword and show how to simplify the syntax needed to leverage ClrMD.

Post written by:

Christophe Nasarre

Staff Software Engineer, R&D.

Kevin Gosse

Senior Software Engineer

Twitter: KooKiz

 

The post ClrMD Part 5 – How to use ClrMD to extend SOS in WinDBG appeared first on Criteo Labs.

Making the right move – Relocating at Criteo!


When I (Imane, R&D Recruiter) started at Criteo, I was really impressed by the quality of the candidates Criteo was interviewing. I love my job because I have to find top notch candidates no matter where they are based and offering a relocation doesn’t just apply to Senior and Management roles. One of the examples we have is the relocation from Ireland to France of our SVP of Engineering, Diarmuid : MyCareer: Interview with Diarmuid Gill, SVP of Engineering for Criteo.

Recruiting international talent and working in a multicultural environment is both a challenge and what makes Criteo unique in Paris.

To manage relocations, the Talent Acquisition team works closely with our Global Mobility team. Let’s meet Cynthia, who is instrumental in making the relocation process a smooth one.

The Criteo Mobility team : Hajir Prum & Cynthia Callatin-Sarr

Tell us a bit about yourself.

I have been working at Criteo for 2.5 years. I am part of the Global Mobility Team [pictured here with team member Hajir], in charge of the EMEA region.

Over the years, the number of relocations has increased to the point where it was necessary to create a Global Mobility Team. The aim was, and still is, to define guidelines across regions to help with continuing business growth and provide the best relocation packages to support our Criteos.  Whether hired abroad or changing entities, we want Criteos to have the best relocation experience possible.

What is your job?

The main part of my job is to ensure that our international hires have a hassle-free relocation with the help of dedicated providers. I am also the “go-to-person” for any immigration matters.

Looking after the partner and family and making sure they feel included is crucial to a successful relocation. That’s why we offer an outplacement program providing assistance with school search, French classes and a host of other services. My motto is ” Happy Family, Happy Relocation!”.

When do you first make contact with a candidate?

Relocating is life-changing. That’s why I work with the recruiters to provide more information on our relocation support during the recruitment process. Before, or when the candidate receives an offer, they often have questions about specific topics that I am here to answer.

When it gets real, I assist with immigration, completing documents and building a personalized relocation itinerary.  From the recruiting stage up to finding a permanent flat, I am here to answer any question, explain, and reassure.

Beyond the relocation, the Global Mobility Team is here to ease the onboarding and settling in process.

What part do you like the most in your job?

Being a people person, I love the idea of being part of a family or employee’s new adventure and seeing them settling in over the months.

Relocating to a new country is exciting and scary at the same time. Being the go-to-person allows me to build a special relationship with relocating Criteo employees and their family.

 

They relocated : Luca, Mike, Ricardo & Chiara

Tom Aurelio, EVP of Human Resources at Criteo says “Criteo is proud of the international nature of its employee population; it’s something we both encourage and celebrate”.

Cynthia and Hajir have helped many candidates to join Criteo and settle in Paris. Here are some testimonials from former candidates who relocated to become Criteo employees.

Did you know Criteo before joining?

“Not really, I knew it via S2DS, and I decided to apply the following month.” Chiara, Data Scientist 

“I did not know the company. I did some exploration online before applying, which helped me to understand the basics.” Mike, Machine Learning Researcher

“Strangely, I discovered Criteo thanks to my AdBlocker.” Luca, Networks Lead

“I did not know Criteo before the Recruiter contacted me.” Ricardo, DevOps Engineer

During the recruitment process, when did you talk about the relocation?

“Yes, I talked about it with [my recruiter] when I came to Paris for my panel interview.” Chiara 

“I had a call with a Criteo Engineer who reassured me a lot. I was afraid that not speaking French was a blocker.” Ricardo 

“I discussed this with the recruiter, an Engineer, and Cynthia. […] My main concern was the flat search because […] I was relocating with my girlfriend.  I am happy with the help I got from Criteo and found a very nice apartment.” Luca  

Once you accepted to join Criteo, what was the process?

“My story was unique. I was living in Israel. I am American and my wife is Romanian. […] The process of relocating was delayed by 2 months because of administrative details. Criteo helped me all along and as soon as I had the papers, we moved to Paris. Right now, I am looking for a flat.” Mike 

“Compared to my previous moves, when I had to move on my own, the relocation process was very quick and smooth. I moved in less than one month. Everything was organized and set in place by Criteo. I didn’t have much stuff to move. I got help to open a bank account, apply for social security, and find a new apartment and French lessons.

Without knowing any French, relocating could have been a nightmare, but it all went really smoothly. Same for finding a new flat… I have been told the housing market in Paris was really difficult. Fortunately, the agent found apartments following [my] specifications, arranged for and took us to view them, and took care of all the bureaucracy related to the tenancy agreement, insurance, and bill setup.” Chiara 

“I was relocating from Brazil. My process was huge: getting the visa was long (almost 6 months). Criteo helped a lot with administrative aspects: I didn’t have any headache with papers! At the time, I had two offers: one from a German company and one from Criteo. The two were similar in terms of money. I chose Criteo because they made a difference with the relocation: removal of belongings, 3 months of temporary accommodation, a relocation allowance… Funny story: friends of mine experienced relocation with shared flats and one month of temporary accommodation!

Family wise, we had to find a school for our son. He onboarded with international classes and then was able to join regular classes where they speak French (within a public school). Criteo was not supposed to help but was essential.” Ricardo  

“The help of the housing agency was very precious. They gave me support with utility contracts (phone & electricity, which can be a nightmare if you don’t speak French). They also recommended locations in Paris according to what my girlfriend and I were looking for.” Luca 

What was the most stressful about relocating for you?

“The administrative issues regarding our situation were a bit stressful. I got some help from Cynthia and one of Criteo’s partners which helped me with the process.” Mike 

“I wanted to move from Brazil even if I had a good job there. I really found what I was looking for: a solid company.” Ricardo 

If you had to give one piece of advice to someone relocating, what would it be?

“Even if there is some stress with relocating, everything will work out in the end.” Mike 

“Check things in advance with the company and ask questions, as things work differently in different countries. I moved twice by myself and it was really stressful and expensive. After those experiences, I would probably never have accepted a move to France on my own without speaking any French. I had offers from other companies in the UK and Denmark, and nobody had a package similar to Criteo’s.” Chiara 

“Just do it!” Ricardo 

All our engineers agree that relocating is life changing. Apart from living in a different city / country, it gives you the power to drive your own career. If you are ready to move, don’t be afraid to start our recruitment process!

Caroline Chavier & Imane El Azlouk, R&D Recruiters

 

 

 

The post Making the right move – Relocating at Criteo! appeared first on Criteo Labs.

DBWeb Seminars: Deep Character-Level Click-Through Rate Prediction for Sponsored Search


Telecom Paristech’s DBWeb Seminars will feature Criteo Scientist Amin Mantrach on Wednesday (12:00 pm), who will discuss how to make use of the latest advancements in Natural Language Processing and Deep Neural Networks to improve the sponsored search user experience.

A free lunch will be available; more info and free registration here.

This work will also be presented as a full research paper in August during the SIGIR 2017 conference in Tokyo.

A pre-print of the paper can be found here.

DBWeb Seminars feature talks by members of the group and guests from other research groups, as well as discussions on topics of relevance to the DBWeb group.

Speaker Bio

Amin Mantrach is conducting research and developing deep learning based solutions for improving the sponsored search user experience for Criteo Brand Solutions.  His interests are in the domain of deep learning, NLP and information retrieval.

 

The post DBWeb Seminars: Deep Character-Level Click-Through Rate Prediction for Sponsored Search appeared first on Criteo Labs.

Criteo at ICSSP 2017


Processes are everywhere. Whatever field you are part of (military, healthcare, business or software), they precisely describe all the interactions amongst the different stakeholders by capturing the order of activities and the flow of information and data produced and consumed. Processes not only represent the means by which a company delivers its products and services, but also capture its governance procedures, regulations, policies, know-how, and best practices. They are the building blocks of a company’s profitability and success. However, processes must be agile, capturing and adapting to customers’ needs in order to continuously provide value.

UPMC campus, photo credit lemonde.fr

This year, ICSSP, an international academic conference on software and system processes, was organised by UPMC/LIP6 (Université Pierre & Marie Curie, Laboratoire d’Informatique Paris 6) in Paris for its 10th edition, from the 5th to the 7th of July. ICSSP believes there is considerable value in the reconciliation of business, systems engineering and software engineering processes.

Tom Zimmermann from Microsoft

Many prestigious speakers from over the world came to speak and network about research outcomes and industrial best-practices in process development.
This year, the keynote was presented by Dr. Tom Zimmermann, a senior researcher in the Software Engineering group at Microsoft Research from Microsoft USA.
Dr. Tom spoke about how at Microsoft they use a scientific approach based on data analytics to infer insights about the way their software teams work.
This information allows them not only to improve the understanding and productivity of individual software developers, but also to boost the productivity of their teams in building software.
Kuhrmann et al. presented a survey on hybrid software development approaches from 69 study participants showing that, in practice, a wide variety of development approaches are used and combined.
Tregubov et al. studied the impact of task switching and work interruptions on software development processes. While it’s quite common for a software engineer to switch between different tasks throughout a work day, its impact on productivity and cost hasn’t been deeply studied.
While it’s not possible to summarise all the conference talks and papers here, interested readers can have a look at the ICSSP’17 proceedings.

Criteo team spreading love at ICSSP’17

Criteo was a proud sponsor of this year’s conference. Generally speaking, Criteo is big on software processes, hence one of the main reasons behind our presence at ICSSP. Every quarter, tech leaders and EPMs define their team’s Objectives and Key Results (OKRs) to fit the global strategy and needs of the R&D.

Depending on team size and goals, teams at Criteo use Kanban or Scrum agile practices, and sometimes a mix of both. Teams that operate more as service providers generally choose Kanban, to stay flexible and reactive in case of unexpected needs or problems. They are more geared towards a continuous flow of work.

Otherwise, most teams apply Scrum practices with two-week sprints that we synchronise across the whole R&D. This way, teams can collaborate and synchronize on new feature requirements, but most of all, commit to what will be shipped over the next two weeks. Sprint tasks are the top priority. Teams do not religiously follow all the agile ceremonies: while retrospectives, team health checks, task planning, daily stand-ups, code reviews, and code deployment are quite generalized, not all teams do a demo every sprint. Dev leads and EPMs of the respective teams are always striving to improve our processes and believe that processes should bend to the team’s needs.

If you feel ready to start the Criteo journey, please have a look and apply here .

Post written by:


Yoann Laurent

Software Engineer R&D

 

The post Criteo at ICSSP 2017 appeared first on Criteo Labs.

ClrMD Part 6 – Manipulate memory structures like real objects


This sixth post of the ClrMD series details how to make object fields navigation simple with C# like syntax thanks to the dynamic infrastructure. The associated code is part of the DynaMD library and is available on GitHub and nuget.

Part 1: Bootstrap ClrMD to load a dump.

Part 2: Find duplicated strings with ClrMD heap traversing.

Part 3: List timers by following static fields links.

Part 4: Identify timers callback and other properties.

Part 5: Use ClrMD to extend SOS in WinDBG.

As we’ve seen in the previous articles of the series, exploring a complex data structure using ClrMD can quickly become tedious.

Let’s take a concrete example. Imagine we have those types declared:
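As a rough sketch (the exact sample code lives in the DynaMD repository; Width and Height are illustrative guesses, only Value, Name, Child, Description and Size are named in this series), the types could look like this:

    public class Sample
    {
        public int Value;
        public string Name { get; set; }   // auto-property: the compiler generates a <Name>k__BackingField field
        public Description Description;    // a struct, embedded directly inside the object
        public Sample Child;               // a reference to another Sample
    }

    public struct Description
    {
        public int Id;
        public Size Size;                  // a struct nested inside another struct
    }

    public struct Size
    {
        public int Width;                  // illustrative field names
        public int Height;
    }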

 

Given the address of the Sample object in the memory dump, even with the GetFieldValue helper method to make it simpler, the code to navigate these recursive data types is still… verbose:
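For instance, reading Child.Child.Value with raw ClrMD calls could look like the following sketch (heap is a ClrHeap and address is the address of a Sample instance):

    ClrType sampleType = heap.GetObjectType(address);

    // Each GetValue call on a reference field returns the address of the referenced object as a ulong
    ulong child = (ulong)sampleType.GetFieldByName("Child").GetValue(address);
    ulong grandChild = (ulong)sampleType.GetFieldByName("Child").GetValue(child);

    // ...and finally the marshaled primitive value
    int value = (int)sampleType.GetFieldByName("Value").GetValue(grandChild);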

 

And now, how to get the value of the Name property?

Same question for the inner Child fields or deeper Size field of its Description?

Wouldn’t that be great if we could navigate just like through real strongly typed instances? In short, to be able to write:
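Something along these lines (a sketch of the target syntax):

    dynamic proxy = heap.GetProxy(address);

    int value = proxy.Child.Child.Value;
    string name = proxy.Name;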

Instead of:
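That is, the same kind of GetFieldByName chain as in the sketch above:

    var type = heap.GetObjectType(address);
    var child = (ulong)type.GetFieldByName("Child").GetValue(address);
    var grandChild = (ulong)type.GetFieldByName("Child").GetValue(child);
    var value = (int)type.GetFieldByName("Value").GetValue(grandChild);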

 

The first issue is: what is the GetProxy method going to return? Since we don’t know at compilation time the properties of the object the code is going to manipulate, we need a way to support some kind of late-binding. Fortunately, this scenario is supported in C# through the usage of the dynamic keyword. As you will see in the rest of this post, this is not only a keyword but also an extensible mechanism that perfectly fits our need to define fields at runtime instead of compile time.

 

We start by creating a class inheriting from System.Dynamic.DynamicObject.

 

This base class provides all the facilities needed for late-binding:

As you will see, only a few of these virtual methods need to be overridden to support our scenario.

To construct our proxy, we need two parameters: the ClrMD ClrHeap object, that allows us to browse the objects in the memory, and the address of the object we want to impersonate.
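A minimal sketch of the class (the real implementation is in the DynaMD repository):

    using System.Dynamic;
    using Microsoft.Diagnostics.Runtime;

    public class DynamicProxy : DynamicObject
    {
        private readonly ClrHeap _heap;
        private readonly ulong _address;

        public DynamicProxy(ClrHeap heap, ulong address)
        {
            _heap = heap;
            _address = address;
        }
    }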

 

We also provide an extension method for convenience:
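Something like this (the ClrHeapExtensions class name is ours; GetProxy is the method discussed above):

    public static class ClrHeapExtensions
    {
        public static dynamic GetProxy(this ClrHeap heap, ulong address)
        {
            return new DynamicProxy(heap, address);
        }
    }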

 

The next step is to override the virtual TryGetMember method, inherited from DynamicObject. It is automatically invoked whenever somebody tries to access any member of the dynamic object, including its fields.

 

The Name property of the binder parameter provides the name of the accessed member and we are supposed to return the corresponding proxy object as the out result parameter.

We’re going to need the type of the object. For convenience, we store it in a property:

 

Using the binder.Name property containing the name of the field we’re trying to access, we retrieve the ClrMD field description:

 

From there, we get the value marshalled by ClrMD and assign it to the result out parameter:

 

Finally, we signal that we managed to bind the invoked member:
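Putting those steps together, a simplified TryGetMember could look like this sketch (the Type property lazily caches the ClrMD type of the impersonated object; the real code in DynaMD handles more cases):

    private ClrType _type;

    private ClrType Type => _type ?? (_type = _heap.GetObjectType(_address));

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // binder.Name is the name of the member being accessed
        ClrInstanceField field = Type.GetFieldByName(binder.Name);

        // ClrMD marshals primitive values (and strings) for us
        result = field.GetValue(_address);

        // Signal that we managed to bind the member
        return true;
    }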

 

This is just a handful of lines of code, but it’s enough for the simple cases where field values are primitive types. This covers the “Value” field for our Sample type. For the auto-property “Name”, that’s trickier, because the name of the underlying field has characters that are forbidden in C#: “<Name>k__BackingField”. If we write this, it won’t compile:

 

We can handle this case by guessing the name of the compiler-generated field, then accessing it:
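One possible sketch, relying on the standard naming scheme the compiler uses for auto-property backing fields:

    // Inside TryGetMember:
    ClrInstanceField field = Type.GetFieldByName(binder.Name)
        ?? Type.GetFieldByName($"<{binder.Name}>k__BackingField");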

 

Thanks to this trick, we can write:

 

Great! The next challenge is to transparently manipulate the “Child” field as a reference to another “Sample” object. To achieve this goal, the field could simply return another DynamicProxy object that we can manipulate the same way as its parent.

First, we need a helper to find out whether a value is a reference or not:
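A possible sketch (IsObjectReference and ElementType are exposed by ClrMD on the field description):

    private static bool IsReference(object result, ClrField field)
    {
        // Object references come back from GetValue as a ulong address,
        // except strings, which ClrMD marshals into an actual .NET string
        return result is ulong
            && field.IsObjectReference
            && field.ElementType != ClrElementType.String;
    }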

 

We treat string as a special case, because ClrMD gives us the marshaled string rather than a reference, as it does for all other reference types. That’s how we were able to retrieve the value of the Name field previously.

Now we call the helper and return a new proxy whenever we’re dealing with a reference:
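Back in TryGetMember, right after reading the value (sketch):

    result = field.GetValue(_address);

    if (IsReference(result, field))
    {
        // Wrap the referenced object in a new proxy so the navigation can continue
        result = new DynamicProxy(_heap, (ulong)result);
    }

    return true;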

 

We can now write:
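For example, with the same Sample object as before:

    dynamic proxy = heap.GetProxy(address);

    int childValue = proxy.Child.Child.Value;
    string childName = proxy.Child.Name;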

 

That’s it for accessing a referenced object allocated on the heap. However, this won’t work for accessing an embedded struct such as proxy.Description.Id. We’ll see in the next part how to handle this specific case.

Post written by:


 

 

 

 

 

Christophe Nasarre

Staff Software Engineer, R&D.

 

 

 

 

 

Kevin Gosse

Staff Software Engineer, R&D.

Twitter: KooKiz

 

The post ClrMD Part 6 – Manipulate memory structures like real objects appeared first on Criteo Labs.

Muleine Lim’s Seven-Year Ride at Criteo


Tell us about how you ended up at Criteo.

I joined Criteo as the fifth member of the new Product team. I thought the idea of leveraging data to make ads more relevant was simple, yet brilliant. I knew no other company doing this at that time. Before Criteo, I spent three years working as a Software Engineer, mainly at vente-privee.com, the French leader in online flash sales.

How has Criteo changed since you started seven years ago?

When I joined Criteo we were in a small office in the Art district of Paris. Everyone pretty much knew each other. We outgrew our offices and moved twice within two years, before settling down at 32Blanche in the 9th arrondissement. The number of employees has multiplied by 25 since I joined. And so did the scale and complexity of our work. We evolved as an organisation and we built a team that delivers better value and performance. We even went through an IPO and that was just the beginning.

Criteo is now a global player, impacting more than half of the Internet population. We have Petabytes of data and the biggest Hadoop cluster in Europe. Today we are more structured, yet still agile. We continuously challenge ourselves and adapt our systems.

What makes working at Criteo different from your other experiences?

The people we hire are not only extremely talented in their area of expertise, but also very collaborative and friendly. We have great minds here and empowerment is key at Criteo. If you experience any type of frustration, you are in the best position to work on an improvement. “Better to ask for forgiveness than permission” is an unofficial motto at Criteo.

How would you describe the Criteo culture?

Failure is the secret ingredient to achieving success. We are not afraid to try and learn from our mistakes.

What has pleasantly surprised you the most about your team since joining Criteo?

Joining Criteo is like getting a ticket for a great ride! I have learned a lot and I still have plenty of personal challenges. But what I enjoy most is helping others grow. The best reward in coaching people is to see them develop, have a strong impact on the company and become star performers.

The Criteo engineering team is growing across Paris, Palo Alto, and Ann Arbor. We’d love for you to join us.

The post Muleine Lim’s Seven-Year Ride at Criteo appeared first on Criteo Labs.

Reco in the air


It’s been almost a year since we met in Boston in 2016, so we thought now might be a good time to give you some fresh news about the team as well as matter for discussions when you come see us in Como in August!

In case you’re new to Product Recommendation at Criteo, here is a post that will help you get the gist of it.

Reco in the air

First of all, we are excited to announce that two of our papers have been accepted at the RecSys Deep Learning workshop this year! Congrats to Elena Smirnova, Thomas Nedelec and Flavian Vasile for their hard work!

This work is making its way to Production and will be AB tested soon!

We have also been very active leading the French chapter of RecSys meetups. The latest edition was in June and featured presentations by Simon Lefebvre (Antvoice), Olivier Grisel (INRIA) and Robbert van Der Pluijm (Bibblio Labs). Our next session will be in September and will be hosted by tinyClues. Stay tuned!

Recently our very own Simon Dolle went to Berlin Buzzword to give a presentation of our work on Word2Vec applied to product recommendation.

We will also be around at the 2nd RecSys meetup in London where Olivier Koch will be presenting our latest work on Reco at scale.

Finally, our public dataset on large-scale counterfactual learning is making its way out, make sure to take a look at it.

A few challenges we are looking at

After a couple years of intense work on Word2Vec applied to product recommendation, our first version has finally made its way to production. But this is just the beginning. New versions are coming, adding content metadata and more scalability.

Here are a few other hard problems we are tackling:

  • Advanced user representation: can we leverage contextual sequence modeling to build more personalized recommendations at scale?
  • Causality and attribution: moving beyond last-click attribution, can we make recommendations better by building a better understanding of the causality between displays, clicks and sales?
  • Vectorized reco to the next level: can we make use of deep learning to build better representations of our users and products?
  • Catalog enrichment: can we build a generic catalog representation that would allow us to make more sense of the events we see online?
  • New machine learning models: RNNs and LSTMs are hot these days. Can they really make a difference at scale for billions of products and users?
  • Evaluating recommender systems over time: AB testing is great but costly. How can we make the best of our AB test slots? How can we evaluate the long-term effects of a new model in production?

These topics are being addressed in a deeply collaborative way by our machine learning engineers and research scientists.  We use state-of-the-art open technologies (Tensorflow, Spark, Hadoop, python notebooks) and share back with the community every time we can.

Join us!

If you feel excited by these challenges, the Reco team at Criteo offers vast opportunities for machine learning engineers and scientists. The team is growing. We will welcome two new hires this summer and have more open positions in Paris and Palo Alto!

Make sure to apply if you are interested! Even better, come talk to us at RecSys in Como!

 

The Criteo Reco Team, (left to right): Aurel Ghioca, Olivier Koch, Flavian Vasile, Lowik Chanussot, Amine Benhalloum, Alexandre Abraham, Vincent Latrouite, Dmitry Otroshchenko, JP Lam Yee Mui

Post written by:


Olivier Koch

Staff Dev Lead R&D, Engine

 

The post Reco in the air appeared first on Criteo Labs.


I am a recruiter and I love recommender systems


As soon as I knew I would join Criteo as an R&D recruiter, I started to brush up my Machine Learning skills. Quickly, I bumped into the big names: Andrew Ng, Yoshua Bengio, Yann LeCun, Geoffrey Hinton, etc. Knowing those names is important as, nowadays, researchers have an impactful aura on other researchers but also on engineers belonging to the Machine Learning community. As a recruiter, I consider that I should be part of the same world as my candidates. I want to understand their vocabulary and share the same references. My job goes beyond copy/pasting buzzwords and geek references in job titles. Fortunately for me, I had the chance to fall in love with one particular application of Machine Learning: recommender systems.

At Criteo, we are good at recommending ads to internet users and making the Internet free. In real life, recommending something to someone is complex and puts your reputation on the line. A good recommendation will end up making you trustworthy. A bad one, well, you will not be asked to recommend something twice! Formalizing this habit of “word-of-mouth” into a mathematical formula and code would have been a challenge I would have liked to tackle if I had been an Engineer!

The recommendation team: I almost recruited the whole team!

I tried to get familiar with machine learning and wanted to understand the spirit of recommender systems. Hence, I ended up reading Kim Falk‘s book, Practical Recommender Systems. In his book, Kim highlights Xerox and GroupLens’ work on what appears to be “the foundation […] of what we know as recommendations today“. One of the first recommender systems was called Tapestry and enabled Xerox PARC (Palo Alto Research Center) to deal with the volumes of emails each user was receiving in 1992. In a very similar manner, GroupLens wanted to “collaborative-filter” the information load.

These were the recommender systems that existed before people even knew there were recommender systems. Nowadays, users know recommender systems thanks to very popular service providers such as Netflix, Amazon or Youtube. Everybody binge watches Netflix’s creations. When you have been watching intensively “Stranger Things” and don’t know what to go for next, no worries! Based on other users’ similar tastes, you will get some new suggestions for other must-see TV shows.

Amazon has always been a reference in terms of user-user and item-item based recommender systems. Thanks to the activity of users, Amazon is directly suggesting products that have been bought/liked/viewed by people sharing your tastes. This is an efficient way to increase your number of sales. Funny story, the other day, I went to a book shop. The bookseller naturally talked about another author’s work because she guessed that I would be sensitive to the topic based on the book I was buying. She ended up being right!

My favorite services based on recommender systems are definitely Deezer, Pandora and Spotify. Whether you use one of them or all three, their ability to recommend new music based on your personal tastes makes it easy to discover new tracks and artists. How easy? No need to keep actively looking for the new track released by your favorite artist; it’s already being suggested to you!

Source: https://www.rinapiccolo.com/piccolo-cartoons/

At Criteo, our engineers face challenges on topics such as evaluating recommender systems over time, ensuring they bring intellectual and material value to the user, recommending without historical data, and using deep learning to build better representations of users and products.

However, the concept that I personally find most fascinating was highlighted by Olivier Grisel during the last edition of the RecSysFR meetup: the need to address ethical considerations for recommender systems (from 16’05 in the video). Indeed, based on the data they use, recommender systems can influence the behavior of users or add a certain bias. The need for transparency is becoming decisive when talking about recommender systems. This question is so critical that the 2017 RecSys conference features a whole tutorial entitled “Privacy for Recommender Systems”, led by Bart Knijnenburg (Clemson University) and Shlomo Berkovsky (CSIRO).

As a recruiter, I often feel that I am a walking recommender system myself. Looking for a Data Scientist? I can point out on the spot where to headhunt! Looking for a promising Machine Learning researcher? I already spotted one during the last conference I attended. A recruiter always starts by collecting the right information features in his mind to make sure the candidates are willing to engage in a recruitment process. We are matching a candidate’s ambitions to a Team Lead’s needs. But sometimes, things can be difficult to solve!

Hence, if you are as fascinated as I am by those topics and happen to be an engineer or a researcher willing to work on challenging problems on Criteo’s recommender systems, feel free to contact me: c.chavier@criteo.com. Otherwise, if you are in Paris, feel free to attend any upcoming RecSysFR meetup (created by engineers from Deezer, Rakuten and Criteo).

Post written by:


Caroline Chavier – Senior R&D Recruiter

Twitter: MrsCaroline_C

 

The post I am a recruiter and I love recommender systems appeared first on Criteo Labs.

ClrMD Part 7 – Manipulate nested structs using dynamic


In the previous post of the ClrMD series, we’ve seen how to use dynamic to manipulate objects from a memory dump the same way as you would with actual objects. However, the code we wrote was limited to class instances. This time, we’re going to see how to extend it to structs. The associated code is part of the DynaMD library and is available on GitHub and nuget.

Part 1: Bootstrap ClrMD to load a dump.

Part 2: Find duplicated strings with ClrMD heap traversing.

Part 3: List timers by following static fields links.

Part 4: Identify timers callback and other properties.

Part 5: Use ClrMD to extend SOS in WinDBG.

Part 6: Manipulate memory structures like real objects.

 

Let’s start with a reminder of the object we’re manipulating via our proxy:

Accessing the value of proxy.Description.Id is a special case. Why? Because Description is a struct, and is therefore embedded directly inside of Sample, outside of the responsibility of the managed heap. This scenario isn’t directly supported by ClrMD, and calling GetValue on the Description field will return null instead of the address of the struct. We need to compute this address ourselves.

What is the layout of those objects in memory? To find out, we can create a Sample object and then check how the memory is structured within WinDBG:

 

 

The first address (4 bytes because the dump was taken on a 32-bit process) points to the “method table”; some metadata used internally by the CLR to identify the strong type corresponding to this instance in the managed heap. Next, we get the “1” stored by the Value field, followed by a pointer to the “Test” string stored by the Name field. The field values of the Description structure (in dark blue) are then embedded directly inside of the Sample object. And inside of that structure, we find the values of the fields of the Size structure (in light blue). Finally, the reference to the Child field is stored last in the memory of the Sample object.
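In other words, for the 32-bit case the layout described above boils down to something like this (offsets and the Width/Height names are illustrative):

    // Illustrative layout of a Sample instance in a 32-bit process:
    //
    //   +0x00  method table pointer          (metadata used by the CLR)
    //   +0x04  Value              = 1
    //   +0x08  Name               -> "Test"  (reference to a string)
    //   +0x0C  Description.Id                 \
    //   +0x10  Description.Size.Width          | structs embedded inline,
    //   +0x14  Description.Size.Height        /  no method table of their own
    //   +0x18  Child              -> reference to another Sample (or null)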

To sum it up, to get the address of the Description field, we need to take the address of the Sample instance, add 4 bytes for the method table (8 bytes for 64-bit memory dumps), then add the offset of the field (which counts the size of the previous fields and the padding if any):
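A sketch of that computation (GetFieldAddress is our illustrative name; ClrHeap.PointerSize gives the 4/8 bytes of the method table pointer and ClrInstanceField.Offset the offset of the field):

    private ulong GetFieldAddress(ClrInstanceField field)
    {
        // Skip the method table pointer of the parent object, then add the field offset
        return _address + (ulong)_heap.PointerSize + (ulong)field.Offset;
    }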

We call this method from TryGetMember when we have a struct that isn’t a primitive type. ClrMD gives that information thanks to the HasSimpleValue property of ClrField:
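The check could look like this (sketch):

    // Inside TryGetMember:
    if (field.IsValueClass && !field.HasSimpleValue)
    {
        // Embedded struct: ClrMD won't hand us its address, so we compute it ourselves
        result = new DynamicProxy(_heap, GetFieldAddress(field));
        return true;
    }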

There’s another subtlety though. As we’ve seen in the layout, the Description struct, embedded inside of the Sample class, doesn’t have a method table of its own. The consequence is that we can’t find out its type by using ClrHeap.GetObjectType. As a workaround, we add a constructor allowing us to manually set the underlying ClrMD type of the object impersonated by the proxy:
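For instance (sketch; _type is the field backing the Type property introduced in part 6):

    public DynamicProxy(ClrHeap heap, ulong address, ClrType type)
        : this(heap, address)
    {
        _type = type;
    }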

This constructor is called when we still have the type information (taken from the ClrField):
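The call site from the previous snippet then becomes (sketch):

    // Inside TryGetMember, for embedded structs:
    result = new DynamicProxy(_heap, GetFieldAddress(field), field.Type);
    return true;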

Now we’re able to read the value of a field of a nested struct:
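For instance:

    int id = proxy.Description.Id;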

Are we done now? Almost. There is one last case to handle: the Size struct that is nested inside of Description, itself a struct nested inside of a Sample instance. How is that an issue? If we use the same code to compute the address of the Size struct, then we will be adding the 4/8 bytes for the method table. Except that, as we’ve seen, the nested Size struct doesn’t have a method table! We need to add a condition to the code to know whether we are inside of a nested struct and handle this corner case:
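A sketch of the updated address computation (the _interior flag is described just below):

    private ulong GetFieldAddress(ClrInstanceField field)
    {
        // A struct nested inside another struct has no method table pointer to skip
        ulong header = _interior ? 0 : (ulong)_heap.PointerSize;
        return _address + header + (ulong)field.Offset;
    }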

The Boolean flag _interior is set in the second constructor that accepts a ClrType:
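In other words, the constructor from above simply gains the flag (sketch):

    private readonly bool _interior;

    public DynamicProxy(ClrHeap heap, ulong address, ClrType type)
        : this(heap, address)
    {
        _type = type;
        _interior = true;   // this proxy impersonates a struct embedded in another object
    }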

We’re finally able to read the value of the nested-nested struct:
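For example (Height being one of the illustrative Size fields):

    int height = proxy.Description.Size.Height;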

Next time, we’ll see how to use the same mechanisms to manipulate arrays in a convenient way.

Post written by:


 

 

 

 

 

Christophe Nasarre

Staff Software Engineer, R&D.

 

 

 

 

 

Kevin Gosse

Staff Software Engineer, R&D.

Twitter: KooKiz

 

The post ClrMD Part 7 – Manipulate nested structs using dynamic appeared first on Criteo Labs.

ICML 2017 highlights


The 34th International Conference on Machine Learning (ICML) took place in Sydney, Australia on August 6-11, 2017. ICML is one of the most prestigious conferences in machine learning, covering a wide range of topics from both practical and theoretical viewpoints. This year it brought together researchers and practitioners from machine learning for 434 talks in 9 parallel tracks, 9 tutorial sessions, and 22 workshops. We were there!

Machine learning is about learning from historical data. There are three distinct ways that machine learning systems can learn: supervised learning, unsupervised learning, and reinforcement learning. Recently, there has been tremendous success in machine learning due to the end-to-end training capabilities of deep neural networks – so-called deep learning – which learn both prediction representations and parameters at the same time. Deep learning architectures were originally developed for supervised learning problems such as classification, but have recently been extended to a wide range of problems, from regression in supervised learning to unsupervised learning domains such as generative methods (e.g. GANs), as well as reinforcement learning (deep RL).

As expected, deep learning was one of the hottest topics this year, along with continuous optimization, reinforcement learning, GANs, and online learning. Deep learning continues to be a very active research area – over 20% of all sessions were devoted to it. Here is a selection of our observations, topics and papers that captured our attention.

Deep learning 

The most widely discussed new challenges for deep learning were transfer learning, attention, and memory. There was a heavy emphasis on understanding how and why deep learning works. Several papers and workshops tried to address theoretical aspects, in order to enhance understanding and interpret results, which is crucial for many real-world applications. For example, there were dedicated workshops on visualization for deep learning and interpretability in machine learning, along with many results on the interpretability of predictive models, on methodologies to interpret black-box machine learning models, and even on interpretable machine learning algorithms (e.g., architectures designed for interpretability).

The theory still seems to be far from the point where it can explain the effectiveness of current deep learning solutions. On this subject, the paper Sharp Minima Can Generalize For Deep Nets explains why the widely held hypothesis – that flatness of the local minimum of the objective function found by stochastic gradient descent results in good generalization – does not necessarily hold for deep nets, and that sharp minima can also generalize well.

At Criteo, we use deep learning for product recommendation and user selection.

Read more about our presence at ICML on our research blog. 

The post ICML 2017 highlights appeared first on Criteo Labs.

Criteo at EuroPython 2017


For the 3rd year, Criteo sent a delegation to the EuroPython conference, and for the 2nd time, Criteo was a proud sponsor of the event.

Europython is the biggest Python conference in Europe. This year, the conference was hosted in Rimini, Italy on the Adriatic Riviera from the 9th to the 16th of July. The conference lasted more than one week and was composed of workshops, presentations (200+ this year in 7 different tracks), poster sessions, sprints and lightning talks.

Why was Criteo present?

We use Python in our day to day work. For instance, the system in charge of managing our infrastructure inventory is developed in Python and uses the Django framework. Our data engineers and machine learning researchers are using Python for crunching numbers and inventing new approaches to support our business.

As part of the R&D Culture @ Criteo, we also consider it important that our engineers can attend big conferences to keep themselves up-to-date on the technologies they use.

Finally, we are constantly looking for new talents who’d like to tackle exciting problems at Criteo’s scale, and such a conference is a great occasion to meet talented developers. So, if you’re interested in working with Python (or any of the other languages used at Criteo) in an international, technology-driven company, or just curious, drop us an email at r&drecruitment@criteo.com or have a look at our current opportunities. 🙂

Delegation’s selected talks

Like last year, with so many talks to listen to, it could be hard to decide where to start. To help you, each Criteo engineer who attended the conference picked one talk and explained in a few words why they liked it.

A Python for Future Generations – A. Ronacher (videoslides)

Armin, the creator of Flask, the famous web framework, gave a keynote about why we should be aware of the mistakes that have been made by the Python community in the past, what we could do to fix them and how our day-to-day work should pave the way to a better future of Python. His talk covers hot topics from packaging (a world of monkey-patches) to extension modules (please, start using cffi!), without forgetting unicode (utf-8 everywhere!). He notably denounces the fact that the Python language itself has no documented standard, so people tend to take the CPython implementation as a de facto standard, whose quirks are replicated e.g. by pypy for the sake of compatibility, blocking substantial improvements. (Hugues Lerebours)

There should be one obvious way to bring python into production – S. Neubauer (videoslides)

The Zen of Python (PEP20) states that there should be one obvious way of doing things. Sebastian starts his talk by exposing the many different ways of doing packaging and deployment today: virtualenv, pip, conda, OS packages, nix, docker and probably many others that we are not even aware of. All of them have pros and cons, and depending on your case or your company, one approach is preferred over another. So, we are far from the Zen of Python recommendation. But why are we there? For historical reasons. Why isn’t there one obvious way? Maybe because it is still waiting to be built! Sebastian’s talk continues with a call for change: we can do something about this and make Python easy to put in production! (Let’s just pay attention so that it won’t become yet another standard meant to unify all the existing ones, as imagined in https://xkcd.com/927/.) Sebastian then gives some interesting directions: containers are probably a good solution, automation is a must, and DevOps is the new working mode. Once this problem is solved, serverless could be something for Python. (Rémi Guillard)

Developing elegant workflows in Python code with Apache Airflow – M. Karzynski (videoslides)

Michal gives a good overview of a tool for creating elegant workflows. Every time a new batch of data comes in, you start a set of tasks. Some tasks can run in parallel, some must run in a sequence, perhaps on a number of different machines. That’s a workflow. Apache Airflow is an open-source Python tool for orchestrating data processing pipelines. In each workflow, tasks are arranged into a directed acyclic graph (DAG), and the shape of this graph decides the overall logic of the workflow. Through an example, Michal presents the key concepts of the tool. (Basha Mougamadou)

Inside Airbnb: Visualizing data that includes geographic locations – G. Ballester (videoslides)

In his presentation, Guillem presented multiple ways to build and test valuable geographical data visualizations. On top of the multiple tools to manipulate and prepare data sets, the Python language also provides great libraries for creating data visualizations: Holoview / Geoview, Bokeh and Shapefiles to name a few. The general approach is that Python helps developers prepare a data visualization, and the visualization can then be ported to a more “web friendly” framework such as any Javascript framework. (Bastien Vallet)

How Facebook uses Python to build (and operate) datacenters at scale – N. Đipanov (video)

Nikola explained that when dealing with a huge number of servers and network equipment, you have to automate as much as possible to avoid human errors. For example, Facebook’s cabling management can no longer be handled manually, since they deal with between 10 and 20 million ports. If you want to discover how Python code can be used to help datacenters be more efficient, you have to watch this one. (Djothi Carpentier)

If Ethics is not None – K. Jarmul (videoslides)

With her walk through computing history with a focus on ethical questions, Katharine succeeded in captivating the audience (and this, without a single line of code! :)). She looks at how ethical questions kept popping up and how they were discussed at the time. She also raises the fundamental question of the personal responsibility of the developer… a question that each of us should keep in mind when taking decisions. (Renaud Bauvin)

Not really a Delegation-selected talk but this year, 2 Criteos had the chance of presenting an introductory talk on introspection.

Inspect (or gadget?) – Hugues Lerebours / Renaud Bauvin (videoslides/notebook)

Introspection is often seen as a bad coding practice and as such a gadget. Nevertheless, the Python Standard Library provides different tools (among them the ‘inspect’ library) to easily identify a generator, recover the source code of a function or get a function signature. We propose to spend 1/2 hour to dig into what introspection has to offer to developers, to see what tools are available, what you can get out of them and some useful use cases that we met in our practice at Criteo.

The Criteo delegation is taking this opportunity to thank the organizers (EuroPython is a purely volunteer-based event), all the speakers and, in general, all the attendees for making this 2017 edition a success and a great moment to feel the Python community’s energy.

See you soon!

Post written by:


Basha Mougamadou, Renaud Bauvin, Bastien Vallet, Hugues Lerebours, Rémi Guillard, Djothi Carpentier.

The post Criteo at EuroPython 2017 appeared first on Criteo Labs.

Extending the new WinDbg, Part 1 – Buttons and commands


This article is the first of a three-part series on how to extend the new Windows 10 WinDbg app in order to make your .NET debugging easier and faster. In this first part of the series, you are going to see how to add a button to the ribbon and react to commands. For that, we will create a button to load the SOS.dll extension for us:

In the second part, we will learn how to add custom panels to the UI, and use them to open multiple command windows with history:

In the third part, an interpreter will be embedded to let you type and run C# code. That code will have access to ClrMD and DynaMD to help you analyze your app and CLR data structures, as demonstrated in our previous series:

Be advised that the new WinDbg is still in preview, and the extensibility points aren’t documented. What is explained in those articles could change at any time, without any warning from Microsoft.

Also, please note that those articles assume preliminary knowledge of WPF and MEF (Managed Extensibility Framework).

 

Discovering the extension point

The old WinDbg is that tough child you love to hate. It is the one application you cannot remove from your debugging toolbox, the one you turn to in almost every postmortem debugging session. Yet, it is the absolute antithesis of enjoyable and efficient UX, a UI inherited straight from the early 90s that doesn’t answer even your basic needs without a certain dose of pain.

Of course, you can add new commands that can save you a lot of time, but it’s not really possible to tweak the UI.

In that light, you can imagine how excited we were when the new WinDbg was revealed.

After playing with the new UI a little bit, I was curious about what extension capabilities it offered. The first step was to find the binaries of the application. You could list the installed app packages yourself or simply start the new WinDbg and find it in Process Explorer to get the image location.

Launching a decompiler and spelunking in the C:\Program Files\WindowsApps\Microsoft.WinDbg_1.0.11.0_x86__8wekyb3d8bbwe folder, I started by confirming that the new UI is written in WPF. It means that, at the very least, it should be possible to modify the IL to inject custom code.


But, searching for an easier way, I stumbled across a subfolder named “Extensions”:


The folder already contains what looks like an extension, “DbgX.External.dll”. After decompilation, it seems to be a UI element injected with MEF:

Digging further, it seems that WinDbg uses MEF to automatically load any DLL put in the Extensions subfolder, as well as in the %LOCALAPPDATA%\DBG\UIExtensions folder.


Score! From there, it should be possible to write and inject our own extension.

 

Setting up the project

As mentioned in the introduction, we’re going to create a helper to load SOS with a click from the mouse.

First, let’s start by creating a new Class Library project. To save time, don’t hesitate to set a post-build action to copy the project output to the %LOCALAPPDATA%\DBG\UIExtensions folder. This way, you can quickly test your modifications just by compiling your project and launching WinDbg. It would also work with the Extensions subfolder of WinDbg, but I don’t recommend it since the store application folder has very restrictive permissions and is a pain to deal with. Maybe a forthcoming version of WinDbg will provide a dedicated UI to install our extensions; who knows?…

Next, add a reference to DbgX.Interfaces.dll, DbgX.Util.dll, and Fluent.dll (all of them are in the WinDbg binaries folder). Just make sure to set “Copy Local” to “false” for each of those references, so you won’t inadvertently copy them to the extensions folder of WinDbg. Add a SosLoaderViewModel class that implements the IDbgRibbonTabGroupExtension interface. This interface has a single property, “Controls”, that will be called by WinDbg to know what UI controls to add to the ribbon. Just return an empty list for now. Note that you will also need to add a reference to the PresentationFramework and System.Xaml WPF assemblies and target a .NET Framework version higher than or equal to 4.6.1 to make it compile.

Also, decorate the class with the RibbonTabGroupExtensionMetadata attribute. It has three parameters, extendedTabName, extendedGroupName, and order, that indicate where in the ribbon the UI element should be inserted. For now we’ll stick to the Home tab (“HomeRibbonTab”), in the Help group (“Help”), with no particular order (0).

Last but not least, add a reference to MEF (System.ComponentModel.Composition) and use the Export attribute to declare that the class should be discovered as a IDbgRibbonTabGroupExtension:
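At this point, a sketch of the view model could look like this (the exact member types of IDbgRibbonTabGroupExtension come from the undocumented DbgX.Interfaces assembly, so the IEnumerable<UIElement> signature and namespace are assumptions):

    using System.Collections.Generic;
    using System.ComponentModel.Composition;
    using System.Windows;
    using DbgX.Interfaces;

    [Export(typeof(IDbgRibbonTabGroupExtension))]
    [RibbonTabGroupExtensionMetadata("HomeRibbonTab", "Help", 0)]
    public class SosLoaderViewModel : IDbgRibbonTabGroupExtension
    {
        // Called by WinDbg to know which UI controls to add to the ribbon; empty for now
        public IEnumerable<UIElement> Controls
        {
            get { yield break; }
        }
    }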

So far, since we left the Controls property empty, nothing is displayed in WinDbg. This is hardly exciting, so let’s build a bit of UI.

Add a new WPF user control to the project (make sure not to pick the Winform user control), and name it SosLoaderButton:


In the code-behind file, remove the “UserControl” inheritance (because we’re going to change the base class in the XAML), and change the constructor to accept a SosLoaderViewModel. Assign it as the DataContext to be able to use it for databinding.
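The code-behind then boils down to this sketch:

    public partial class SosLoaderButton
    {
        public SosLoaderButton(SosLoaderViewModel viewModel)
        {
            InitializeComponent();

            // Used by the XAML bindings (Header, IsChecked, Command, ...)
            DataContext = viewModel;
        }
    }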

In the XAML, change the root node to “fluent:ToggleButton” and declare the fluent namespace as “urn:fluent-ribbon” to replace the local namespace. Set the Header property to “SOS”. You can also associate an icon using the “LargeIcon” property.

Now that we have a button, we just need to expose it from the viewmodel (of course, if you want to do true MVVM you may want to separate the IDbgRibbonTabGroupExtension from the actual viewmodel, but that’s not really the point here).
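For example:

    public IEnumerable<UIElement> Controls
    {
        get { yield return new SosLoaderButton(this); }
    }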

Now, if you start WinDbg (assuming you’ve properly copied your dll to the extension folder), you should see the new button appear:


Great! But it’s still a bit boring, since nothing happens when clicking on the button.

 

Executing commands

To load the SOS.dll WinDbg extension, we just need to ask WinDbg to execute the command “.loadby sos clr” when the button is clicked.

First, let’s bind the IsChecked and Command properties of our button, as we’re going to need them later:

Now we need a way to ask WinDbg to execute a command. For that, we need to import the IDbgConsole interface from the DbgX.Interfaces.Services namespace in the viewmodel:
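Something like this, letting MEF inject the implementation (a property import is one option; the property name is ours):

    [Import]
    public IDbgConsole DbgConsole { get; set; }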

To create the WPF command bound to the button, we’re going to use the AsyncDelegateCommand helper provided in DbgX.Util. But any WPF command would do the trick, so feel free to use your own implementation. Inside, we call the ExecuteCommandAsync method of the IDbgConsole to load SOS:
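A sketch of the command and its handler (AsyncDelegateCommand and ExecuteCommandAsync come from the DbgX assemblies; their exact signatures are assumed here):

    public AsyncDelegateCommand LoadSosCommand { get; }

    public SosLoaderViewModel()
    {
        LoadSosCommand = new AsyncDelegateCommand(LoadSos);
    }

    private Task LoadSos()
    {
        // Ask the debugging engine to load SOS for the current runtime
        return DbgConsole.ExecuteCommandAsync(".loadby sos clr");
    }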

It works, but the status of the toggle button isn’t consistent: if we press it twice, then it won’t be toggled anymore, even though SOS is still loaded. Additionally, if the developer manually loads SOS out of habit, it would be nice if the button was automatically updated.

 

Listening to commands

To do so, we’re going to implement the IDbgCommandExecutionListener interface in the viewmodel. This interface only defines the OnCommandExecuted method that is called, as the name implies, when a command is executed. If the command is “.loadby sos clr”, we update the IsLoaded property, bound to the toggle button. We also need to implement INotifyPropertyChanged to make sure that the view picks the changes.
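A simplified sketch (the exact parameters of OnCommandExecuted are undocumented, so the single string argument is an assumption):

    private bool _isLoaded;

    public bool IsLoaded
    {
        get { return _isLoaded; }
        set
        {
            _isLoaded = value;
            PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(nameof(IsLoaded)));
        }
    }

    public event PropertyChangedEventHandler PropertyChanged;

    public void OnCommandExecuted(string command)
    {
        if (command?.Trim() == ".loadby sos clr")
        {
            IsLoaded = true;   // keeps the toggle button in sync, whoever issued the command
        }
    }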

Note that we don’t have to set the IsLoaded property in the LoadSos method, because OnCommandExecuted isn’t limited to commands typed by the user, and will be raised in response to the call to _console.ExecuteCommandAsync. Basically, we’re listening to all commands received by the debugging console, either typed by the user in the UI or silently sent by any extensions, including our own actions.

This is looking good so far, but there is still one issue to fix. If we press the button before a debugging session is started, the command will fail (an exception will actually be thrown, and caught internally by WinDbg), and we will still mark SOS as loaded. It’d be great if we could activate the button only after a debugging session has started.

 

Monitoring the debugging engine

For that, we need to implement IDbgEngineConnectionListener. To know the status of the debugging engine at a given time, we’re also going to import IDbgEngineControl and use its ConnectionState property. Our MEF imports/exports now look like:
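Roughly (sketch):

    [Export(typeof(IDbgRibbonTabGroupExtension))]
    [Export(typeof(IDbgCommandExecutionListener))]
    [Export(typeof(IDbgEngineConnectionListener))]
    [RibbonTabGroupExtensionMetadata("HomeRibbonTab", "Help", 0)]
    public class SosLoaderViewModel : IDbgRibbonTabGroupExtension,
                                      IDbgCommandExecutionListener,
                                      IDbgEngineConnectionListener,
                                      INotifyPropertyChanged
    {
        [Import]
        public IDbgConsole DbgConsole { get; set; }

        [Import]
        public IDbgEngineControl EngineControl { get; set; }

        // ...
    }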

When creating the AsyncDelegateCommand, we now also use the second parameter of the constructor, which is the CanExecute callback of the command. Inside, we check the status of the engine, and whether we previously loaded SOS:
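A sketch (the exact name of the connection state enum and its values is an assumption):

    // In the constructor:
    LoadSosCommand = new AsyncDelegateCommand(LoadSos, CanLoadSos);

    private bool CanLoadSos()
    {
        // Only allow loading SOS once a debugging session is connected, and only once
        return EngineControl.ConnectionState == EngineConnectionState.Connected && !IsLoaded;
    }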

Last but not least, we need to invalidate the status of the command when OnEngineConnectionChanged is called (inherited from IDbgEngineConnectionListener):
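For instance (again assuming the exact signature):

    public void OnEngineConnectionChanged()
    {
        // Force WPF to re-query CanLoadSos and enable/disable the button accordingly
        LoadSosCommand.RaiseCanExecuteChanged();
    }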

And we’re done! The full file can be seen here.

 

In this article, we’ve seen how we can extend the WinDbg UI and interact with it in just a few lines of code. Of course, the use-case was a bit naïve and not that useful, but we’ll see in the next articles how to implement much more powerful features that can provide real productivity boosts when debugging applications.

 

Post written by:


 

 

 

 

 

Christophe Nasarre

Staff Software Engineer, R&D.

 

 

 

 

 

Kevin Gosse

Staff Software Engineer, R&D.

Twitter: KooKiz

 

The post Extending the new WinDbg, Part 1 – Buttons and commands appeared first on Criteo Labs.
